FoLiA: Format for Linguistic Annotation



Maarten van Gompel Radboud University Nijmegen 20-01-2012

Introduction

What is FoLiA?
A generalised XML-based format for a wide variety of linguistic annotation.

Characteristics:
- Generalised paradigm: a single universal paradigm applicable to all kinds of annotations, with as few ad-hoc provisions as possible. Not committed to any label set.
- Extensible: unsupported annotation types can be added fairly easily.
- Expressive: verbose expression of annotations, their annotators, timestamps, etc. Alternative annotations are supported as well.
- Formalised: validation on two levels, shallow and deep. Deep validation checks the label set used and allows for links with, for instance, ISOcat.

Features

Intended applications:
- as a corpus storage format
- as a language resource exchange format

Properties:
- One document, one text, one XML file containing all annotations.
- Annotation types and label sets must be declared in the document header.
- Document metadata can either be included in the file (limited) or be referenced externally as CMDI or IMDI (preferred).

Motivation & Dissemination

Why (yet) another format?
- Many ad-hoc and legacy annotation formats (CGN, Tadpole column format)
- Many theoretical and specialised annotation formats with limited scope (LAF, SynAF, MAF, TEI)
- Bottom-up rather than top-down development: FoLiA arose from practical need and was developed from the start alongside practical programming libraries and applications.
- De facto standard: D-COI XML

Dissemination:
- SoNaR
- TTNWW
- DutchSemCor
- Valkuil.net
- Frog & Ucto

Paradigm: Annotation Categories

Four categories of annotation:
- Structure Annotation: elements denoting document structure. E.g. Divisions, Headers, Paragraphs, Sentences, Lists, Figures, Gaps, Quotes
- Token Annotation: linguistic annotations pertaining to a single token (inline annotation). E.g. Part-of-Speech Annotation, Lemma Annotation, Lexical Semantic Sense Annotation
- Span Annotation: linguistic annotations spanning multiple tokens (standoff annotation). E.g. Syntactic Parses, Dependency Relations, Entities/Multi-word Units
- Subtoken Annotation: linguistic annotations pertaining to a subpart of a token (standoff annotation). E.g. Morphology

Attributes

Document skeleton

Format:

    <?xml version="1.0" encoding="utf-8"?>
    <FoLiA xmlns="http://ilk.uvt.nl/FoLiA"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xml:id="example">
      <metadata type="cmdi" src="example.cmdi">
        <annotations>...</annotations>
      </metadata>
      <text xml:id="example.text">...</text>
    </FoLiA>
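As a rough illustration of the "standard XML facilities" route, the skeleton can be read with Python's standard xml.etree.ElementTree module. This is only a sketch: the function name describe and the returned dictionary keys are made up for this example, and the namespace URI is copied from the skeleton as shown.

```python
import xml.etree.ElementTree as ET

# Namespace URI as written in the skeleton; xml:id lives in the fixed XML namespace.
NS = {"f": "http://ilk.uvt.nl/FoLiA"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SKELETON = b"""<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/FoLiA"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xml:id="example">
  <metadata type="cmdi" src="example.cmdi">
    <annotations>...</annotations>
  </metadata>
  <text xml:id="example.text">...</text>
</FoLiA>"""


def describe(document: bytes) -> dict:
    """Collect the identifiers and metadata reference of a FoLiA document.

    Hypothetical helper, not part of any FoLiA library.
    """
    root = ET.fromstring(document)
    metadata = root.find("f:metadata", NS)
    text = root.find("f:text", NS)
    return {
        "document_id": root.get(XML_ID),
        "metadata_type": metadata.get("type"),
        "metadata_src": metadata.get("src"),
        "text_id": text.get(XML_ID),
    }


info = describe(SKELETON)
print(info)
```

Because FoLiA uses a default namespace, every element lookup has to be namespace-qualified; forgetting this is the most common reason a query returns nothing.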

Structure Annotation

Example:

    <p xml:id="TEST.p.1">
      <t>This is a test. It has two sentences.</t>
      <s xml:id="TEST.p.1.s.1">
        <t>This is a test.</t>
        <w xml:id="TEST.p.1.s.1.w.1"><t>This</t></w>
        <w xml:id="TEST.p.1.s.1.w.2"><t>is</t></w>
        ..
      </s>
      <s xml:id="TEST.p.1.s.2">
        ...
      </s>
    </p>

Characteristics of basic structure:
1. Structure elements: Paragraphs, Sentences, Words/Tokens
2. More: Division, Head, List, ListItem, Figure, Gap...
3. Unique identifiers
4. The text content element (t) holds the actual text.
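The nesting of paragraphs, sentences and words can be walked with a few lines of Python. The following sketch uses only the standard library; the function name sentences is invented here, the text of the second sentence is filled in for illustration, and the elided tokens are simply left out.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/FoLiA"}  # namespace as given in the document skeleton
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Paragraph from the example; tokens elided in the slide are omitted here.
PARAGRAPH = """
<p xmlns="http://ilk.uvt.nl/FoLiA" xml:id="TEST.p.1">
  <t>This is a test. It has two sentences.</t>
  <s xml:id="TEST.p.1.s.1">
    <t>This is a test.</t>
    <w xml:id="TEST.p.1.s.1.w.1"><t>This</t></w>
    <w xml:id="TEST.p.1.s.1.w.2"><t>is</t></w>
  </s>
  <s xml:id="TEST.p.1.s.2">
    <t>It has two sentences.</t>
  </s>
</p>
"""


def sentences(paragraph):
    """Return (sentence id, sentence text, token texts) for each sentence."""
    result = []
    for s in paragraph.findall("f:s", NS):
        # "f:t" only matches direct children, so token-level <t> is not picked up.
        text = s.findtext("f:t", default="", namespaces=NS)
        tokens = [w.findtext("f:t", default="", namespaces=NS)
                  for w in s.findall("f:w", NS)]
        result.append((s.get(XML_ID), text, tokens))
    return result


rows = sentences(ET.fromstring(PARAGRAPH))
for row in rows:
    print(row)
```

Note that text content is available at every structural level, so a consumer that only needs sentence text does not have to reassemble it from tokens.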

Token Annotation

Token annotation occurs within the scope of a word/token (w) element.

Example of PoS, lemma and sense annotation:

    <w xml:id="example.p.1.s.1.w.2">
      <t>boot</t>
      <pos set="cgn" class="n" annotator="Maarten van Gompel"
           annotatortype="manual" />
      <lemma set="english-lemmas" class="boot" />
      <sense set="cornetto" class="D_N12345" annotator="supwsd1"
             annotatortype="auto" confidence="0.65" />
    </w>
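Since every token annotation follows the same set/class/annotator pattern, one generic routine can read all of them. A minimal sketch, assuming the namespace from the document skeleton; token_annotations and local_name are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

WORD = """
<w xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1.w.2">
  <t>boot</t>
  <pos set="cgn" class="n" annotator="Maarten van Gompel"
       annotatortype="manual" />
  <lemma set="english-lemmas" class="boot" />
  <sense set="cornetto" class="D_N12345" annotator="supwsd1"
         annotatortype="auto" confidence="0.65" />
</w>
"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def token_annotations(word):
    """Map each annotation type on a word to its attribute dictionary."""
    annotations = {}
    for child in word:
        name = local_name(child)
        if name == "t":  # text content, not an annotation
            continue
        annotations[name] = dict(child.attrib)
    return annotations


annotations = token_annotations(ET.fromstring(WORD))
for name, attrs in annotations.items():
    print(name, attrs["set"], attrs["class"], attrs.get("confidence"))
```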

Token Annotations with subsets

For all annotation types, subsets can be used for more refined annotations.

Example:

    <w xml:id="example.p.1.s.1.w.2">
      <t>boot</t>
      <pos set="cgn" class="n(soort,ev,basis,zijd,stan)">
        <feat subset="head" class="n" />
        <feat subset="ntype" class="soort" />
        <feat subset="number" class="ev" />
        <feat subset="degree" class="basis" />
        <feat subset="gender" class="zijd" />
        <feat subset="case" class="stan" />
      </pos>
    </w>
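The feat elements make the individual components of a complex tag directly queryable, without parsing the class string. A small sketch under the same namespace assumption as the skeleton; the helper name features is invented for this example.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/FoLiA"}

WORD = """
<w xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1.w.2">
  <t>boot</t>
  <pos set="cgn" class="n(soort,ev,basis,zijd,stan)">
    <feat subset="head" class="n" />
    <feat subset="ntype" class="soort" />
    <feat subset="number" class="ev" />
    <feat subset="degree" class="basis" />
    <feat subset="gender" class="zijd" />
    <feat subset="case" class="stan" />
  </pos>
</w>
"""


def features(annotation):
    """Return {subset: class} for the feat children of one annotation."""
    return {feat.get("subset"): feat.get("class")
            for feat in annotation.findall("f:feat", NS)}


pos = ET.fromstring(WORD).find("f:pos", NS)
feats = features(pos)
print(pos.get("class"), feats)
```

A query such as "all singular nouns" then becomes a dictionary lookup on number and head rather than a pattern match on the full tag.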

Span Annotation

- Token annotation is not sufficient: some annotations span multiple tokens (not necessarily consecutive).
- Spanning multiple tokens inline can produce nesting problems (e.g. A(BC)D and AB(CD)).
- Solution: span annotation using standoff notation.

Properties:
- Applications: Syntactic Parses, Chunking, Dependency Relations, Entities/Multi-Word Units
- Layers: each type of span annotation is placed within an annotation layer. Annotation layers are usually embedded within sentences (s).
- Same paradigm: set, class, annotator, confidence, etc.

Span Annotation

    <s xml:id="example.p.1.s.1">
      <t>The Dalai Lama greeted him.</t>
      <w xml:id="example.p.1.s.1.w.1"><t>The</t></w>
      <w xml:id="example.p.1.s.1.w.2"><t>Dalai</t></w>
      <w xml:id="example.p.1.s.1.w.3"><t>Lama</t></w>
      <w xml:id="example.p.1.s.1.w.4"><t>greeted</t></w>
      <w xml:id="example.p.1.s.1.w.5"><t>him</t></w>
      <w xml:id="example.p.1.s.1.w.6"><t>.</t></w>
      <entities>
        <entity xml:id="example.p.1.s.1.entity.1" class="person">
          <wref id="example.p.1.s.1.w.2" />
          <wref id="example.p.1.s.1.w.3" />
        </entity>
      </entities>
    </s>
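Standoff notation means a consumer has to resolve the word references back to the tokens. A minimal sketch of that lookup follows. It assumes the references carry the target identifier in a plain id attribute, so that every xml:id in the snippet stays unique; resolve_entities is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/FoLiA"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SENTENCE = """
<s xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1">
  <t>The Dalai Lama greeted him.</t>
  <w xml:id="example.p.1.s.1.w.1"><t>The</t></w>
  <w xml:id="example.p.1.s.1.w.2"><t>Dalai</t></w>
  <w xml:id="example.p.1.s.1.w.3"><t>Lama</t></w>
  <w xml:id="example.p.1.s.1.w.4"><t>greeted</t></w>
  <w xml:id="example.p.1.s.1.w.5"><t>him</t></w>
  <w xml:id="example.p.1.s.1.w.6"><t>.</t></w>
  <entities>
    <entity xml:id="example.p.1.s.1.entity.1" class="person">
      <wref id="example.p.1.s.1.w.2" />
      <wref id="example.p.1.s.1.w.3" />
    </entity>
  </entities>
</s>
"""


def resolve_entities(sentence):
    """Return (entity id, class, covered text) for each entity in the layer."""
    # Index the tokens by identifier so each reference is a dictionary lookup.
    words = {w.get(XML_ID): w.findtext("f:t", default="", namespaces=NS)
             for w in sentence.findall("f:w", NS)}
    resolved = []
    for entity in sentence.findall("f:entities/f:entity", NS):
        tokens = [words[ref.get("id")] for ref in entity.findall("f:wref", NS)]
        resolved.append((entity.get(XML_ID), entity.get("class"), " ".join(tokens)))
    return resolved


entities = resolve_entities(ET.fromstring(SENTENCE))
print(entities)
```

Because the span only lists identifiers, the referenced tokens do not need to be adjacent, which is what sidesteps the nesting problem of inline markup.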

Span Annotation

    <syntax>
      <su xml:id="example.p.1.s.1.su.1" class="s">
        <su xml:id="example.p.1.s.1.su.1_1" class="np">
          <su xml:id="example.p.1.s.1.su.1_1_1" class="det">
            <wref id="example.p.1.s.1.w.1" />
          </su>
          <su xml:id="example.p.1.s.1.su.1_1_2" class="pn">
            <wref id="example.p.1.s.1.w.2" />
            <wref id="example.p.1.s.1.w.3" />
          </su>
        </su>
        <su xml:id="example.p.1.s.1.su.1_2" class="vp">
          <su xml:id="example.p.1.s.1.su.1_2_1" class="v">
            <wref id="example.p.1.s.1.w.4" />
          </su>
          <su xml:id="example.p.1.s.1.su.1_2_2" class="pron">
            <wref id="example.p.1.s.1.w.5" />
          </su>
        </su>
      </su>
    </syntax>
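Nested syntactic units map naturally onto a recursive traversal. The sketch below renders the syntax layer as a bracketed string. It is an illustration only: the function name bracket is invented, the word references are assumed to use a plain id attribute, and the final punctuation token is left out of the tree as in the example.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/FoLiA"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SENTENCE = """
<s xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1">
  <w xml:id="example.p.1.s.1.w.1"><t>The</t></w>
  <w xml:id="example.p.1.s.1.w.2"><t>Dalai</t></w>
  <w xml:id="example.p.1.s.1.w.3"><t>Lama</t></w>
  <w xml:id="example.p.1.s.1.w.4"><t>greeted</t></w>
  <w xml:id="example.p.1.s.1.w.5"><t>him</t></w>
  <w xml:id="example.p.1.s.1.w.6"><t>.</t></w>
  <syntax>
    <su xml:id="example.p.1.s.1.su.1" class="s">
      <su xml:id="example.p.1.s.1.su.1_1" class="np">
        <su xml:id="example.p.1.s.1.su.1_1_1" class="det">
          <wref id="example.p.1.s.1.w.1" />
        </su>
        <su xml:id="example.p.1.s.1.su.1_1_2" class="pn">
          <wref id="example.p.1.s.1.w.2" />
          <wref id="example.p.1.s.1.w.3" />
        </su>
      </su>
      <su xml:id="example.p.1.s.1.su.1_2" class="vp">
        <su xml:id="example.p.1.s.1.su.1_2_1" class="v">
          <wref id="example.p.1.s.1.w.4" />
        </su>
        <su xml:id="example.p.1.s.1.su.1_2_2" class="pron">
          <wref id="example.p.1.s.1.w.5" />
        </su>
      </su>
    </su>
  </syntax>
</s>
"""


def bracket(node, words):
    """Render a syntactic unit (su) and its descendants as '(class child ...)'."""
    name = node.tag.rsplit("}", 1)[-1]
    if name == "wref":
        return words[node.get("id")]  # leaf: look up the referenced token text
    parts = [bracket(child, words) for child in node]
    return "(" + node.get("class") + " " + " ".join(parts) + ")"


sentence = ET.fromstring(SENTENCE)
words = {w.get(XML_ID): w.findtext("f:t", default="", namespaces=NS)
         for w in sentence.findall("f:w", NS)}
tree = bracket(sentence.find("f:syntax/f:su", NS), words)
print(tree)
```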

Tools for working with FoLiA

- Standard XML facilities: XSLT, XPath
- Python library: pynlpl.formats.folia
- C++ library: libfolia (Ko van der Sloot)

Applications:
- Frog tagger/lemmatiser/parser suite: FoLiA output (input at a later stage).
- ucto tokeniser: FoLiA input and output.

Converters:
- D-COI to FoLiA
- FoLiA to CSV (limited)
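To give an impression of what a limited FoLiA-to-CSV conversion involves, here is a standard-library sketch. It is not the converter listed above: the column choice, the function names to_csv and class_of, and the two-token sample with its class labels are all assumptions made for this illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/FoLiA"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical two-token sentence; the class labels are illustrative only.
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1">
  <w xml:id="example.p.1.s.1.w.1">
    <t>The</t>
    <pos class="det" />
    <lemma class="the" />
  </w>
  <w xml:id="example.p.1.s.1.w.2">
    <t>boot</t>
    <pos class="n" />
    <lemma class="boot" />
  </w>
</s>
"""


def class_of(word, annotation):
    """Return the class of the given annotation type, or '' if it is absent."""
    element = word.find("f:" + annotation, NS)
    return element.get("class", "") if element is not None else ""


def to_csv(sentence):
    """Flatten the tokens of a sentence into CSV rows: id, text, pos, lemma."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "text", "pos", "lemma"])
    for word in sentence.findall("f:w", NS):
        writer.writerow([
            word.get(XML_ID),
            word.findtext("f:t", default="", namespaces=NS),
            class_of(word, "pos"),
            class_of(word, "lemma"),
        ])
    return buffer.getvalue()


table = to_csv(ET.fromstring(SENTENCE))
print(table, end="")
```

The flattening shows why such a conversion is necessarily limited: sets, annotators, alternatives and span annotations have no place in a one-row-per-token table.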

Conclusion

- Uniformity: generic framework with a simple paradigm, XML-based
- Expressiveness: ability to encode many kinds of linguistic annotation, including structural annotation, alternatives, and corrections
- Extensibility: easy to add new annotations with the same paradigm
- A variety of tools and converters is already available!

URLs:
http://ilk.uvt.nl/folia
http://github.com/proycon/folia

Questions?

Trade-off: Expressivity versus Computing Efficiency

- FoLiA aims at expressivity rather than computing efficiency.
- XML and FoLiA overhead: not ideal for real-time or resource-constrained applications.
- Where efficiency matters, documents can be converted to less expressive but more efficient formats.

Supported Annotations (1/2)

FoLiA supports the following linguistic annotations:
- Part-of-Speech tags (with features)
- Lemmatisation
- Domain tagging
- Lexical semantic sense annotation (used in DutchSemCor)
- Named Entities / Multi-word units (used in SoNaR)
- Syntactic Parses
- Dependency Relations

Supported Annotations (2/2)

FoLiA supports the following linguistic annotations:
- Chunking
- Corrections (used in valkuil.net)
- Morphology
- Event/Time annotation
- Phonetic annotation

Declarations

All annotations need to be declared in the metadata. Default sets and annotators may be predefined at this level.

Example:

    <metadata>
      <annotations>
        <token-annotation />
        <pos-annotation set="brown" annotator="Maarten van Gompel"
                        annotatortype="manual" />
        <lemma-annotation />
      </annotations>
    </metadata>
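One way to picture how declared defaults work is to merge them into annotations that leave attributes out. The sketch below does this with plain dictionaries. The helper names declarations and effective are hypothetical, and the merge rule (attributes on the annotation override the declared defaults) is an assumption about the intended behaviour.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/FoLiA"}
SUFFIX = "-annotation"

METADATA = """
<metadata xmlns="http://ilk.uvt.nl/FoLiA">
  <annotations>
    <token-annotation />
    <pos-annotation set="brown" annotator="Maarten van Gompel"
                    annotatortype="manual" />
    <lemma-annotation />
  </annotations>
</metadata>
"""


def declarations(metadata):
    """Map each declared annotation type to its default attributes."""
    declared = {}
    for decl in metadata.find("f:annotations", NS):
        name = decl.tag.rsplit("}", 1)[-1]
        if name.endswith(SUFFIX):
            name = name[:-len(SUFFIX)]  # "pos-annotation" -> "pos"
        declared[name] = dict(decl.attrib)
    return declared


def effective(annotation_type, attrs, declared):
    """Fill in attributes missing on an annotation from its declaration."""
    if annotation_type not in declared:
        raise ValueError("undeclared annotation type: " + annotation_type)
    merged = dict(declared[annotation_type])
    merged.update(attrs)  # explicit attributes win over declared defaults
    return merged


declared = declarations(ET.fromstring(METADATA))
resolved = effective("pos", {"class": "jj"}, declared)
print(declared)
print(resolved)
```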

Alternative Token Annotations

Annotations of the same type but from different sets need not be alternatives:

    <w xml:id="example.p.1.s.1.w.2">
      <t>luid</t>
      <pos set="brown" class="jj" />
      <pos set="cgn" class="adj" />
    </w>

However, a token may carry only one annotation per type and set. The following is illegal and requires the use of alternatives instead:

    <w xml:id="example.p.1.s.1.w.2">
      <t>luid</t>
      <pos set="cgn" class="adj" />
      <pos set="cgn" class="adv" />
    </w>
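The one-annotation-per-type-and-set rule is easy to check mechanically. The following sketch counts (type, set) pairs among the direct children of a word. It is not the FoLiA validator: set_conflicts is a made-up name, and only this single rule is checked.

```python
import xml.etree.ElementTree as ET
from collections import Counter

LEGAL = """
<w xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1.w.2">
  <t>luid</t>
  <pos set="brown" class="jj" />
  <pos set="cgn" class="adj" />
</w>
"""

ILLEGAL = """
<w xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1.w.2">
  <t>luid</t>
  <pos set="cgn" class="adj" />
  <pos set="cgn" class="adv" />
</w>
"""


def set_conflicts(word):
    """Return the (annotation type, set) pairs that occur more than once."""
    counts = Counter()
    for child in word:
        name = child.tag.rsplit("}", 1)[-1]
        if name in ("t", "alt"):  # text and alternatives are not counted
            continue
        counts[(name, child.get("set", ""))] += 1
    return sorted(pair for pair, n in counts.items() if n > 1)


legal_conflicts = set_conflicts(ET.fromstring(LEGAL))
illegal_conflicts = set_conflicts(ET.fromstring(ILLEGAL))
print(legal_conflicts, illegal_conflicts)
```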

Alternative Token Annotations

The alt element encodes mutually exclusive alternative annotations. Any annotations that are not alternatives are considered selected.

    <w xml:id="example.p.1.s.1.w.2">
      <t>bank</t>
      <sense set="wordnet3.0" class="bank%1:17:01:"
             annotator="Maarten van Gompel" annotatortype="manual"
             confidence="0.8">sloping ground near water</sense>
      <alt xml:id="example.p.1.s.1.w.2.alt.1">
        <sense set="wordnet3.0" class="bank%1:14:01:"
               annotator="WSDsystem" annotatortype="auto"
               confidence="0.6">financial institution</sense>
      </alt>
    </w>
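A consumer can tell the selected sense from the alternatives purely by position: selected annotations are direct children of the word, alternatives sit inside alt. A minimal sketch, with sense_candidates and the dictionary keys invented for this example:

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/FoLiA"}

WORD = """
<w xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1.w.2">
  <t>bank</t>
  <sense set="wordnet3.0" class="bank%1:17:01:"
         annotator="Maarten van Gompel" annotatortype="manual"
         confidence="0.8">sloping ground near water</sense>
  <alt xml:id="example.p.1.s.1.w.2.alt.1">
    <sense set="wordnet3.0" class="bank%1:14:01:"
           annotator="WSDsystem" annotatortype="auto"
           confidence="0.6">financial institution</sense>
  </alt>
</w>
"""


def _describe(sense, selected):
    return {
        "class": sense.get("class"),
        "gloss": (sense.text or "").strip(),
        "confidence": float(sense.get("confidence")),
        "selected": selected,
    }


def sense_candidates(word):
    """List the selected sense(s) first, followed by the alternatives."""
    candidates = [_describe(s, True) for s in word.findall("f:sense", NS)]
    candidates += [_describe(s, False) for s in word.findall("f:alt/f:sense", NS)]
    return candidates


candidates = sense_candidates(ET.fromstring(WORD))
for candidate in candidates:
    print(candidate)
```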

Alternative Token Annotations

All token annotations grouped within one alternative are considered dependent. Multiple alternatives are always independent of each other:

    <w xml:id="example.p.1.s.1.w.2">
      <t>vlieg</t>
      <pos class="n" />
      <lemma class="vlieg" />
      <alt xml:id="example.p.1.s.1.w.2.alt.1">
        <pos class="v" />
        <lemma class="vliegen" />
      </alt>
    </w>
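The dependency between annotations inside one alternative can be made explicit by treating each group as a single "reading" of the token. The sketch below collects the selected reading and the alternative readings as dictionaries; readings is a hypothetical helper name and the namespace is the one from the document skeleton.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/FoLiA"}

WORD = """
<w xmlns="http://ilk.uvt.nl/FoLiA" xml:id="example.p.1.s.1.w.2">
  <t>vlieg</t>
  <pos class="n" />
  <lemma class="vlieg" />
  <alt xml:id="example.p.1.s.1.w.2.alt.1">
    <pos class="v" />
    <lemma class="vliegen" />
  </alt>
</w>
"""


def _collect(parent):
    """Map annotation type to class for the direct children of parent."""
    reading = {}
    for child in parent:
        name = child.tag.rsplit("}", 1)[-1]
        if name not in ("t", "alt"):
            reading[name] = child.get("class")
    return reading


def readings(word):
    """Return (selected reading, list of alternative readings)."""
    selected = _collect(word)
    alternatives = [_collect(alt) for alt in word.findall("f:alt", NS)]
    return selected, alternatives


selected, alternatives = readings(ET.fromstring(WORD))
print(selected, alternatives)
```

Keeping each alternative as one dictionary preserves the pairing: the verb tag belongs with the lemma vliegen and is never combined with the lemma vlieg.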