FoLiA: Format for Linguistic Annotation
|
|
|
- Sherilyn Mathews
- 9 years ago
- Views:
Transcription
1 Maarten van Gompel Radboud University Nijmegen
2 Introduction Introduction What is FoLiA? Generalised XML-based format for a wide variety of linguistic annotation Characteristics Generalised paradigm Single universal paradigm applicable to all kinds of annotations; as few ad-hoc provisions as possible. Not commited to any label set. Extensible Unsupported annotation types can be added fairly easily. Expressive Verbose expression of annotations, their annotators, timestamps, etc... Moreover, support for alternative annotations. Formalised Validation on two levels: shallow and deep. The latter validates the used label set and allows for links with for instance ISOcat.
3 Features Intended Applications as a corpus storage format as a language resource exchange format Properties One document, one text, one XML file containing all annotations. Annotation types and label sets must be declared in the document header Document metadata can be either included in the file (limited), or by reference to external CMDI or IMDI (preferred)
4 Motivation & Dissemination Why (yet) another format? Many ad-hoc and legacy annotation formats (CGN, Tadpole column format) Many theoretic and specialised annotation formats with limited scope (LAF, SynAF, MAF, TEI) Bottom-up rather than top-down development: FoLiA arose from practical need, immediately developed alongside practical programming libraries and applications. De-facto-standard: D-COI XML Dissemination SoNaR TTNWW DutchSemCor Valkuil.net Frog & Ucto
5 Annotation Categories Paradigm Paradigm: Annotation Categories Four categories of annotation: Structure Annotation - Elements denoting document structure E.g: Divisions, Header, Paragraphs, Sentences, Lists, Figures, Gaps, Quote Token Annotation - Linguistic Annotations pertaining to a single token (inline annotation) E.g: Part of Speech Annotation, Lemma Annotation, Lexical Semantic Sense Annotation Span Annotation - Linguistic Annotations spanning over multiple tokens (standoff annotation) E.g: Syntactic Parses, Dependency Relations, Entities/Multi-word Units Subtoken Annotation - Linguistic Annotations pertaining to a subpart of a token (standoff annotation) E.g: Morphology
6 Attributes
7 Document skeleton Format <? xml v e r s i o n=" 1.0 " e n c o d i n g="utf -8"?> <FoLiA xmlns=" http: // ilk. uvt.nl/ FoLiA " x m l n s : x s i=" http: // /2001/ XMLSchema - instance " x m l : i d=" example "> <metadata t y p e=" cmdi " s r c=" example. cmdi "> <a n n o t a t i o n s>... </ a n n o t a t i o n s> </ metadata> <t e x t x m l : i d=" example. text ">... </ t e x t> </ FoLiA>
8 Structure Annotation Example <p x m l : i d=" TEST.p.1"> <t>this i s a t e s t. I t has two s e n t e n c e s.</ t> <s x m l : i d=" TEST.p.1. s.1"> <t>this i s a t e s t.</ t> <w x m l : i d=" TEST.p.1. s.1. w.1"><t>this</ t></w> <w x m l : i d=" TEST.p.1. s.1. w.2"><t> i s</ t></w>.. </ s><s x m l : i d=" TEST.p.1. s.2">... </ s> </p> Characteristics of basic structure 1 Structure Elements: Paragraphs, Sentences, Words/Tokens 2 More: Division, Head, List, ListItem, Figure, Gap... 3 Unique identifiers 4 Text content element (t) holds actual text.
9 Structure Annotation Token Annotation Token annotation occurs within the scope of a word/token (w) element. Example PoS and Lemma Annotation: <w x m l : i d=" example.p.1. s.1. w.2"> <t>boot</ t> <pos s e t=" cgn " c l a s s="n" a n n o t a t o r=" Maarten van Gompel " a n n o t a t o r t y p e=" manual " /> <lemma s e t=" english - lemmas " c l a s s=" boot " /> <s e n s e s e t=" cornetto " c l a s s=" D_N12345 " a n n o t a t o r=" supwsd1 " a n n o t a t o r t y p e=" auto " c o n f i d e n c e=" 0.65 " /> </w>
10 Structure Annotation Token Annotations with subsets For all annotation types; subsets can be used for more refined annotations. Example <w x m l : i d=" example.p.1. s.1. w.2"> <t>boot</ t> <pos s e t=" cgn " c l a s s="n( soort,ev, basis,zijn, stan )"> <f e a t s u b s e t=" head " c l a s s="n" /> <f e a t s u b s e t=" ntype " c l a s s=" soort " /> <f e a t s u b s e t=" number " c l a s s="ev" /> <f e a t s u b s e t=" degree " c l a s s=" basis " /> <f e a t s u b s e t=" gender " c l a s s=" zijd " /> <f e a t s u b s e t=" case " c l a s s=" stan " /> </ pos> </w>
11 Span Annotation Span Annotation Token Annotation is not sufficient, some annotations span over multiple tokens (not necessarily consecutive) Spanning multiple tokens can produce nesting problems (e.g A(BC)D and AB(CD)) Solution: Span Annotation using standoff notation Properties Applications: Syntactic Parses, Chunking, Dependency Relations, Entities/Multi-Word Units Layers: Each type of span annotation is placed within an annotation layer, annotation layers are usually embedded within sentences (s)) Same paradigm: Set, class, annotator, confidence, etc...
12 Span Annotation <s x m l : i d=" example.p.1. s.1"> <t>the D a l a i Lama g r e e t e d him.</ t> <w x m l : i d=" example.p.1. s.1. w.1"><t>the</ t></w> <w x m l : i d=" example.p.1. s.1. w.2"><t>d a l a i</ t></w> <w x m l : i d=" example.p.1. s.1. w.3"><t>lama</ t></w> <w x m l : i d=" example.p.1. s.1. w.4"><t>g r e e t e d</ t></w> <w x m l : i d=" example.p.1. s.1. w.5"><t>him</ t></w> <w x m l : i d=" example.p.1. s.1. w.6"><t>.</ t></w> < e n t i t i e s> <e n t i t y x m l : i d=" example.p.1. s.1. entity.1" c l a s s=" person "> <w r e f x m l : i d=" example.p.1. s.1. w.2" /> <w r e f x m l : i d=" example.p.1. s.1. w.3" /> </ e n t i t y> </ e n t i t i e s> </ s>
13 Span Annotation <s y n t a x> <su x m l : i d=" example.p.1. s.1. su.1" c l a s s="s"> <su x m l : i d=" example.p.1. s.1. su.1 _1" c l a s s="np"> <su x m l : i d=" example.p.1. s.1. su.1 _1_1 " c l a s s=" det "> <w r e f x m l : i d=" example.p.1. s.1. w.1" /> </ su> <su x m l : i d=" example.p.1. s.1. su.1 _1_2 " c l a s s="pn"> <w r e f x m l : i d=" example.p.1. s.1. w.2" /> <w r e f x m l : i d=" example.p.1. s.1. w.3" /> </ su> </ su> </ su> <su x m l : i d=" example.p.1. s.1. su.1 _2" c l a s s="vp"> <su x m l : i d=" example.p.1. s.1. su.1 _1_1 " c l a s s="v"> <w r e f x m l : i d=" example.p.1. s.1. w.4" /> </ su> <su x m l : i d=" example.p.1. s.1. su.1 _1_2 " c l a s s=" pron "> <w r e f x m l : i d=" example.p.1. s.1. w.5" /> </ su> </ su> </ su> </ s y n t a x>
14 Tools for working with FoLiA Standard XML facilities: XSLT, XPath Python library: pynlpl.formats.folia C++ library: libfolia (Ko van der Sloot) Applications Frog tagger/lemmatisaion/parser suite: FoLiA output (input in later stage). ucto tokeniser: FoLiA input and output. Converters DCOI FoLiA FoLiA CSV (limited)
15 Conclusion URLs Uniformity: generic framework with simple paradigm, XML based Expressiveness: Ability to encode many kinds of linguistic annotation, including structural annotation, alternatives, and corrections Extensibility: easy to add new annotations with the same paradigm A variety of tools and converters already available!
16 Questions?
17 Trade-off Trade-off: Expressivity versus Computing Efficiency FoLiA aims at expressivity rather than computing efficiency. XML and FoLiA overhead: Not ideal for real-time or resource-constrained applications Conversion to less expressive, more efficient, formats.
18 Annotations Supported Annotations (1/2) FoLiA supports the following linguistic annotations: Part-of-Speech tags (with features) Lemmatisation Domain tagging Lexical semantic sense annotation (used in DutchSemCor) Named Entities / Multi-word units (used in SoNaR) Syntactic Parses Dependency Relations
19 Annotations Supported Annotations (2/2) FoLiA supports the following linguistic annotations: Chunking Corrections (used in valkuil.net) Morphology Event/Time annotation Phonetic annotation
20 Declarations Token Annotation All annotations need to be declared in the metadata: Example Default sets and annotator may be predefined at this level <metadata> <a n n o t a t i o n s> <token a n n o t a t i o n /> <pos a n n o t a t i o n s e t=" brown " a n n o t a t o r=" Maarten van Gompel " a n n o t a t o r t y p e=" manual "/> <lemma a n n o t a t i o n /> </ a n n o t a t i o n s> </ metadata>
21 Alternatives Alternative Token Annotations Annotations of the same type, but different sets need not be alternatives. <w x m l : i d=" example.p.1. s.1. w.2"> <t>l u i d</ t> <pos s e t=" brown " c l a s s="jj" /> <pos s e t=" cgn " c l a s s=" adj " /> </w> There can be only one of the same set though, this is illegal and requires usage of alternatives instead: <w x m l : i d=" example.p.1. s.1. w.2"> <t>l u i d</ t> <pos s e t=" cgn " c l a s s=" adj " /> <pos s e t=" cgn " c l a s s=" adv " /> </w>
22 Alternatives Alternative Token Annotations Encodes mutually exclusive alternative annotations. Any annotations that are not alternatives are considered selected. <w x m l : i d=" example.p.1. s.1. w.2"> <t>bank</ t> <s e n s e s e t=" wordnet3.0" c l a s s=" bank %1 :17:01: " a n n o t a t o r=" Maarten van Gompel " a n n o t a t o r t y p e=" manual " c o n f i d e n c e=" 0.8 "> s l o p i n g ground near water</ s e n s e> < a l t x m l : i d=" example.p.1. s.1. w.2. alt.1"> <s e n s e s e t=" wordnet3.0" c l a s s=" bank %1 :14:01: " a n n o t a t o r=" WSDsystem " a n n o t a t o r t y p e=" auto " c o n f i d e n c e=" 0.6 "> f i n a n c i a l i n s t i t u t i o n</ s e n s e> </ a l t> </w>
23 Alternatives Alternative Token Annotations All token annotations grouped as one alternative are considered dependent. Multiple alternatives are always independent: <w x m l : i d=" example.p.1. s.1. w.2"> <t>v l i e g</ t> <pos c l a s s="n" /> <lemma c l a s s=" vlieg " /> < a l t x m l : i d=" example.p.1. s.1. w.2. alt.1"> <pos c l a s s="v" /> <lemma c l a s s=" vliegen " /> </ a l t> </w>
Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF
Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF Susanne Haaf & Bryan Jurish Deutsches Textarchiv 1. The Metadata Format CMDI Metadata? Metadata Format? and more Metadata? Metadata Format?
Schema documentation for types1.2.xsd
Generated with oxygen XML Editor Take care of the environment, print only if necessary! 8 february 2011 Table of Contents : ""...........................................................................................................
WebLicht: Web-based LRT services for German
WebLicht: Web-based LRT services for German Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft, University of Tübingen [email protected] Abstract This software
InternetVista Web scenario documentation
InternetVista Web scenario documentation Version 1.2 1 Contents 1. Change History... 3 2. Introduction to Web Scenario... 4 3. XML scenario description... 5 3.1. General scenario structure... 5 3.2. Steps
CLAM: Quickly deploy NLP command-line tools on the web
CLAM: Quickly deploy NLP command-line tools on the web Maarten van Gompel Centre for Language Studies (CLS) Radboud University Nijmegen [email protected] Martin Reynaert CLS, Radboud University Nijmegen
LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH
LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH Gertjan van Noord Deliverable 3-4: Report Annotation of Lassy Small 1 1 Background Lassy Small is the Lassy corpus in which the syntactic annotations
Your boldest wishes concerning online corpora: OpenSoNaR and you
1 Your boldest wishes concerning online corpora: OpenSoNaR and you Martin Reynaert TiCC, Tilburg University and CLST, Radboud Universiteit Nijmegen TiCC Colloquium, Tilburg University. October 16th, 2013
Privilege Escalation via Antivirus Software
Privilege Escalation via Antivirus Software A security vulnerability in the software component McAfee Security Agent, which is part of the antivirus software McAfee VirusScan Enterprise, can be leveraged
XML: extensible Markup Language. Anabel Fraga
XML: extensible Markup Language Anabel Fraga Table of Contents Historic Introduction XML vs. HTML XML Characteristics HTML Document XML Document XML General Rules Well Formed and Valid Documents Elements
Internationalization Tag Set 1.0 A New Standard for Internationalization and Localization of XML
A New Standard for Internationalization and Localization of XML Felix Sasaki World Wide Web Consortium 1 San Jose, This presentation describes a new W3C Recommendation, the Internationalization Tag Set
Chapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
Moving Enterprise Applications into VoiceXML. May 2002
Moving Enterprise Applications into VoiceXML May 2002 ViaFone Overview ViaFone connects mobile employees to to enterprise systems to to improve overall business performance. Enterprise Application Focus;
The CroCo Translation Archive
LINGUISTIC PROPERTIES OF TRANSLATIONS A CORPUS-BASED INVESTIGATION FOR THE LANGUAGE PAIR ENGLISH-GERMAN The CroCo Translation Archive Language Archives: Standards, Creation and Access Mihaela Vela & Silvia
How To Create A Clarin Metadata Infrastructure
Creating & Testing CLARIN Metadata Components Folkert de Vriend (1), Daan Broeder (2), Griet Depoorter (3), Laura van Eerten (3), Dieter van Uytvanck (2) 1) Meertens Institute Joan Muyskenweg 25, Amsterdam,
Extensible Markup Language (XML): Essentials for Climatologists
Extensible Markup Language (XML): Essentials for Climatologists Alexander V. Besprozvannykh CCl OPAG 1 Implementation/Coordination Team The purpose of this material is to give basic knowledge about XML
Special Topics in Computer Science
Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS
CLARIN-NL Third Call: Closed Call
CLARIN-NL Third Call: Closed Call CLARIN-NL launches in its third call a Closed Call for project proposals. This called is only open for researchers who have been explicitly invited to submit a project
The University of Amsterdam s Question Answering System at QA@CLEF 2007
The University of Amsterdam s Question Answering System at QA@CLEF 2007 Valentin Jijkoun, Katja Hofmann, David Ahn, Mahboob Alam Khalid, Joris van Rantwijk, Maarten de Rijke, and Erik Tjong Kim Sang ISLA,
Dutch Parallel Corpus
Dutch Parallel Corpus Lieve Macken [email protected] LT 3, Language and Translation Technology Team Faculty of Applied Language Studies University College Ghent November 29th 2011 Lieve Macken (LT
Page: 1. Merging XML files: a new approach providing intelligent merge of XML data sets
Page: 1 Merging XML files: a new approach providing intelligent merge of XML data sets Robin La Fontaine, Monsell EDM Ltd [email protected] http://www.deltaxml.com Abstract As XML becomes ubiquitous
Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU http://ixa.si.ehu.es
KYOTO () Intelligent Content and Semantics Knowledge Yielding Ontologies for Transition-Based Organization http://www.kyoto-project.eu/ Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU
A Conceptual Framework of Online Natural Language Processing Pipeline Application
A Conceptual Framework of Online Natural Language Processing Pipeline Application Chunqi Shi, Marc Verhagen, James Pustejovsky Brandeis University Waltham, United States {shicq, jamesp, marc}@cs.brandeis.edu
Semistructured data and XML. Institutt for Informatikk INF3100 09.04.2013 Ahmet Soylu
Semistructured data and XML Institutt for Informatikk 1 Unstructured, Structured and Semistructured data Unstructured data e.g., text documents Structured data: data with a rigid and fixed data format
My IC Customizer: Descriptors of Skins and Webapps for third party User Guide
User Guide 8AL 90892 USAA ed01 09/2013 Table of Content 1. About this Document... 3 1.1 Who Should Read This document... 3 1.2 What This Document Tells You... 3 1.3 Terminology and Definitions... 3 2.
An XML Based Data Exchange Model for Power System Studies
ARI The Bulletin of the Istanbul Technical University VOLUME 54, NUMBER 2 Communicated by Sondan Durukanoğlu Feyiz An XML Based Data Exchange Model for Power System Studies Hasan Dağ Department of Electrical
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
Shallow Parsing with Apache UIMA
Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland [email protected] Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic
Novel Data Extraction Language for Structured Log Analysis
Novel Data Extraction Language for Structured Log Analysis P.W.D.C. Jayathilake 99X Technology, Sri Lanka. ABSTRACT This paper presents the implementation of a new log data extraction language. Theoretical
XML Processing and Web Services. Chapter 17
XML Processing and Web Services Chapter 17 Textbook to be published by Pearson Ed 2015 in early Pearson 2014 Fundamentals of http://www.funwebdev.com Web Development Objectives 1 XML Overview 2 XML Processing
GATE Mímir and cloud services. Multi-paradigm indexing and search tool Pay-as-you-go large-scale annotation
GATE Mímir and cloud services Multi-paradigm indexing and search tool Pay-as-you-go large-scale annotation GATE Mímir GATE Mímir is an indexing system for GATE documents. Mímir can index: Text: the original
Exchanger XML Editor - Canonicalization and XML Digital Signatures
Exchanger XML Editor - Canonicalization and XML Digital Signatures Copyright 2005 Cladonia Ltd Table of Contents XML Canonicalization... 2 Inclusive Canonicalization... 2 Inclusive Canonicalization Example...
StreamServe Persuasion SP5 Document Broker Plus
StreamServe Persuasion SP5 Document Broker Plus User Guide Rev A StreamServe Persuasion SP5 Document Broker Plus User Guide Rev A 2001-2010 STREAMSERVE, INC. ALL RIGHTS RESERVED United States patent #7,127,520
Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1
Korpus-Abfrage: Werkzeuge und Sprachen Gastreferat zur Vorlesung Korpuslinguistik mit und für Computerlinguistik Charlotte Merz 3. Dezember 2002 Motivation Lizentiatsarbeit: A Corpus Query Tool for Automatically
Making Content Editable. Create re-usable email templates with total control over the sections you can (and more importantly can't) change.
Making Content Editable Create re-usable email templates with total control over the sections you can (and more importantly can't) change. Single Line Outputs a string you can modify in the
SHared Access Research Ecosystem (SHARE)
SHared Access Research Ecosystem (SHARE) June 7, 2013 DRAFT Association of American Universities (AAU) Association of Public and Land-grant Universities (APLU) Association of Research Libraries (ARL) This
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
Fast track to HTML & CSS 101 (Web Design)
Fast track to HTML & CSS 101 (Web Design) Level: Introduction Duration: 5 Days Time: 9:30 AM - 4:30 PM Cost: 997.00 Overview Fast Track your HTML and CSS Skills HTML and CSS are the very fundamentals of
Dynamic Decision-Making Web Services Using SAS Stored Processes and SAS Business Rules Manager
Paper SAS1787-2015 Dynamic Decision-Making Web Services Using SAS Stored Processes and SAS Business Rules Manager Chris Upton and Lori Small, SAS Institute Inc. ABSTRACT With the latest release of SAS
BASICS OF WEB DESIGN CHAPTER 2 HTML BASICS KEY CONCEPTS COPYRIGHT 2013 TERRY ANN MORRIS, ED.D
BASICS OF WEB DESIGN CHAPTER 2 HTML BASICS KEY CONCEPTS COPYRIGHT 2013 TERRY ANN MORRIS, ED.D 1 LEARNING OUTCOMES Describe the anatomy of a web page Format the body of a web page with block-level elements
Structured vs. unstructured data. Motivation for self describing data. Enter semistructured data. Databases are highly structured
Structured vs. unstructured data 2 Databases are highly structured Semistructured data, XML, DTDs Well known data format: relations and tuples Every tuple conforms to a known schema Data independence?
Integration of Hotel Property Management Systems (HPMS) with Global Internet Reservation Systems
Integration of Hotel Property Management Systems (HPMS) with Global Internet Reservation Systems If company want to be competitive on global market nowadays, it have to be persistent on Internet. If we
CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web
Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web Corey A Harper University of Oregon Libraries Tel: +1 541 346 1854 Fax:+1 541 346 3485 [email protected]
DEPENDENCY PARSING JOAKIM NIVRE
DEPENDENCY PARSING JOAKIM NIVRE Contents 1. Dependency Trees 1 2. Arc-Factored Models 3 3. Online Learning 3 4. Eisner s Algorithm 4 5. Spanning Tree Parsing 6 References 7 A dependency parser analyzes
Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University
Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University 1. Introduction This paper describes research in using the Brill tagger (Brill 94,95) to learn to identify incorrect
10CS73:Web Programming
10CS73:Web Programming Question Bank Fundamentals of Web: 1.What is WWW? 2. What are domain names? Explain domain name conversion with diagram 3.What are the difference between web browser and web server
An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines)
An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines) James Clarke, Vivek Srikumar, Mark Sammons, Dan Roth Department of Computer Science, University of Illinois, Urbana-Champaign.
WIRIS quizzes web services Getting started with PHP and Java
WIRIS quizzes web services Getting started with PHP and Java Document Release: 1.3 2011 march, Maths for More www.wiris.com Summary This document provides client examples for PHP and Java. Contents WIRIS
US Secure Web API. Version 1.6.0
US Secure Web API Version 1.6.0 September 8, 2015 Contents 1 Introduction... 4 1.1 Overview... 4 1.2 Requirements... 4 1.3 Access... 4 1.4 Obtaining a Developer Key and Secret... 5 2 Request Structure...
Brill s rule-based PoS tagger
Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based
Data Deduplication in Slovak Corpora
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain
Using the BNC to create and develop educational materials and a website for learners of English
Using the BNC to create and develop educational materials and a website for learners of English Danny Minn a, Hiroshi Sano b, Marie Ino b and Takahiro Nakamura c a Kitakyushu University b Tokyo University
Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde
Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
Getting Started Guide for Developing tibbr Apps
Getting Started Guide for Developing tibbr Apps TABLE OF CONTENTS Understanding the tibbr Marketplace... 2 Integrating Apps With tibbr... 2 Developing Apps for tibbr... 2 First Steps... 3 Tutorial 1: Registering
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Automated Extraction of Security Policies from Natural-Language Software Documents
Automated Extraction of Security Policies from Natural-Language Software Documents Xusheng Xiao 1 Amit Paradkar 2 Suresh Thummalapenta 3 Tao Xie 1 1 Dept. of Computer Science, North Carolina State University,
Federal Employee Viewpoint Survey Online Reporting and Analysis Tool
Federal Employee Viewpoint Survey Online Reporting and Analysis Tool Tutorial January 2013 NOTE: If you have any questions about the FEVS Online Reporting and Analysis Tool, please contact your OPM point
Cloud Monitoring and Auditing with CADF (Cloud Auditing and Data Federation)
July, 2013 Portland Cloud Monitoring and Auditing with CADF (Cloud Auditing and Data Federation) Jacques Durand (Fujitsu) Matt Rutkowski (IBM) Disclaimer The information in this presentation represents
Technical Report. The KNIME Text Processing Feature:
Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold [email protected] [email protected] Copyright 2012 by KNIME.com AG
COMPUTATIONAL DATA ANALYSIS FOR SYNTAX
COLING 82, J. Horeck~ (ed.j North-Holland Publishing Compa~y Academia, 1982 COMPUTATIONAL DATA ANALYSIS FOR SYNTAX Ludmila UhliFova - Zva Nebeska - Jan Kralik Czech Language Institute Czechoslovak Academy
The Knowledge Sharing Infrastructure KSI. Steven Krauwer
The Knowledge Sharing Infrastructure KSI Steven Krauwer 1 Why a KSI? Building or using a complex installation requires specialized skills and expertise. CLARIN is no exception. CLARIN is populated with
XES. Standard Definition. Where innovation starts. Den Dolech 2, 5612 AZ Eindhoven P.O. Box 513, 5600 MB Eindhoven The Netherlands www.tue.
Den Dolech 2, 5612 AZ Eindhoven P.O. Box 513, 5600 MB Eindhoven The Netherlands www.tue.nl Author Christian W. Günther and Eric Verbeek Date October 29, 2012 Version 1.4 XES Standard Definition Where innovation
How RAI's Hyper Media News aggregation system keeps staff on top of the news
How RAI's Hyper Media News aggregation system keeps staff on top of the news 13 th Libre Software Meeting Media, Radio, Television and Professional Graphics Geneva - Switzerland, 10 th July 2012 Maurizio
GetLibraryUserOrderList
GetLibraryUserOrderList Webservice name: GetLibraryUserOrderList Adress: https://www.elib.se/webservices/getlibraryuserorderlist.asmx WSDL: https://www.elib.se/webservices/getlibraryuserorderlist.asmx?wsdl
The Web Web page Links 16-3
Chapter Goals Compare and contrast the Internet and the World Wide Web Describe general Web processing Write basic HTML documents Describe several specific HTML tags and their purposes 16-1 Chapter Goals
HTSQL is a comprehensive navigational query language for relational databases.
http://htsql.org/ HTSQL A Database Query Language HTSQL is a comprehensive navigational query language for relational databases. HTSQL is designed for data analysts and other accidental programmers who
Web Services Technologies
Web Services Technologies XML and SOAP WSDL and UDDI Version 16 1 Web Services Technologies WSTech-2 A collection of XML technology standards that work together to provide Web Services capabilities We
Terms and Definitions for CMS Administrators, Architects, and Developers
Sitecore CMS 6 Glossary Rev. 081028 Sitecore CMS 6 Glossary Terms and Definitions for CMS Administrators, Architects, and Developers Table of Contents Chapter 1 Introduction... 3 1.1 Glossary... 4 Page
The Language Archive at the Max Planck Institute for Psycholinguistics. Alexander König (with thanks to J. Ringersma)
The Language Archive at the Max Planck Institute for Psycholinguistics Alexander König (with thanks to J. Ringersma) Fourth SLCN Workshop, Berlin, December 2010 Content 1.The Language Archive Why Archiving?
Learning Translation Rules from Bilingual English Filipino Corpus
Proceedings of PACLIC 19, the 19 th Asia-Pacific Conference on Language, Information and Computation. Learning Translation s from Bilingual English Filipino Corpus Michelle Wendy Tan, Raymond Joseph Ang,
How is it helping? PragmatiQa XOData : Overview with an Example. P a g e 1 12. Doc Version : 1.3
XOData is a light-weight, practical, easily accessible and generic OData API visualizer / data explorer that is useful to developers as well as business users, business-process-experts, Architects etc.
Dynamic Styling in Web Development
Dynamic Styling in Web Development Creating PDSS, a Powerful Dynamic Styling Solution Markup Styling Client Side Programming Server Side Programming Database Querying Master s thesis by: Daniel Solsø Korsgård
T-110.5140 Network Application Frameworks and XML Web Services and WSDL 15.2.2010 Tancred Lindholm
T-110.5140 Network Application Frameworks and XML Web Services and WSDL 15.2.2010 Tancred Lindholm Based on slides by Sasu Tarkoma and Pekka Nikander 1 of 20 Contents Short review of XML & related specs
UIMA: Unstructured Information Management Architecture for Data Mining Applications and developing an Annotator Component for Sentiment Analysis
UIMA: Unstructured Information Management Architecture for Data Mining Applications and developing an Annotator Component for Sentiment Analysis Jan Hajič, jr. Charles University in Prague Faculty of Mathematics
The Prolog Interface to the Unstructured Information Management Architecture
The Prolog Interface to the Unstructured Information Management Architecture Paul Fodor 1, Adam Lally 2, David Ferrucci 2 1 Stony Brook University, Stony Brook, NY 11794, USA, [email protected] 2 IBM
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge White Paper October 2002 I. Translation and Localization New Challenges Businesses are beginning to encounter
Introduction to Web Services
Department of Computer Science Imperial College London CERN School of Computing (icsc), 2005 Geneva, Switzerland 1 Fundamental Concepts Architectures & escience example 2 Distributed Computing Technologies
HTML Web Page That Shows Its Own Source Code
HTML Web Page That Shows Its Own Source Code Tom Verhoeff November 2009 1 Introduction A well-known programming challenge is to write a program that prints its own source code. For interpreted languages,
Integrating Annotation Tools into UIMA for Interoperability
Integrating Annotation Tools into UIMA for Interoperability Scott Piao, Sophia Ananiadou and John McNaught School of Computer Science & National Centre for Text Mining The University of Manchester UK {scott.piao;sophia.ananiadou;john.mcnaught}@manchester.ac.uk
Developing XML Solutions with JavaServer Pages Technology
Developing XML Solutions with JavaServer Pages Technology XML (extensible Markup Language) is a set of syntax rules and guidelines for defining text-based markup languages. XML languages have a number
2. Distributed Handwriting Recognition. Abstract. 1. Introduction
XPEN: An XML Based Format for Distributed Online Handwriting Recognition A.P.Lenaghan, R.R.Malyan, School of Computing and Information Systems, Kingston University, UK {a.lenaghan,r.malyan}@kingston.ac.uk
