Towards a Universal Grammar for Natural Language Processing

Joakim Nivre
Uppsala University, Department of Linguistics and Philology

Based on collaborative work with Filip Ginter, Yoav Goldberg, Jan Hajič, Chris Manning, Ryan McDonald, Natalia Silveira, Marie de Marneffe, Slav Petrov, Sampo Pyysalo, Reut Tsarfaty, Daniel Zeman and many others

In its substance, grammar is one and the same in all languages, even if it accidentally varies.

Universal Grammar
- All human languages are species of a common genus
- Language structure is constrained by a universal cause
- There is order in the chaos of linguistic variation

Natural Language Processing
- Linguistic diversity makes our life harder
  - Why 90% parsing accuracy for English but only 80% for Finnish?
  - Can we even compare the numbers?
- Current NLP relies heavily on linguistic annotation:
  - In its substance, grammar is the same in all languages, even if the annotation accidentally varies.
- We need to bring some order into the chaos

[Figure: the same sentence in three languages, each drawn with dependency analyses of the coordinated object (relations dobj, cc, conj) that differ from one treebank to the next]
Language X (Swedish): En katt jagar råttor och möss
Language Y (Danish): En kat jager rotter og mus
Language Z (English): A cat chases rats and mice
Which languages are most closely related?
[The build-up labels the comparisons 1/5, 2/5 and 2/5 before revealing that X, Y and Z are Swedish, Danish and English.]

Why is this a problem?
- Hard to compare empirical results across languages
- Hard to evaluate cross-lingual learning
- Hard to build and maintain multilingual systems
- Hard to make progress towards a universal parser

Universal Dependencies
http://universaldependencies.org
[Figure: the Swedish, Danish and English sentences (En katt jagar råttor och möss / En kat jager rotter og mus / A cat chases rats and mice) and the French sentence "Toutefois, les filles adorent les desserts.", annotated on three layers]
- Dependency relations (for example advmod, dobj, cc, conj)
- Part-of-speech tags (ADV PUNCT DET NOUN VERB DET NOUN PUNCT for the French sentence)
- Morphological features (for example Definite=Def, Gender=Fem, Number=Plur, Person=3, Tense=Pres)

Universal Dependencies
http://universaldependencies.org
[Figure: earlier projects that feed into Universal Dependencies: Stanford Dependencies, Stanford UD, Google UD, HamleDT, Interset, Google universal tags]
Milestones:
- Kick-off meeting at EACL in Gothenburg, April 2014
- Release of annotation guidelines, Version 1, October 2014
- Release of treebanks for 10 languages, January 2015
- Release of treebanks for 18 languages, May 2015
- Release of treebanks for 33 languages, November 2015
Open community effort: anyone can contribute!

Goals and Requirements
- Cross-linguistically consistent grammatical annotation
- Support multilingual research and development in NLP
- Based on common usage and existing de facto standards

Design Principles
- Dependency
  - Widely used in practical NLP systems
  - Available in treebanks for many languages
- Lexicalism
  - Basic annotation units are words (syntactic words)
  - Words have morphological properties
  - Words enter into syntactic relations
- Recoverability
  - Transparent mapping from input text to word segmentation

Golden Rules
- Maximize parallelism
  - Don't annotate the same thing in different ways
  - Don't make different things look the same
- But don't overdo it
  - Don't annotate things that are not there
  - Languages select from a universal pool of categories
  - Allow language-specific extensions

Morphology
[Figure: the sentence "Toutefois, les filles adorent les desserts." annotated word by word; the dependency arcs legible in the transcript are advmod and dobj]

FORM       LEMMA      POS    FEATURES
Toutefois  toutefois  ADV
,          ,          PUNCT
les        le         DET    Definite=Def Number=Plur
filles     fille      NOUN   Gender=Fem Number=Plur
adorent    adorer     VERB   Number=Plur Person=3 Tense=Pres
les        le         DET    Definite=Def Number=Plur
desserts   dessert    NOUN   Gender=Masc Number=Plur
.          .          PUNCT

- Lemma representing the semantic content of the word
- Part-of-speech tag representing the abstract lexical category associated with the word
- Features representing lexical and grammatical properties associated with the lemma or the particular word form

Part-of-Speech Tags

Open    Closed   Other
ADJ     ADP      PUNCT
ADV     AUX      SYM
INTJ    CONJ     X
NOUN    DET
PROPN   NUM
VERB    PART
        PRON
        SCONJ

- Taxonomy of 17 universal part-of-speech tags, based on the Google Universal Tagset (Petrov et al., 2012)
- All languages use the same inventory, but not all tags have to be used by all languages
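
The rule that every language draws on one shared inventory without having to use all of it can be checked mechanically. The following is a minimal sketch, not part of the original slides: the names `UPOS_CLASSES`, `unknown_tags` and `unused_tags` are hypothetical, and the tag sets simply restate the table above.

```python
# Universal POS tags grouped as on the slide (UD version 1 inventory).
UPOS_CLASSES = {
    "open": {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"},
    "closed": {"ADP", "AUX", "CONJ", "DET", "NUM", "PART", "PRON", "SCONJ"},
    "other": {"PUNCT", "SYM", "X"},
}
UPOS = set().union(*UPOS_CLASSES.values())


def unknown_tags(tags):
    """Return the tags that are not in the universal inventory, in input order."""
    return [tag for tag in tags if tag not in UPOS]


def unused_tags(tags):
    """Return the universal tags a treebank never uses (allowed, just informative)."""
    return sorted(UPOS - set(tags))
```

A treebank passes the check when `unknown_tags` returns an empty list; a non-empty result from `unused_tags` is not an error, since a language may leave tags unused.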

Features

Lexical    Inflectional (Nominal)   Inflectional (Verbal)
PronType   Gender                   VerbForm
NumType    Animacy                  Mood
Poss       Number                   Tense
Reflex     Case                     Aspect
           Definite                 Voice
           Degree                   Person
                                    Negative

- Standardized inventory of morphological features, based on the Interset system (Zeman, 2008)
- Languages select relevant features and can add language-specific features or values with documentation
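
Features are attribute-value pairs, so a word's feature bundle maps naturally onto a dictionary. The sketch below is an illustration rather than anything from the slides: the function names are hypothetical, and it assumes the CoNLL-U serialisation in which pairs are joined by `|`, sorted by feature name, with `_` standing for an empty bundle.

```python
def parse_feats(feats):
    """Parse 'Name=Value|Name=Value' into a dict; '_' or '' means no features."""
    if feats in ("_", ""):
        return {}
    result = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        result[name] = value
    return result


def format_feats(feats):
    """Serialise a feature dict, sorted by feature name (case-insensitive)."""
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)
```

For example, `parse_feats("Number=Plur|Person=3|Tense=Pres")` gives the bundle shown for "adorent" in the morphology example, and `format_feats` writes it back in canonical order.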

Syntax
[Figure: dependency tree for the sentence below; the relation labels legible in the transcript are aux, aux, dobj, case and nmod]

The   cat   could  have  chased  all  the  dogs  down  the  street  .
DET   NOUN  AUX    AUX   VERB    DET  DET  NOUN  ADP   DET  NOUN    PUNCT

- Content words are related by dependency relations
- Function words attach to the content word they modify
- Punctuation attaches to the head of the phrase or clause
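
The three attachment principles can be made concrete by storing the tree as a list of head indices. This is a sketch under stated assumptions: only aux, dobj, case and nmod are legible in the transcript, so the remaining labels (nsubj, det, det:predet, punct, root) and all head indices are filled in following UD version 1 practice, and the helper `arcs` is hypothetical.

```python
# (form, UPOS, head index, relation); heads are 1-based, 0 marks the root.
# Labels other than aux, dobj, case and nmod are assumptions (see lead-in).
SENTENCE = [
    ("The", "DET", 2, "det"),
    ("cat", "NOUN", 5, "nsubj"),
    ("could", "AUX", 5, "aux"),
    ("have", "AUX", 5, "aux"),
    ("chased", "VERB", 0, "root"),
    ("all", "DET", 8, "det:predet"),
    ("the", "DET", 8, "det"),
    ("dogs", "NOUN", 5, "dobj"),
    ("down", "ADP", 11, "case"),
    ("the", "DET", 11, "det"),
    ("street", "NOUN", 5, "nmod"),
    (".", "PUNCT", 5, "punct"),
]

CONTENT_RELATIONS = {"nsubj", "dobj", "nmod"}


def arcs(sentence, relations):
    """Return (relation, head form, dependent form) for the chosen relations."""
    found = []
    for form, _upos, head, relation in sentence:
        if head != 0 and relation in relations:
            found.append((relation, sentence[head - 1][0], form))
    return found
```

Calling `arcs(SENTENCE, CONTENT_RELATIONS)` yields only links between content words, all headed by "chased", while `arcs(SENTENCE, {"aux", "case"})` shows the auxiliaries hanging off the verb and the preposition "down" hanging off the noun "street" rather than heading it.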

[Figure: parallel analyses of a passive clause in Swedish and English]

Hunden        jagades     av    katten        .
NOUN          VERB        ADP   NOUN          PUNCT
Definite=Def  Voice=Pass        Definite=Def
Relations: nsubjpass, nmod, case

The   dog   was   chased  by   the  cat   .
DET   NOUN  AUX   VERB    ADP  DET  NOUN  PUNCT
Relations: nsubjpass, auxpass, nmod, case

Dependency Relations
- Taxonomy of 40 universal grammatical relations, broadly attested in language typology (de Marneffe et al., 2014)
- Language-specific subtypes may be added
- Organizing principles
  - Three types of structures: nominals, clauses, modifiers
  - Core arguments vs. other dependents (not complements vs. adjuncts)

Dependents of Clausal Predicates

           Nominal     Clausal     Other
Core       nsubj       csubj
           nsubjpass   csubjpass
           dobj        ccomp
           iobj        xcomp
Non-core   nmod        advcl       advmod
           vocative                neg
           discourse               aux
           expl                    auxpass
                                   cop
                                   mark

[Figure: three example sentences with dependents of clausal predicates; the relation labels legible in the transcript are listed after each sentence]

Mary   was  quietly  reading  a    book  in   the  garden  .
PROPN  AUX  ADV      VERB     DET  NOUN  ADP  DET  NOUN    PUNCT
Relations: aux, advmod, dobj, nmod, case

If     you   are  sick  ,      you   should  not  exercise  .
SCONJ  PRON  AUX  ADJ   PUNCT  PRON  AUX     ADV  VERB      PUNCT
Relations: advcl, mark, cop, aux, neg

Peter  thought  that   he    should  stop  smoking  .
PROPN  VERB     SCONJ  PRON  AUX     VERB  VERB     PUNCT
Relations: ccomp, mark, aux, xcomp

Dependents of Nominals

Nominal   Clausal   Other
nummod    acl       amod
appos               case
nmod

Cairo  ,      the  lovely  capital  of   Egypt
PROPN  PUNCT  DET  ADJ     NOUN     ADP  PROPN
Relations: appos, amod, nmod, case

Coordination

Huey   ,      Dewey  and   Louie
PROPN  PUNCT  PROPN  CONJ  PROPN
Relations: conj, cc, conj

- Coordinate structures are headed by the first conjunct
- Subsequent conjuncts depend on it via the conj relation
- Conjunctions depend on it via the cc relation
- Punctuation marks depend on it via the punct relation
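
The coordination rules amount to a small procedure: pick the first conjunct as head and attach everything else to it. The sketch below illustrates this for the Huey example; the function `coordination_arcs` and the token "kind" labels are hypothetical, and the punct relation name is an assumption taken from UD version 1, since the transcript drops it.

```python
RELATION_FOR_KIND = {"conjunct": "conj", "cc": "cc", "punct": "punct"}


def coordination_arcs(tokens):
    """tokens: list of (form, kind) with kind in RELATION_FOR_KIND.

    Returns (relation, head form, dependent form) triples in which every
    token except the first conjunct depends on the first conjunct.
    """
    head = next(form for form, kind in tokens if kind == "conjunct")
    found = []
    seen_head = False
    for form, kind in tokens:
        if kind == "conjunct" and not seen_head:
            seen_head = True  # the first conjunct is the head itself
            continue
        found.append((RELATION_FOR_KIND[kind], head, form))
    return found


HUEY = [
    ("Huey", "conjunct"),
    (",", "punct"),
    ("Dewey", "conjunct"),
    ("and", "cc"),
    ("Louie", "conjunct"),
]
```

`coordination_arcs(HUEY)` attaches the comma, "Dewey", "and" and "Louie" directly to "Huey", which gives the flat, head-initial structure described in the bullet points.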

Multiword Expressions

Relation   Examples
mwe        in spite of, as well as, ad hoc
name       Roger Bacon, Carl XVI Gustaf, New York
compound   phone book, four thousand, dress up
goeswith   notwith standing, with out

- UD annotation does not permit words with spaces
- Multiword expressions are analysed using special relations
- The mwe, name and goeswith relations are always head-initial
- The compound relation reflects the internal structure

Other Relations

Relation     Explanation
parataxis    Loosely linked clauses of same rank
list         Lists without syntactic structure
remnant      Orphans in ellipsis linked to parallel elements
reparandum   Disfluency linked to (speech) repair
foreign      Elements within opaque stretches of code switching
dep          Unspecified dependency
root         Syntactically independent element of clause/phrase

Language-Specific Relations
- Language-specific relations are subtypes of universal relations added to capture important phenomena
- Subtyping permits us to back off to universal relations

Relation       Explanation
acl:relcl      Relative clause
compound:prt   Verb particle (dress up)
nmod:poss      Genitive nominal (Mary's book)
nmod:agent     Agent in passive (saved by the bell)
cc:preconj     Preconjunction (both ... and)
det:predet     Predeterminer (all those ...)
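
Because a subtype is written as the universal relation followed by a colon and a suffix, backing off is a single string operation. A minimal sketch, with hypothetical function names:

```python
from collections import Counter


def universal_relation(deprel):
    """Strip a language-specific subtype: 'nmod:poss' -> 'nmod'."""
    return deprel.split(":", 1)[0]


def relation_counts(deprels, back_off=True):
    """Count relation labels, optionally collapsing subtypes first."""
    if back_off:
        deprels = [universal_relation(label) for label in deprels]
    return Counter(deprels)
```

With `back_off=True`, labels such as nmod:poss and nmod:agent are counted together with plain nmod, so statistics from treebanks with different subtype inventories remain comparable.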

Word Segmentation
- How do we segment sentences into words?
  - Dependent on language and writing system, often non-trivial
  - Segmentation must be reproducible on new data
- Two options provided:
  - Only include words in treebank, but document segmentation
  - Include mapping from low-level tokenisation to words in treebank

Text:   Vámonos        al         mar    .
Words:  Vamos   nos    a    el    mar    .
POS:    VERB    PRON   ADP  DET   NOUN   PUNCT
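
The second option, keeping a mapping from surface tokens to syntactic words, can be sketched as a list of pairs from which both levels are recoverable. The data layout and the names `MAPPING` and `number_words` are assumptions made for illustration; the ID scheme (a range such as 1-2 for a multiword token, followed by its words) mirrors the ID column of the CoNLL-U example in this talk.

```python
# Each entry pairs one surface token with the syntactic words it contains.
MAPPING = [
    ("Vámonos", ["Vamos", "nos"]),
    ("al", ["a", "el"]),
    ("mar", ["mar"]),
    (".", ["."]),
]


def number_words(mapping):
    """Return (ID, FORM) pairs: a range line for each multiword token,
    followed by one line per syntactic word."""
    lines = []
    next_id = 1
    for surface, words in mapping:
        if len(words) > 1:
            lines.append((f"{next_id}-{next_id + len(words) - 1}", surface))
        for word in words:
            lines.append((str(next_id), word))
            next_id += 1
    return lines
```

The surface tokens are kept as the first element of each pair, so the input text segmentation can be read back directly, which is what the recoverability principle asks for.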

CoNLL-U Format
- Revised version of the CoNLL-X format
- Two-level segmentation and secondary dependencies

ID   FORM     LEMMA     UPOSTAG  XPOSTAG  FEATS                                 HEAD  DEPREL  DEPS  MISC
1-2  Vámonos  _         _        _        _                                     _     _       _     _
1    Vamos    ir        VERB     _        Mood=Imp|Number=Plur|Person=1         0     root    _     _
2    nos      nosotros  PRON     _        Number=Plur|Person=1|PronType=Per     1     expl    _     _
3-4  al       _         _        _        _                                     _     _       _     _
3    a        a         ADP      _        _                                     5     case    _     _
4    el       el        DET      _        Definite=Def|Gender=Masc|Number=Sing  5     det     _     _
5    mar      mar       NOUN     _        Gender=Masc|Number=Sing               1     nmod    _     _
6    .        .         PUNCT    _        _                                     1     punct   _     _

(An underscore marks an empty field.)
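
A reader for this format only has to split each line into ten tab-separated columns and treat range IDs separately from word IDs. The sketch below is an illustrative reader, not a reference implementation: the names `FIELDS`, `ROWS` and `read_sentence` are hypothetical, and the sample restates the Vámonos example with underscores for empty fields and with the root, det and punct labels and the PUNCT tag filled in as assumptions, since the transcript leaves those cells blank.

```python
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]

ROWS = [
    ["1-2", "Vámonos", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["1", "Vamos", "ir", "VERB", "_", "Mood=Imp|Number=Plur|Person=1", "0", "root", "_", "_"],
    ["2", "nos", "nosotros", "PRON", "_", "Number=Plur|Person=1|PronType=Per", "1", "expl", "_", "_"],
    ["3-4", "al", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["3", "a", "a", "ADP", "_", "_", "5", "case", "_", "_"],
    ["4", "el", "el", "DET", "_", "Definite=Def|Gender=Masc|Number=Sing", "5", "det", "_", "_"],
    ["5", "mar", "mar", "NOUN", "_", "Gender=Masc|Number=Sing", "1", "nmod", "_", "_"],
    ["6", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)


def read_sentence(text):
    """Split one sentence into syntactic words and multiword-token ranges."""
    words, ranges = [], []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        row = dict(zip(FIELDS, columns))
        if "-" in row["id"]:
            start, end = row["id"].split("-")
            ranges.append((int(start), int(end), row["form"]))
        else:
            row["id"] = int(row["id"])
            row["head"] = int(row["head"])
            words.append(row)
    return words, ranges
```

`read_sentence(SAMPLE)` returns six syntactic words and the two ranges for "Vámonos" and "al"; the dependency tree is defined over the words only, while the ranges preserve the link back to the surface tokens.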

Where are we now?
- Universal Dependencies, Version 1
  - Guidelines released October 2014
  - Latest treebank release November 2015 (v1.2): Ancient Greek, Arabic, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Slovenian, Spanish, Swedish, Tamil
- Future plans:
  - New releases every six months (May, November)
  - Revision of guidelines as needed
- Have a look at http://universaldependencies.org

So what exactly is UD?
- A new linguistic theory? Not at all, but we like to think it is informed by linguistic theory and potentially useful also for linguistic studies
- A better parsing framework? Probably not, since parsers seem to prefer function words as heads, so we may have to tweak the representations for parsing
- The ultimate annotation scheme? Not quite, more like a lingua franca for treebank developers and definitely useful for some annotation projects
- A universal grammar? Well, who knows? Not in the Chomskyan sense, but hopefully in the more practical sense of facilitating multilingual NLP by bringing a little order into the chaos

Acknowledgments

Core UD group: Filip Ginter, Yoav Goldberg, Jan Hajič, Chris Manning, Ryan McDonald, Natalia Silveira, Marie de Marneffe, Slav Petrov, Sampo Pyysalo, Reut Tsarfaty, Dan Zeman

UD contributors: Željko Agić, Riyaz Ahmad, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Cristina Bosco, Giuseppe G. A. Celano, Jinho Choi, Çağrı Çöltekin, Kaja Dobrovoljc, Timothy Dozat, Binyam Ephrem, Tomaž Erjavec, Richárd Farkas, Jennifer Foster, Koldo Gojenola, Iakes Goenaga, Bruno Guillaume, Nizar Habash, Dag Haug, Anders Trærup Johannsen, Hiroshi Kanayama, Jenna Kanerva, Simon Krek, Juha Kuokkala, Veronika Laippala, Alessandro Lenci, Krister Lindén, Nikola Ljubešić, Olga Lyashevskaya, Teresa Lynn, Aibek Makazhanov, Catalina Maranduc, Héctor Martínez Alonso, Anna Missilä, Verginica Mititelu, Yusuke Miyao, Simonetta Montemagni, Shinsuke Mori, Hanna Nurmi, Petya Osenova, Lilja Øvrelid, Elena Pascual, Marco Passarotti, Jussi Piitulainen, Barbara Plank, Prokopis Prokopidis, Loganathan Ramasamy, Wolfgang Seeker, Mojgan Seraji, Maria Simi, Kiril Simov, Arne Skjærholt, Aaron Smith, Jan Štěpánek, Takaaki Tanaka, Francis Tyers, Sumire Uematsu, Veronika Vincze, Rob Voigt, Jonathan Washington