Towards a Universal Grammar for Natural Language Processing

Joakim Nivre
Uppsala University, Department of Linguistics and Philology

Based on collaborative work with Filip Ginter, Yoav Goldberg, Jan Hajič, Chris Manning, Ryan McDonald, Natalia Silveira, Marie de Marneffe, Slav Petrov, Sampo Pyysalo, Reut Tsarfaty, Daniel Zeman and many others

In its substance, grammar is one and the same in all languages, even if it accidentally varies.

Universal Grammar
- All human languages are species of a common genus
- Language structure is constrained by a universal cause
- There is order in the chaos of linguistic variation

Natural Language Processing
- Linguistic diversity makes our life harder
  - Why 90% parsing accuracy for English but only 80% for Finnish?
  - Can we even compare the numbers?
- Current NLP relies heavily on linguistic annotation:
  "In its substance, grammar is the same in all languages, even if the annotation accidentally varies."
- We need to bring some order into the chaos

[Figure: the same sentence in three languages, each with a dependency tree drawn under a different annotation scheme; visible relation labels: dobj, cc, conj]
- Language X (Swedish): En katt jagar råttor och möss
- Language Y (Danish): En kat jager rotter og mus
- Language Z (English): A cat chases rats and mice
Which languages are most closely related?
Scores shown between pairs of trees: 1/5, 2/5, 2/5

Why is this a problem?
- Hard to compare empirical results across languages
- Hard to evaluate cross-lingual learning
- Hard to build and maintain multilingual systems
- Hard to make progress towards a universal parser

Universal Dependencies
http://universaldependencies.org
[Figure: four example sentences with consistent annotation; visible relation labels: dobj, cc, conj, advmod]
  En katt jagar råttor och möss
  En kat jager rotter og mus
  A cat chases rats and mice
  Toutefois, les filles adorent les desserts.
Three layers of annotation:
- Dependency relations
- Part-of-speech tags
- Morphological features

Universal Dependencies
http://universaldependencies.org
[Diagram: earlier projects brought together in Universal Dependencies]
- Stanford Dependencies
- Stanford UD
- Google UD
- HamleDT
- Interset
- Google universal tags

Universal Dependencies
http://universaldependencies.org
Milestones:
- Kick-off meeting at EACL in Gothenburg, April 2014
- Release of annotation guidelines, Version 1, October 2014
- Release of treebanks for 10 languages, January 2015
- Release of treebanks for 18 languages, May 2015
- Release of treebanks for 33 languages, November 2015
Open community effort: anyone can contribute!

Goals and Requirements
- Cross-linguistically consistent grammatical annotation
- Support multilingual research and development in NLP
- Based on common usage and existing de facto standards

Design Principles
- Dependency
  - Widely used in practical NLP systems
  - Available in treebanks for many languages
- Lexicalism
  - Basic annotation units are words (syntactic words)
  - Words have morphological properties
  - Words enter into syntactic relations
- Recoverability
  - Transparent mapping from input text to word segmentation

Golden Rules
- Maximize parallelism
  - Don't annotate the same thing in different ways
  - Don't make different things look the same
- But don't overdo it
  - Don't annotate things that are not there
  - Languages select from a universal pool of categories
  - Allow language-specific extensions

Morphology
Example: Toutefois, les filles adorent les desserts.

FORM       LEMMA      UPOS   FEATURES
Toutefois  toutefois  ADV
,          ,          PUNCT
les        le         DET    Definite=Def Number=Plur
filles     fille      NOUN   Gender=Fem Number=Plur
adorent    adorer     VERB   Number=Plur Person=3 Tense=Pres
les        le         DET    Definite=Def Number=Plur
desserts   dessert    NOUN   Gender=Masc Number=Plur
.          .          PUNCT

- Lemma representing the semantic content of the word
- Part-of-speech tag representing the abstract lexical category associated with the word
- Features representing lexical and grammatical properties associated with the lemma or the particular word form
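
The three morphological layers can be pictured as one record per word. The Python sketch below is only an illustration: the class name `Word` and its field names are assumptions made here, not part of UD or of any UD tool, and the values are those of the French example sentence.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Word:
    """One syntactic word with its lemma, tag and features (hypothetical names)."""
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str] = field(default_factory=dict)


sentence: List[Word] = [
    Word("Toutefois", "toutefois", "ADV"),
    Word(",", ",", "PUNCT"),
    Word("les", "le", "DET", {"Definite": "Def", "Number": "Plur"}),
    Word("filles", "fille", "NOUN", {"Gender": "Fem", "Number": "Plur"}),
    Word("adorent", "adorer", "VERB",
         {"Number": "Plur", "Person": "3", "Tense": "Pres"}),
    Word("les", "le", "DET", {"Definite": "Def", "Number": "Plur"}),
    Word("desserts", "dessert", "NOUN", {"Gender": "Masc", "Number": "Plur"}),
    Word(".", ".", "PUNCT"),
]

# Lemmas of all words that carry the feature Number=Plur.
plural_lemmas = [w.lemma for w in sentence if w.feats.get("Number") == "Plur"]
```

Keeping features as attribute-value pairs means that a language can leave out a feature it does not use without changing the shape of the record.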

Part-of-Speech Tags

Open    Closed   Other
ADJ     ADP      PUNCT
ADV     AUX      SYM
INTJ    CONJ     X
NOUN    DET
PROPN   NUM
VERB    PART
        PRON
        SCONJ

- Taxonomy of 17 universal part-of-speech tags, based on the Google Universal Tagset (Petrov et al., 2012)
- All languages use the same inventory, but not all tags have to be used by all languages
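
Because the inventory is fixed, checking a tagged corpus against it is a simple set lookup. The following is a minimal sketch under that assumption; the function name `unknown_tags` is hypothetical, and the three groups are the open, closed and other classes listed on this slide.

```python
OPEN_CLASS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}
CLOSED_CLASS = {"ADP", "AUX", "CONJ", "DET", "NUM", "PART", "PRON", "SCONJ"}
OTHER = {"PUNCT", "SYM", "X"}
UNIVERSAL_TAGS = OPEN_CLASS | CLOSED_CLASS | OTHER


def unknown_tags(tags):
    """Return the tags that are not in the universal inventory, in input order."""
    return [tag for tag in tags if tag not in UNIVERSAL_TAGS]
```

A language that never uses, say, `PART` still passes this check, since the inventory is shared but no tag is obligatory.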

Features

Lexical    Inflectional (Nominal)   Inflectional (Verbal)
PronType   Gender                   VerbForm
NumType    Animacy                  Mood
Poss       Number                   Tense
Reflex     Case                     Aspect
           Definite                 Voice
           Degree                   Person
                                    Negative

- Standardized inventory of morphological features, based on the Interset system (Zeman, 2008)
- Languages select relevant features and can add language-specific features or values with documentation

Syntax
[Figure: dependency tree; visible relation labels: aux, aux, dobj, nmod, case]
  The cat could have chased all the dogs down the street.
  DET NOUN AUX AUX VERB DET DET NOUN ADP DET NOUN PUNCT
- Content words are related by dependency relations
- Function words attach to the content word they modify
- Punctuation attaches to the head of the phrase or clause
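
One consequence of these principles is that function words should not govern other words. The sketch below checks this on the example sentence. The head indices are an assumed reading of the tree (the figure itself is not reproduced here), and the set `FUNCTION_UPOS` and the function name are hypothetical choices made for this illustration.

```python
# (form, upos, head): heads are 1-based word positions, 0 marks the root.
# The head indices are an assumed reading of the example tree.
SENTENCE = [
    ("The", "DET", 2), ("cat", "NOUN", 5), ("could", "AUX", 5),
    ("have", "AUX", 5), ("chased", "VERB", 0), ("all", "DET", 8),
    ("the", "DET", 8), ("dogs", "NOUN", 5), ("down", "ADP", 11),
    ("the", "DET", 11), ("street", "NOUN", 5), (".", "PUNCT", 5),
]

# Tags treated as function words or punctuation in this sketch (an assumption).
FUNCTION_UPOS = {"ADP", "AUX", "CONJ", "DET", "PART", "SCONJ", "PUNCT"}


def function_word_heads(sentence):
    """Return the forms of function words that govern at least one other word."""
    used_as_head = {head for _, _, head in sentence}
    return [form for position, (form, upos, _) in enumerate(sentence, start=1)
            if upos in FUNCTION_UPOS and position in used_as_head]


# A preposition-headed variant, for contrast: "street" attaches to "down".
PREP_HEADED = list(SENTENCE)
PREP_HEADED[8] = ("down", "ADP", 5)
PREP_HEADED[10] = ("street", "NOUN", 9)
```

With content-word heads the check returns an empty list; in the preposition-headed variant it reports `down`.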

[Figure: parallel analysis of a morphological passive (Swedish) and a periphrastic passive (English)]
  Hunden jagades av katten.
  NOUN VERB ADP NOUN PUNCT
  Features: Hunden Definite=Def; jagades Voice=Pass; katten Definite=Def
  Relations shown: nsubjpass, nmod, case

  The dog was chased by the cat.
  DET NOUN AUX VERB ADP DET NOUN PUNCT
  Relations shown: nsubjpass, auxpass, nmod, case

Dependency Relations
- Taxonomy of 40 universal grammatical relations, broadly attested in language typology (de Marneffe et al., 2014)
- Language-specific subtypes may be added
- Organizing principles
  - Three types of structures: nominals, clauses, modifiers
  - Core arguments vs. other dependents (not complements vs. adjuncts)

Dependents of Clausal Predicates

           Nominal                            Clausal                           Other
Core       nsubj, nsubjpass, dobj, iobj       csubj, csubjpass, ccomp, xcomp
Non-Core   nmod, vocative, discourse, expl    advcl                             advmod, neg, aux, auxpass, cop, mark

[Figure: example dependency trees for dependents of clausal predicates]
  Mary was quietly reading a book in the garden.
  PROPN AUX ADV VERB DET NOUN ADP DET NOUN PUNCT
  Relations shown: aux, advmod, dobj, nmod, case

  If you are sick, you should not exercise.
  SCONJ PRON AUX ADJ PUNCT PRON AUX ADV VERB PUNCT
  Relations shown: mark, cop, advcl, aux, neg

  Peter thought that he should stop smoking.
  PROPN VERB SCONJ PRON AUX VERB VERB PUNCT
  Relations shown: ccomp, mark, aux, xcomp

Dependents of Nominals

Nominal   Clausal   Other
nummod    acl       amod
appos               case
nmod

[Figure: example dependency tree]
  Cairo, the lovely capital of Egypt
  PROPN PUNCT DET ADJ NOUN ADP PROPN
  Relations shown: appos, amod, nmod, case

Coordination
[Figure: example dependency tree; relations shown: conj, cc, conj]
  Huey, Dewey and Louie
  PROPN PUNCT PROPN CONJ PROPN
- Coordinate structures are headed by the first conjunct
- Subsequent conjuncts depend on it via the conj relation
- Conjunctions depend on it via the cc relation
- Punctuation marks depend on it via the punct relation
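
Since every later conjunct attaches directly to the first one, the members of a coordinate structure can be collected in one pass over the words. The sketch below illustrates this on the example phrase; the tuple layout, the function name `conjuncts` and the `root` label on the first conjunct (the phrase is treated as standing alone) are assumptions made for this illustration.

```python
# (id, form, head, deprel): heads are 1-based word ids, 0 marks the root.
PHRASE = [
    (1, "Huey", 0, "root"),
    (2, ",", 1, "punct"),
    (3, "Dewey", 1, "conj"),
    (4, "and", 1, "cc"),
    (5, "Louie", 1, "conj"),
]


def conjuncts(words, first_id):
    """Return the forms of the coordinate structure headed by word `first_id`."""
    forms = {word_id: form for word_id, form, _, _ in words}
    later = [word_id for word_id, _, head, deprel in words
             if head == first_id and deprel == "conj"]
    return [forms[first_id]] + [forms[word_id] for word_id in sorted(later)]
```

Calling `conjuncts(PHRASE, 1)` gathers the three names and skips the conjunction and the comma, which hang off the same head under `cc` and `punct`.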

Multiword Expressions

Relation   Examples
mwe        in spite of, as well as, ad hoc
name       Roger Bacon, Carl XVI Gustaf, New York
compound   phone book, four thousand, dress up
goeswith   notwith standing, with out

- UD annotation does not permit words with spaces
- Multiword expressions are analysed using special relations
- The mwe, name and goeswith relations are always head-initial
- The compound relation reflects the internal structure

Other Relations

Relation     Explanation
parataxis    Loosely linked clauses of same rank
list         Lists without syntactic structure
remnant      Orphans in ellipsis linked to parallel elements
reparandum   Disfluency linked to (speech) repair
foreign      Elements within opaque stretches of code switching
dep          Unspecified dependency
root         Syntactically independent element of clause/phrase

Language-Specific Relations
- Language-specific relations are subtypes of universal relations added to capture important phenomena
- Subtyping permits us to back off to universal relations

Relation       Explanation
acl:relcl      Relative clause
compound:prt   Verb particle (dress up)
nmod:poss      Genitive nominal (Mary's book)
nmod:agent     Agent in passive (saved by the bell)
cc:preconj     Preconjunction (both ... and)
det:predet     Predeterminer (all those ...)
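
Backing off is cheap because a subtype is written as the universal relation, a colon, and the language-specific part. A minimal sketch, assuming labels follow that `universal:subtype` pattern (the function name `universal_relation` is hypothetical):

```python
def universal_relation(deprel):
    """Strip a language-specific subtype, e.g. 'acl:relcl' becomes 'acl'."""
    return deprel.split(":", 1)[0]


SUBTYPES = ["acl:relcl", "compound:prt", "nmod:poss", "nmod:agent", "cc:preconj"]
backed_off = [universal_relation(label) for label in SUBTYPES]
```

A label without a subtype is returned unchanged, so the same function can be applied to every relation in a treebank before cross-lingual comparison.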

Word Segmentation
- How do we segment sentences into words?
  - Dependent on language and writing system, often non-trivial
  - Segmentation must be reproducible on new data
- Two options provided:
  - Only include words in treebank, but document segmentation
  - Include mapping from low-level tokenisation to words in treebank

Example (Spanish):
  Tokens:  Vámonos al mar .
           ?       ?  NOUN PUNCT
  Words:   Vamos nos a el mar .
           VERB PRON ADP DET NOUN PUNCT

CoNLL-U Format

ID   FORM     LEMMA     UPOSTAG  XPOSTAG  FEATS                                 HEAD  DEPREL  DEPS  MISC
1-2  Vámonos  _         _        _        _                                     _     _       _     _
1    Vamos    ir        VERB     _        Mood=Imp|Number=Plur|Person=1         0     root    _     _
2    nos      nosotros  PRON     _        Number=Plur|Person=1|PronType=Prs     1     expl    _     _
3-4  al       _         _        _        _                                     _     _       _     _
3    a        a         ADP      _        _                                     5     case    _     _
4    el       el        DET      _        Definite=Def|Gender=Masc|Number=Sing  5     det     _     _
5    mar      mar       NOUN     _        Gender=Masc|Number=Sing               1     nmod    _     _
6    .        .         PUNCT    _        _                                     1     punct   _     _

- Revised version of the CoNLL-X format
- Two-level segmentation and secondary dependencies
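
The two-level segmentation can be read back with a few lines of code: rows whose ID is a range (such as `1-2`) describe surface tokens, and all other rows describe syntactic words. The Python sketch below is an illustration, not a reference reader. It assumes tab-separated columns with `_` for empty fields; the `root`, `det` and `punct` labels, the `PUNCT` tag on the final row and the `PronType=Prs` value are filled in here as assumptions; and the function names are hypothetical.

```python
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]

CONLLU = (
    "1-2\tVámonos\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tVamos\tir\tVERB\t_\tMood=Imp|Number=Plur|Person=1\t0\troot\t_\t_\n"
    "2\tnos\tnosotros\tPRON\t_\tNumber=Plur|Person=1|PronType=Prs\t1\texpl\t_\t_\n"
    "3-4\tal\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\ta\ta\tADP\t_\t_\t5\tcase\t_\t_\n"
    "4\tel\tel\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing\t5\tdet\t_\t_\n"
    "5\tmar\tmar\tNOUN\t_\tGender=Masc|Number=Sing\t1\tnmod\t_\t_\n"
    "6\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)


def parse_sentence(block):
    """Split one sentence into word rows and (start, end, form) token ranges."""
    words, ranges = [], []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        if "-" in row["id"]:
            start, end = (int(part) for part in row["id"].split("-"))
            ranges.append((start, end, row["form"]))
        else:
            words.append(row)
    return words, ranges


def surface_tokens(words, ranges):
    """Rebuild the surface tokens, assuming word ids run 1..n without gaps."""
    range_at = {start: (end, form) for start, end, form in ranges}
    tokens, position = [], 1
    while position <= len(words):
        if position in range_at:
            end, form = range_at[position]
            tokens.append(form)
            position = end + 1
        else:
            tokens.append(words[position - 1]["form"])
            position += 1
    return tokens


words, ranges = parse_sentence(CONLLU)
tokens = surface_tokens(words, ranges)
```

Keeping the range rows in the file is what makes the mapping from input text to syntactic words recoverable, which is the second of the two segmentation options.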

Where are we now?
- Universal Dependencies, Version 1
  - Guidelines released October 2014
  - Latest treebank release November 2015 (v1.2): Ancient Greek, Arabic, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Slovenian, Spanish, Swedish, Tamil
- Future plans:
  - New releases every six months (May, November)
  - Revision of guidelines as needed
- Have a look at http://universaldependencies.org

So what exactly is UD?
- A new linguistic theory?
  Not at all, but we like to think it is informed by linguistic theory and potentially useful for linguistic studies as well
- A better parsing framework?
  Probably not, since parsers seem to prefer function words as heads, so we may have to tweak the representations for parsing
- The ultimate annotation scheme?
  Not quite; more like a lingua franca for treebank developers, and definitely useful for some annotation projects
- A universal grammar? Well, who knows?
  Not in the Chomskyan sense, but hopefully in the more practical sense of facilitating multilingual NLP by bringing a little order into the chaos

Acknowledgments

Core UD group: Filip Ginter, Yoav Goldberg, Jan Hajič, Chris Manning, Ryan McDonald, Natalia Silveira, Marie de Marneffe, Slav Petrov, Sampo Pyysalo, Reut Tsarfaty, Dan Zeman

UD contributors: Željko Agić, Riyaz Ahmad, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Cristina Bosco, Giuseppe G. A. Celano, Jinho Choi, Çağrı Çöltekin, Kaja Dobrovoljc, Timothy Dozat, Binyam Ephrem, Tomaž Erjavec, Richárd Farkas, Jennifer Foster, Koldo Gojenola, Iakes Goenaga, Bruno Guillaume, Nizar Habash, Dag Haug, Anders Trærup Johannsen, Hiroshi Kanayama, Jenna Kanerva, Simon Krek, Juha Kuokkala, Veronika Laippala, Alessandro Lenci, Krister Lindén, Nikola Ljubešić, Olga Lyashevskaya, Teresa Lynn, Aibek Makazhanov, Catalina Maranduc, Héctor Martínez Alonso, Anna Missilä, Verginica Mititelu, Yusuke Miyao, Simonetta Montemagni, Shinsuke Mori, Hanna Nurmi, Petya Osenova, Lilja Øvrelid, Elena Pascual, Marco Passarotti, Jussi Piitulainen, Barbara Plank, Prokopis Prokopidis, Loganathan Ramasamy, Wolfgang Seeker, Mojgan Seraji, Maria Simi, Kiril Simov, Arne Skjærholt, Aaron Smith, Jan Štěpánek, Takaaki Tanaka, Francis Tyers, Sumire Uematsu, Veronika Vincze, Rob Voigt, Jonathan Washington