Lexical convergence in the Dutch lexicon



Similar documents
Tracking change in word meaning

Doe wat je niet laten kan: A usage-based analysis of Dutch causative constructions. Natalia Levshina

Distributional Semantic Modelling in Cognitive Sociolinguistics: QLVL probes Semantic Space

Comparing constructicons: A cluster analysis of the causative constructions with doen in Netherlandic and Belgian Dutch.

Doe wat je niet laten kan: A usage-based analysis of Dutch causative constructions

Finding Syntactic Characteristics of Surinamese Dutch

Varieties of lexical variation

The Chat Box Revelation On the chat language of Flemish adolescents and young adults

Acquiring grammatical gender in northern and southern Dutch. Jan Klom, Gunther De Vogelaer

How To Identify And Represent Multiword Expressions (Mwe) In A Multiword Expression (Irme)

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

Cultural Trends and language change

A chart generator for the Dutch Alpino grammar

Annotation Guidelines for Dutch-English Word Alignment

Explaining register and sociolinguistic variation in the lexicon: Corpus studies on Dutch

Applying quantitative methods to dialect Dutch verb clusters

Degrees of adverbialization

Crossing Corpora. Modelling Semantic Similarity across Languages and Lects.

LEXICOGRAPHIC ISSUES IN COMBINATORICS

Linguistic Research with CLARIN. Jan Odijk MA Rotation Utrecht,

Study programmes Faculty of Philosophy and Letters Master in Modern Languages and Literatures : German, Dutch and English, Teaching Focus

Thematic File Road Safety 5 Distraction in Traffic

Doe wat je niet laten kan: A usage-based analysis of Dutch causative constructions

Distinguishing Prepositional Complements from Fixed Arguments

Exploring Topic Models for Word Sense Discrimination

Cross-lingual Synonymy Overlap

Project notes of CLARIN project DiscAn: Towards a Discourse Annotation system for Dutch language corpora

Processing: current projects and research at the IXA Group

Corpus-based acquisition of collocational prepositional phrases

~ We are all goddesses, the only problem is that we forget that when we grow up ~

Post-doctoral researcher, Faculty of Translation Studies, University College Ghent

ABN AMRO Bank N.V. The Royal Bank of Scotland N.V. ABN AMRO Holding N.V. RBS Holdings N.V. ABN AMRO Bank N.V.

The University of Amsterdam s Question Answering System at QA@CLEF 2007

Corporate Security & Identity

SAND: Relation between the Database and Printed Maps

University of Marburg, RC Deutscher Sprachatlas University of Leuven, RU Quantitative Lexicology and Variational Linguistics

1 von 91 RMS WiSe 2014/15/Academic Working/Seiten/Startseite

XQuery and Data Extraction

Cambridge International Examinations Cambridge International General Certificate of Secondary Education

Semantic Clustering in Dutch

Master in Digital Humanities

NAAR NEDERLAND HANDLEIDING

Gert-Jan Put December 2014

Metadata for Corpora PATCOR and Domotica-2

Comparing Ontology-based and Corpusbased Domain Annotations in WordNet.

Clustering Connectionist and Statistical Language Processing

Tom Ruette, Dirk Speelman, Dirk Geeraerts. Quantitative Lexicology and Variational Linguistics, University of Leuven

Example-based Translation without Parallel Corpora: First experiments on a prototype

CLARIN project DiscAn :

#BusinessMeetsIT. Welcome. Seminar BI& Datacenters

Introduction VOLKER GAST. 1. Central questions addressed in this issue

Dialect Corpora Taken Further: The DynaSAND corpus and its application in newer tools

Comparing methods for automatic acquisition of Topic Signatures

Sentence Semantics. General Linguistics Jennifer Spenader, February 2006 (Most slides: Petra Hendriks)

NEDERBOOMS Treebank Mining for Data- based Linguistics. Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde

Off-line answer extraction for Dutch QA

Jolien Grandia MA MSc

Jolien Grandia MA MSc

Asking what. a person looks like some persons look like

Hybrid Strategies. for better products and shorter time-to-market

The polysemy of lexeme formation rules

Your boldest wishes concerning online corpora: OpenSoNaR and you

Timeline (1) Text Mining Master TKI. Timeline (2) Timeline (3) Overview. What is Text Mining?

Curriculum Vitae - Maike van Damme

Pascale Simons - Curriculum Vitae

Corpus and Discourse. The Web As Corpus. Theory and Practice MARISTELLA GATTO LONDON NEW DELHI NEW YORK SYDNEY

ehg New Trends in e Humanities Amsterdam

How To Kill A Black Person

Nominative-Dative Inversion and the Decline of Dutch

Sum of all paintings opening slide Introduce myself. Nlwp, Commons, Wikidata, GLAMwiki, bots, Wiki Loves Monuments, uploads, Based on Wikimania 2015

29. Non-standard varieties in the new media

ž Lexicalized Tree Adjoining Grammar (LTAG) associates at least one elementary tree with every lexical item.

Long, often quite boring, notes of meetings

Ling 201 Syntax 1. Jirka Hana April 10, 2006

De veranderende samenleving. De smartphone revolutie

Voorbeeld. Preview ISO INTERNATIONAL STANDARD. Cranes Requirements for test loads

Verbondstraat 52, B Antwerp, Belgium Telephone

Taal & Tongval Borrowing. Pragmatic and Variational Linguistic Approaches. Gent, KANTL, 27 November

Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets

Conceptual Change Digital Humanities Case Studies (7-8 December 2015)

TAPAS. Arnt Offringa Director R&D Fokker Aerostructures Ypenburg, 12 juni 2013

Distraction among bicyclists and pedestrians

Market Intelligence & Research Services. CRM Trends Overview. MarketCap International BV Januari 2011

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

Lexical Competition: Round in English and Dutch

Kansen in KP7 NMP. Aansluitend op de HTSM Roadmap Nanotechnologie. 11 juni Melvin A. Kasanrokijat

PoliticalMashup. Make implicit structure and information explicit. Content

Curriculum Vitae. Sophie Op de Beeck

Detecting semantic overlap: Announcing a Parallel Monolingual Treebank for Dutch

Tooway. consumenten prijzen KASAT. Tooway Nederland Rian BV Bergstraat BL Waalre Prijzen zijn excl 21% BTW

Natural answer presentation. through revision. of syntactic patterns

Introduction. 1.1 Kinds and generalizations

Schadevoorziening bij brand- en bouwveiligheid

Students Conceptions of Constructivist Learning. Sofie Loyens

Summary CHILD ABUSE. Leiden Attachment Research Program

Brill s rule-based PoS tagger

How is life in your region? North of the Netherlands from an international perspective. Resilience of regions. Well-being in the region

MULTIFUNCTIONAL DICTIONARIES

Transcription:

Overview Introduction Dutch Method Results Lexical convergence in the Dutch lexicon Jocelyne Daems Kris Heylen Dirk Geeraerts University of Leuven RU Quantitative Lexicology and Variational Linguistics Conclusion

Overview 1. Introduction 2. Dutch 3. Method Uniformity measure Data 4. Results 5. Conclusion References

1. Introduction Variation in Dutch Lectal variation register region gender generation profession and education Linguistic variation vrachtwagen of vrachtauto (of camion of truck) dat ken/kan/kun jij wel zeggen zeven of zeuven

1. Introduction Variation in Dutch Lectal variation register region gender generation profession and education Linguistic variation vrachtwagen of vrachtauto dat ken/kan jij wel zeggen zeven of zeuven A specific context can come with its own set of linguistic choices lect

1. Introduction Variation in Dutch in this study Lectal variation region and register geographical and stylistic/stratificational question Linguistic variation lexicon onomasiological question

1. Introduction Research aim Measure lexical convergence between Dutch in Belgium and Dutch in the Netherlands by means of the word choice for emotion-, IT- and traffic concepts. Cf. Geeraerts, Grondelaers & Speelman 1999

2. Dutch Dutch Two national varieties (pluricentric language (Clyne 1992)) Netherlandic Dutch Belgian Dutch Stratification Poldernederlands? Verkavelingsvlaams, soapvlaams, tussentaal in-between language

2. Dutch Standardisation Different development of the standardisation process the Netherlands: indepdent development of a/the standard variant of Dutch Belgium: after a long delay, promotion of convergence with the (established) Netherlandic Dutch norm Two opposed views in Belgium Integrationism: Netherlandic Dutch as the norm Particularism: room for a Belgian Dutch standard variant

3. Method Geeraerts, Grondelaers & Speelman 1999 uniformity measure for onomasiological variation to what extent do people for a given concept choose the same words? how often do Belgians and Dutchmen express vrachtwagen ( truck ) in the same way?

3. Method - Uniformity measure Onomasiological profile concept synonym1 synonym2 synony3 synonym4 absfreq absfreq absfreq absfreq relfreq relfreq relfreq relfreq

3. Method - Uniformity measure Onomasiological profile: concrete vrachtwagen vrachtwagen vrachtauto truck camion 11658 130 1680 104 86% 1% 12% 1%

3. Method - Uniformity measure Measure lexical convergence: five steps 1. Per concept build the onomasiological profile 2. Per variety build the onomasiological profile 3. Measure uniformity 4. Aggregate over all concepts 5. (Weigh the uniformity measure by means of the concept frequency)

3. Method - Uniformity measure Measure lexical convergence: concrete vrachtwagen vrachtwagen vrachtauto truck camion BE NL BE NL BE NL BE NL 86% 74% 1% 12% 12% 14% 1% 0%

3. Method - Uniformity measure Measure lexical convergence: five steps 1. Per concept build the onomasiological profile 2. Per variety build the onomasiological profile 3. Measure uniformity = overlap in lexicalisation preferences = sum of the smallest relative value for each term 4. Aggregate over all concepts 5. (Weigh the uniformity measure by means of the concept frequency)

3. Method - Uniformity measure Measure lexical convergence: concrete vrachtwagen vrachtwagen vrachtauto truck camion BE NL BE NL BE NL BE NL 86% 74% 1% 12% 12% 14% 1% 0% 74% + 1% + 12% + 0% = 87% measure of uniformity

3. Method - Uniformity measure Measure lexical convergence: five steps 1. Per concept build the onomasiological profile 2. Per variety build the onomasiological profile 3. Measure uniformity 4. Aggregate over all concepts = repeat for each concept in the semantic field + compute the mean 5. (Weigh the uniformity measure by means of the concept frequency)

3. Method - Uniformity measure Measure lexical convergence: concrete U vrachtwagen 87% metro 97% trein 100% stoep 61% bestelwagen 45% mean 78%

3. Method - Uniformity measure Measure lexical convergence: five steps 1. Per concept build the onomasiological profile 2. Per variety build the onomasiological profile 3. Measure uniformity 4. Aggregate over all concepts 5. Weigh the uniformity measure by means of the concept frequency = less frequent concepts weigh less

3. Method - Uniformity measure Measure lexical convergence: concrete U W U vrachtwagen 87% 32% 28% metro 97% 8% 7% trein 99% 46% 45% stoep 61% 10% 6% bestelwagen 45% 4% 2% mean 78% sum 89%

3. Method - Data Data gathering Four steps Select profiles Lemmatised corpora Script frequencies and concordances Manual disambiguation

Profiles traffic 3. Method - Data

Profiles IT 3. Method - Data

Profiles emotions 3. Method - Data

3. Method - Data A sample of annotated text by Alpino [ Na, prep, na, 3849, 0], [ een, det, een, 3849, 1], [ jaar, noun, jaar, 3849, 2], [ -, punct, -, 3849, 3], [ ik, pron, ik, 3849, 4], [ deed, verb, doe, 3849, 5], [ alles, noun, alles, 3849, 6], [,, punct,,, 3849, 7], [ van, prep, van, 3849, 8], [ was, noun, was, 3849, 9], [ ophangen, verb, hang op, 3849, 10], [ tot, prep, tot, 3849, 11], [ de, det, de, 3849, 12], [ beenhouwerij, noun, beenhouwerij, 3849, 13], [ kuisen, verb, kuis, 3849, 14], [ -, punct, -, 3849, 15], [ vroeg, verb, vraag, 3849, 16], [ ik, pron, ik, 3849, 17], [ of, comp, of, 3849, 18], [ ik, pron, ik, 3849, 19], [ meer, det, meer, 3849, 20], [ kon, verb, kan, 3849, 21], [ verdienen, verb, verdien, 3849, 22], [., punct,., 3849, 23], [ De, det, de, 3850, 0], [ slager, noun, slager, 3850, 1], [ wees, verb, wijs af, 3850, 2], [ dat, det, dat, 3850, 3], [ verzoek, noun, verzoek, 3850, 4], [ af, part, af, 3850, 5], [., punct,., 3850, 6], [ Ik, pron, ik, 3851, 1], [ pakte, verb, pak, 3851, 2], [ m n, det, mijn, 3851, 3], [ velo, noun, velo, 3851, 4], [ en, vg, en, 3851, 5], [ ik, pron, ik, 3851, 6], [ was, verb, ben, 3851, 7], [ weg, adv, weg, 3851, 8], [., punct,., 3851, 9] Na een jaar ik deed alles, van was ophangen tot de beenhouwerij kuisen vroeg ik of ik meer kon verdienen. De slager wees dat verzoek af. Ik pakte m n velo en ik was weg.

3. Method - Data Manual desambiguation Stevaert vindt de sluiting van de kleine ring rondom Antwerpen een topprioriteit. relevant meaning Al draagt een aap een gouden ring, het is en blijft een lelijk ding. wrong meaning: jewellery Wat is de kick? Als ik in de ring oog in oog sta met mijn tegenstander. wrong meaning: boxing ring

Profiles 80 concepts 2 parts of speech 3 lexical fields 3. Method - Data traffic: vrachtwagen ( truck ) IT: gsm ( cell phone ) emotions: verlegen ( shy ), oprechtheid ( sincerity ) Corpora newspaper Usenet Belgium 373M 18M the Netherlands 161M 24M

4. Results Hypotheses 1. Diachronic convergence = word choice in Belgium and the Netherlands becomes more uniform 2. Synchronic stratification = word choice in the Netherlands is more uniform than in Belgium = word choice in newspapers is more uniform than in Usenet 3. Role of the semantic field 4. Role of the concept frequency

4. Results Hypothesis 1: Diachronic convergence Study : Geeraerts, Grondelaers & Speelman 1999 Data : clothing terms in fashion magazines U (B50,N50) 70% U (B70,N70) 75% U (B90,N90) 82% Study : De Cnodder 2013 Data : clothing terms in fashion magazines U (B12,N12) 78%

4. Results Hypothesis 1: Diachronic convergence Study : Geeraerts, Grondelaers & Speelman 1999 Data : football terms in magazines in 1990 U (B50,N50) 66% U (B70,N70) 72% U (B90,N90) 77%

4. Results Hypothesis 1: Diachronic convergence Study : Grondelaers, Van Aken, Speelman & Geeraerts 2001 Data : clothing terms in newspapers in 1998 U (B58,N58) 71% U (B78,N78) 64% U (B98,N98) 82% Study : Grondelaers, Van Aken, Speelman & Geeraerts 2001 Data : prepositions in newspapers in 1998 U (B58,N58) 59% U (B78,N78) 62% U (B98,N98) 67%

4. Results Hypothesis 1: Diachronic convergence Convergence for the supraregional/formal language use 50 70 90 / 58 78 98 Shift in Belgian Dutch, which converges towards Netherlandic Dutch! Situation in 2012: stagnation or divergence? Informalisation of Netherlandic Dutch?! Balanced composition of the corpus! Content words vs. function words

4. Results Hypothesis 2: Synchronic stratification Study : Geeraerts, Grondelaers & Speelman 1999 Data : clothing terms in fashion magazines and shop windows in 1990 U (Bmag,B etal ) 50% U (Nmag,N etal ) 69% Study : De Cnodder 2013 Data : clothing terms in fashion magazines and shop windows in 2012 U (Bmag,B etal ) 63% U (Nmag,N etal ) 79%

U (Bkrant,B Usenet ) 82% U (Bkrant,B IRC ) 75% U (Nkrant,N Usenet ) 97% U (N,N ) 94% 4. Results Hypothesis 2: Synchronic stratification Study : Grondelaers, Van Aken, Speelman & Geeraerts 2001 Data : clothing terms in newspapers, Usenet and IRC in 1998 U (Bkrant,B Usenet ) 90% U (Bkrant,B IRC ) 83% U (Nkrant,N Usenet ) 92% U (Nkrant,N IRC ) 90% Study : Grondelaers, Van Aken, Speelman & Geeraerts 2001 Data : prepositions in newspapers, Usenet and IRC in 1998

4. Results Hypothesis 2: Synchronic stratification Study : Daems, Heylen & Geeraerts (In prep.) Data : traffic terms in newspapers and Usenet in 1999-2005 U (Bkrant,B Usenet ) 87% U (Nkrant,N Usenet ) 86% Study : Daems, Heylen & Geeraerts (In prep.) Data : IT terms in newspapers and Usenet in 1999-2005 U (Bkrant,B Usenet ) 72% mailinglist, provider, database U (Nkrant,N Usenet ) 70% mailinglist, provider, password

4. Results Hypothesis 2: Synchronic stratification Uniformity is lower in Belgian Dutch In Belgian Dutch there is a continuum from regional/informal (IRC) to supraregional/formal (newspapers) language use in which Usenet takes an intermediary position! Importance semantic field

4. Results Hypothesis 2: Synchronic stratification (bis) Studies : Geeraerts, Grondelaers & Speelman 1999 Data : clothing terms in fashion magazines in 1990 and 2012, newspapers, Usenet and IRC in 1998 U (Bmag,90,N mag,90 ) 82% U (Bmag,12,N mag,12 ) 78% U (Bkrant,N krant ) 82% U (BUsenet,N Usenet ) 73% U (BIRC,N IRC ) 75%

4. Results Hypothesis 2: Synchronic stratification (bis) Study : Daems, Heylen & Geeraerts (In prep.) Data : traffic terms in newspapers and Usenet in 1999-2005 U (Bkrant,N krant ) 78% U (BUsenet,N Usenet ) 72% bus, trein, vrachtwagen Study : Daems, Heylen & Geeraerts (In prep.) Data : IT terms in newspapers and Usenet in 1999-2005 U (Bkrant,N krant ) 74% U (BUsenet,N Usenet ) 91% website, link, provider

4. Results Hypothesis 2: Synchronic stratification (bis) On a supraregional level the standard variant is more uniform than the informal variant(s)! Importance of the semantic field

4. Results Hypothesis 3: Role of the semantic field Studies : Geeraerts, Grondelaers & Speelman 1999, Grondelaers, Van Aken, Speelman & Geeraerts 2001 Data : football terms in magazines in 1990, prepositions and clothing terms in newspapers in 1998 U (Bkleding,N kleding ) 82% U (voorzetsel,n voorzetsel ) 67% U (Bvoetbal,N voetbal ) 77%

4. Results Hypothesis 3: Role of the semantic field Study : Daems, Heylen & Geeraerts (In prep.) Data : traffic, IT and emotion terms in newspapers in 1999-2005 U (Bverkeer,N verkeer ) 78% U (BICT,N ICT ) 74% U (Bemotie,noun,N emotie,noun ) 81% U (Bemotie,adj,N emotie,adj ) 79%

4. Results Hypothesis 3: Role of the semantic field Age of the field, degree of supraregional contact, register! Statistic significance: requires enough data

4. Results Hypothesis 4: Role of the concept frequency Study : Daems, Heylen & Geeraerts (In prep.) Data : traffic, IT and emotion terms in newspapers in 1999-2005 non-weighted (U) weighted (U ) (B verkeer,n verkeer ) 70% 78% (B ICT,N ICT ) 79% 74% (B emotie,noun,n emotie,noun ) 83% 81% (B emotie,adj,n emotie,adj ) 78% 79%

4. Results Hypothesis 4: Role of the concept frequency Study : Daems, Heylen & Geeraerts (In prep.) Data : traffic, IT and emotion terms in newspapers in 1999-2005 non-weighted (U) weighted (U ) (B verkeer,n verkeer ) 70% 78% (B ICT,N ICT ) 79% 74% traffic 70% 78% < bus, fiets, trein, vrachtwagen IT 79% 74% < mobiele telefoon, e(-)mail

4. Results Hypothesis 4: Role of the concept frequency Know your data qualitatively

5. Conclusion Confirmation Diachronic convergence Synchronic stratification Attention! Importance of the semantic field! Importance of the concept frequency

5. Conclusion Future Semasiological perspective Correlation between onomasiological and semasiological perspective Automation: techniques from distributional semantics Vector Space Models (VSMs) Type-based VSMs Token-based VSMs

5. Conclusion Future - Semasiological perspective Onomasiology: given a concept - how is it named? Semasiology: given a word - what does it mean? Bv. autosnelweg vs. ring ring sieraad boksring ringweg jaarring kring

5. Conclusion Future - Semasiological perspective Case study traffic: Onomasiologically a higher uniformity between Belgian Dutch and Netherlandic Dutch than semasiologically, in other words, we rather name things the same than that words have the same meaning.

5. Conclusion Future - Distributional semantics - Vector Space Models You shall know a word by the company it keeps (Firth) Type-based onomasiology: detect synonyms Token-based semasiology: detect senses

5. Conclusion

5. Conclusion

Thank you! Questions? Further information? http://wwwling.arts.kuleuven.be/qlvl jocelyne.daems@arts.kuleuven.be

References 1/2 Bouma, Gosse, Gertjan van Noord & Rob Malouf. 2001 Alpino: wide-coverage computational analysis of Dutch. In: Walter Daelemans, et al. (eds). Computational Linguistics in the Netherlands 2000. Amsterdam: Rodopi, 45-59. Clyne, Michael. 1992. Pluricentric languages: differing norms in different nations. Berlin/New York: Mouton de Gruyter. De Cnodder, Tine. 2013. Convergentie en divergentie in de Nederlandse woordenschat anno 2012. Unpublished MA thesis. KU Leuven. Geeraerts, Dirk, Stefan Grondelaers & Dirk Speelman. 1999. Convergentie en divergentie in de Nederlandse woordenschat: een onderzoek naar kleding- en voetbaltermen. Amsterdam: P.J. Meertens-Instituut. Grondelaers, Stefan, Dirk Geeraerts, Dirk Speelman and José Tummers. 2001. Lexical standardisation in internet conversations. Comparing Belgium and The Netherlands. In: Fontana, McNally, Turell and Enric Vallduví (eds.), Proceedings of the First International Conference on Language Variation in Europe, 90 100. Barcelona: Universitat Pompeu Fabra, Institut Universitari de Lingüística Aplicada, Unitat de Investigació de Variació Lingüística.

References 2/2 Grondelaers, Stefan, Hilde Van Aken, Dirk Speelman & Dirk Geeraerts. 2001. Inhoudswoorden en preposities als standaardiseringsindicatoren. De diachrone en synchronic status van het Belgische Nederlands. Nederlandse Taalkunde 6: 179 202. Heylen, Kris, Dirk Speelman & Dirk Geeraerts. 2012. Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets. In Miriam Butt, Sheelagh Carpendale, Gerald Penn, Jelena Prokic & Michael Cysouw (eds.), Proceedings of the EACL-2012 joint workshop of LINGVIS & UNCLH: Visualization of Language Patters and Uncovering Language History from Multilingual Resources, 1624. Stroudsburg: Association for Computational Linguistics. Speelman, Dirk, Stefan Grondelaers & Dirk Geeraerts. 2003. Profile-based linguistic uniformity as a generic method for comparing language varieties. Computers and the Humanities 37 (3): 317-337. Wielfaert, Thomas, Kris Heylen & Dirk Speelman. 2013. Visualisations interactives des espaces vectoriels sémantiques pour l analyse lexicologique. Actes de SemDis 2013 : Enjeux actuels de la sémantique distributionnelle:154-166.