Crossing Corpora. Modelling Semantic Similarity across Languages and Lects.



Yves Peirsman
Supervisors: Dirk Geeraerts & Dirk Speelman
Quantitative Lexicology and Variational Linguistics

Background
  Corpus-based investigation of lexical variation

  Variants for the concept "jeans":
        jeans   jeansbroek   spijkerbroek
  BE    35%     45%          20%
  NL    20%     15%          65%

  Problem: semantic equivalence
  Solution: distributional models of semantic similarity
  Extension from one corpus to two corpora
    Application to two corpora from the same language: variational linguistics
    Application to two corpora from different languages: cross-language knowledge induction

Outline
  Distributional Models
  Semantic Similarity across two Lects
  Semantic Similarity across two Languages
  Conclusions and Outlook

Distributional Models
  Distributional hypothesis
    Semantically similar words appear in similar contexts (Wittgenstein, Firth, Harris)
  Recipe
    Ingredients: 1 large corpus, 1 definition of context, target words
    1. Search the corpus for the contexts of your target words
    2. Calculate the contextual similarity between each pair of target words
    3. Find the most contextually similar target words
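The recipe can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model used in the talk: the toy corpus, the window of 2 words, the raw co-occurrence counts and the function names (context_vectors, cosine, nearest_neighbours) are all invented for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, targets, window=2):
    """Step 1: count the words within `window` positions of each target word."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            if word not in targets:
                continue
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Step 2: contextual similarity between two count vectors."""
    dot = sum(count * v[feature] for feature, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbours(word, vectors):
    """Step 3: rank the other target words by similarity to `word`."""
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)


# Hypothetical toy corpus; a real model needs a large corpus.
toy_corpus = [
    "the rich lawyer won the case".split(),
    "the rich attorney won the case".split(),
    "the poor linguist wrote a paper".split(),
]
vectors = context_vectors(toy_corpus, {"lawyer", "attorney", "linguist"})
print(nearest_neighbours("lawyer", vectors))  # ['attorney', 'linguist']
```

With so little data the ranking is driven by a handful of counts, which is the data sparseness problem in miniature.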

Distributional Models
[Figure: built up over several slides. The target words linguist, lawyer and attorney are plotted in a space whose dimensions are the context words poor and rich; lawyer is shown with the example counts 2 and 5, and attorney lies close to lawyer.]

Distributional Models: Definition of context
  context words: rich, poor
    several context sizes
  syntactic relations: subject of plead, object of eat
  documents or paragraphs
  The definition of context influences the type of semantic relation that is modelled.
  3 evaluation exercises
    human similarity judgements
    EuroWordNet
    free associations

Distributional Models: Human similarity judgements

  English pair            score    Dutch pair                   score
  autograph, shore         0.06    piloot, fornuis               1.48
  monk, slave              0.57    sla, reiger                   2.28
  glass, jewel             1.78    konijn, vlieg                 8.79
  bird, crane              2.63    grasmachine, stofzuiger      10.59
  cemetery, graveyard      3.88    goudvis, haring              14.45
  gem, jewel               3.94    clementine, sinaasappel      17.90

  Data
    Dutch: 250M words of Belgian Dutch newspaper language
    English: BNC

Distributional Models: Human similarity judgements
[Figure: two panels, english and dutch. x-axis: models S, C2, C4, C6, C8, C10; y-axis from 0.0 to 0.8.]

Distributional Models: EuroWordNet
[Figure: number of nearest neighbours (0 to 400) per EuroWordNet relation (cohyponym, hypernym, hyponym, synonym) for each word space model: syn, 1 to 10, doc.]

Distributional Models: Free associations
  strawberry -> red, wave -> sea, pepper -> salt
[Figure: median rank of the most frequent association (2 to 100) against context size (2 to 10). Series: word based (PMI, log likelihood statistic), compound based, syntax based, document based.]
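The figure legend names PMI as one weighting for the word-based models. As a reminder of what that weighting does, here is a small sketch of pointwise mutual information over a co-occurrence table. The counts are invented and the function name pmi_weights is hypothetical; nothing here reproduces the settings behind the figure.

```python
from math import log2


def pmi_weights(counts):
    """Turn {target: {context: frequency}} into {target: {context: PMI}}.

    PMI(t, c) = log2( P(t, c) / (P(t) * P(c)) ), estimated from the table.
    """
    total = sum(f for row in counts.values() for f in row.values())
    target_totals = {t: sum(row.values()) for t, row in counts.items()}
    context_totals = {}
    for row in counts.values():
        for context, f in row.items():
            context_totals[context] = context_totals.get(context, 0) + f
    return {
        t: {
            c: log2(f * total / (target_totals[t] * context_totals[c]))
            for c, f in row.items()
            if f > 0
        }
        for t, row in counts.items()
    }


# Hypothetical counts: "the" is frequent everywhere, "red" is specific to strawberry.
counts = {
    "strawberry": {"red": 8, "the": 12},
    "wave": {"sea": 6, "the": 14},
}
weights = pmi_weights(counts)
print(weights["strawberry"])
```

The specific context word red receives a positive weight, while the uninformative the drops below zero, which is why such weighting is preferred over raw counts.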

Distributional Models: Conclusions
  Semantic similarity: syntax-based models and word-based models with small context sizes
  Semantic relatedness: document-based models
  Problem of data sparseness
  Modelling semantic similarity across languages and lects: word-based models with small context sizes

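Cross-lectal similarity
[Figure: built up over two slides. Words from one variety (lorry, taxi, motorbike, car) share a space with words from another (truck, cab, motorcycle), so that cross-lectal synonyms such as lorry and truck lie close together.]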

Cross-lectal similarity
  Extension of semantic similarity to two corpora
  Requirement: context features are identical between the corpora
  Bilectal models can help us
    recognize markers of a specific language variety
    extract the synonym from another language variety
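A rough sketch of the bilectal set-up: one set of vectors per corpus, compared only on the context features that both corpora share. The counts are invented, and the names cosine and cross_lectal_neighbours are hypothetical; the sketch assumes vectors are plain dictionaries of co-occurrence counts.

```python
from math import sqrt


def cosine(u, v, features):
    """Cosine similarity restricted to the given set of context features."""
    dot = sum(u.get(f, 0) * v.get(f, 0) for f in features)
    norm_u = sqrt(sum(u.get(f, 0) ** 2 for f in features))
    norm_v = sqrt(sum(v.get(f, 0) ** 2 for f in features))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_lectal_neighbours(word, vectors_a, vectors_b, k=3):
    """Rank the words of corpus B by similarity to `word` from corpus A."""
    features_a = {f for vec in vectors_a.values() for f in vec}
    features_b = {f for vec in vectors_b.values() for f in vec}
    shared = features_a & features_b  # the requirement: identical context features
    source = vectors_a[word]
    ranked = sorted(
        vectors_b,
        key=lambda w: cosine(source, vectors_b[w], shared),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical counts for a Belgian (be) and a Netherlandic (nl) corpus.
be = {"schepen": {"stad": 9, "beslissen": 4, "gemeente": 5}}
nl = {
    "wethouder": {"stad": 8, "beslissen": 5, "gemeente": 4},
    "minister": {"stad": 1, "beslissen": 9, "regering": 7},
}
print(cross_lectal_neighbours("schepen", be, nl, k=1))  # ['wethouder']
```

Because both corpora are Dutch, most context words coincide, so the shared feature set stays large; this is what makes the same-language case the easier one.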

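Cross-lectal similarity
[Figure: bilectal vector space with the shared context features stad/stad and beslissen/beslissen as dimensions; plotted words: schepen, wethouder, and burgemeester and minister once for each variety.]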

Cross-lectal similarity: Lectal markers
  Traditional frequency-based definition of keywords
    different frequencies -> lectal markers?
    lectal markers -> different frequencies?
  New distributional definition
  Data
    250M words of newspaper language from BE and NL
    Referentiebestand Belgisch Nederlands (RBBN): list of 4,000 markers of Belgian Dutch with their Netherlandic Dutch alternative

Cross-lectal similarity

  alternation     BE            NL            meaning
  free            ajuin         ui            onion
                  aftrekker     flesopener    bottle opener
  unique          doctorandus   promovendus   PhD student
                  schepen       wethouder     alderman
  unlexicalised   dodentol                    number of victims
                  vieruurtje                  afternoon snack
  informal        refter        eetzaal       canteen
                  schoon        mooi          beautiful
  restrictions    schacht                     first-year student
                  kaka          poep          poop
  cultural        CM                          Belgian health fund
                  rijkswacht                  former national police

Cross-lectal similarity: Lectal markers
  Look for words with an unexpectedly low cross-lectal distributional similarity
  Method finds more RBBN words than traditional keyword methods
[Figure: precision (0.00 to 0.30) of four methods: log likelihood, chi square, stable lexical markers, cosine.]
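The distributional definition can be illustrated as follows: compare each word's vector in the Belgian corpus with the vector of the same word in the Netherlandic corpus, and list the words with the lowest similarity first. This sketch simplifies "unexpectedly low" to "lowest raw cosine"; the counts are invented and marker_candidates is a hypothetical name.

```python
from math import sqrt


def cosine(u, v):
    dot = sum(n * v.get(f, 0) for f, n in u.items())
    norm_u = sqrt(sum(n * n for n in u.values()))
    norm_v = sqrt(sum(n * n for n in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def marker_candidates(vectors_be, vectors_nl):
    """Words occurring in both corpora, least cross-lectally similar first."""
    shared_words = vectors_be.keys() & vectors_nl.keys()
    scored = [(cosine(vectors_be[w], vectors_nl[w]), w) for w in shared_words]
    return [word for _, word in sorted(scored)]


# Hypothetical counts: schoon keeps different company in the two varieties
# ("beautiful" in BE, "clean" in NL), minister does not.
be = {
    "schoon": {"meisje": 6, "kleed": 4, "water": 1},
    "minister": {"regering": 9, "beslissen": 5},
}
nl = {
    "schoon": {"water": 7, "handen": 5, "meisje": 1},
    "minister": {"regering": 8, "beslissen": 6},
}
print(marker_candidates(be, nl))  # ['schoon', 'minister']
```

Unlike a frequency-based keyword test, this ranking can flag a word such as schoon even when it is equally frequent in both corpora.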

Cross-lectal similarity: Cross-lectal synonyms
  Extract nearest Netherlandic neighbours to RBBN words.
  More than 20% of the 1st nearest neighbours are RBBN synonyms.
  More than 50% of the RBBN synonyms are in the top 15 neighbours.
[Figure: Belgian-Netherlandic alternatives. Accuracy (0.0 to 0.7) against number of nearest neighbours (1 to 100) for context sizes 2 to 6 and syntax.]
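Figures such as "in the top 15 neighbours" correspond to an accuracy-at-k measure. A minimal sketch follows, assuming a gold list of marker and alternative pairs and a ranked neighbour list per marker. The gold pairs are taken from the example slides; the neighbour lists for confituur and fier are invented, and accuracy_at_k is a hypothetical name.

```python
def accuracy_at_k(neighbour_lists, gold, k):
    """Share of markers whose gold alternative is among the top k neighbours."""
    hits = sum(
        1
        for marker, alternative in gold.items()
        if alternative in neighbour_lists.get(marker, [])[:k]
    )
    return hits / len(gold)


gold = {"confituur": "jam", "fier": "trots", "bal": "frank", "flessen": "zakken"}
neighbour_lists = {
    "confituur": ["jam", "marmelade", "honing"],         # invented list
    "fier": ["blij", "trots", "tevreden"],               # invented list
    "bal": ["bal", "schot", "rust"],                     # as on the slide
    "flessen": ["aanlengen", "hergebruiken", "recycle"], # as on the slide
}
print(accuracy_at_k(neighbour_lists, gold, k=1))  # 0.25
print(accuracy_at_k(neighbour_lists, gold, k=3))  # 0.5
```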

Cross-lectal similarity: Examples of correct pairs

  Belgian marker   Netherlandic alternative   meaning
  bankbriefje      bankbiljet                 bank note
  confituur        jam                        jam
  fier             trots                      proud
  gamma            assortiment                assortment
  job              baan                       job
  kot              studentenkamer             student room
  living           woonkamer                  living room
  microgolf        magnetron                  microwave oven
  proper           schoon                     clean
  uitbaten         exploiteren                exploit

Cross-lectal similarity: Problem group 1: polysemous words

  Belgian marker         alternative          nearest neighbours
  katholiek (catholic)   goed (good)          katholiek (catholic), rooms-katholiek (roman catholic), protestants (protestant)
  stelling (position)    steiger (scaffold)   stelling (position), standpunt (position), opvatting (opinion)
  tenor (tenor)          topper (steersman)   tenor (tenor), countertenor (counter tenor), bariton (baritone)

Cross-lectal similarity: Problem group 2: colloquial words

  Belgian marker     alternative                nearest neighbours
  bal (ball/franc)   frank (franc)              bal (ball), schot (shot), rust (rest)
  flessen (fail)     zakken                     aanlengen (dilute), hergebruiken (reuse), recycle (recycle)
  boerenbuiten       platteland (countryside)   vrouwenhuis (women's house), buitenlieden (countrymen), huisschrijver (resident author)

Cross-lectal similarity: Discussion
  Distributional approaches can model lexical variation
  Tool for variational-linguistic research
  Polysemy is a major problem
  Future work
    Application to other languages, other genres
    Addressing polysemy

Cross-lingual similarity
  Application of distributional models to two corpora can be extended to two languages
  Condition of overlap in context features
    Many languages share a number of words (cognates, loans)
    These words generally have a similar meaning.

Cross-lingual similarity: Method
  1. Identify the shared words between the two languages.
     e.g. manager, school, kind
  2. Use these words as corresponding context words to find a small number of reliable translations.
     e.g. auto -> car, boek -> book, kind -> child
  3. Add these newly found translations to the set of corresponding context words.
  4. Repeat.
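The four steps amount to a bootstrapping loop, sketched below. This is an illustration under strong assumptions rather than the procedure behind the reported results: the Dutch and English counts are invented, only one translation is accepted per round, cosine over raw counts stands in for whatever similarity measure was used, and the name bootstrap is hypothetical.

```python
from math import sqrt


def cosine(u, v):
    dot = sum(n * v.get(f, 0) for f, n in u.items())
    norm_u = sqrt(sum(n * n for n in u.values()))
    norm_v = sqrt(sum(n * n for n in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bootstrap(src_vectors, tgt_vectors, seed, rounds=3, per_round=1):
    """Grow a translation dictionary from a seed of shared words."""
    dictionary = dict(seed)  # step 1: shared words map onto themselves
    for _ in range(rounds):  # step 4: repeat
        candidates = []
        for src_word, src_vec in src_vectors.items():
            if src_word in dictionary:
                continue
            # Step 2: keep only context words that already have a translation.
            mapped = {dictionary[c]: n for c, n in src_vec.items() if c in dictionary}
            score, best = max((cosine(mapped, vec), tgt) for tgt, vec in tgt_vectors.items())
            if score > 0:
                candidates.append((score, src_word, best))
        # Step 3: add only the most reliable new translations.
        for _, src_word, tgt_word in sorted(candidates, reverse=True)[:per_round]:
            dictionary[src_word] = tgt_word
    return dictionary


# Hypothetical co-occurrence counts.
dutch = {
    "auto": {"manager": 4, "school": 1, "weg": 6},
    "boek": {"school": 5, "film": 3},
    "weg": {"auto": 6, "manager": 1},
}
english = {
    "car": {"manager": 4, "school": 1, "road": 6},
    "book": {"school": 5, "film": 3},
    "road": {"car": 6, "manager": 1},
}
seed = {"manager": "manager", "school": "school", "film": "film"}
print(bootstrap(dutch, english, seed))
```

In this toy run weg would initially be matched with car, because only manager is available as a shared context word. Once auto -> car has been added in the second round, the context auto becomes usable and weg is matched with road, which is the kind of gain the iteration is meant to deliver.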

Cross-lingual similarity
[Figure: bilingual vector space with the context word pairs kiezer/voter and minister/minister; plotted words: verkiezing, election, wet, law.]

Cross-lingual similarity: Experiments
  Experiments for Dutch-English, Dutch-German, German-English, English-Spanish
  Additional stemming for English-Spanish

  Final results
                    n      noun   adj    verb   total
  Dutch-German      3185   .710   .698   .659   .696
  German-Dutch      3137   .732   .691   .633   .700
  Dutch-English     3954   .559   .535   .359   .507
  English-Dutch     3815   .582   .559   .339   .519
  English-German    8042   .487   .538   .368   .467
  German-English    7340   .562   .525   .427   .525
  English-Spanish   6287   .423   .473   .268   .400
  Spanish-English   5447   .491   .532   .327   .466

Cross-lingual similarity
  Accuracy improves through the process
[Figure: first translation accuracy (0.0 to 0.6) against iteration (0 to 5) for German-Dutch, Dutch-German, English-German, German-English, English-Dutch, Dutch-English, Spanish-English, English-Spanish.]

Cross-lingual similarity: Examples
  (nearest neighbours in order; = marks a correct translation)

  stad           = city, town, area
  roman          = novel, story, poem
  aap            animal, elephant, = monkey
  vrouw          = woman, man, child
  bedreigen      = threaten, protect, attack
  leren          = learn, = teach, know
  veronderstel   imply, regard, = assume
  wemel          ? translate, abound, crowd

Cross-lingual similarity: Applications
  Identification of false friends
  Bilingual knowledge induction
[Figure: a German triple (schießen,obj,hirsch) from a monolingual selectional preference model is linked through the bilingual vector space (schießen -> shoot, Hirsch -> deer) to the English triple (shoot,obj,deer) of a second monolingual selectional preference model.]
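One way to read the diagram is that a triple in one language is translated word by word and then looked up in the other language's selectional preference model. The sketch below assumes exactly that reading. The translation table stands in for the bilingual vector space, the English counts are invented, and transfer_triple and transferred_score are hypothetical names.

```python
def transfer_triple(triple, translations):
    """Translate the verb and the noun of a (verb, relation, noun) triple."""
    verb, relation, noun = triple
    return (translations[verb], relation, translations[noun])


def transferred_score(triple, translations, target_model):
    """Score a source-language triple with the target-language model."""
    return target_model.get(transfer_triple(triple, translations), 0)


# Hypothetical stand-in for translations induced from the bilingual space.
translations = {"schießen": "shoot", "hirsch": "deer", "jäger": "hunter"}
# Hypothetical English selectional preference counts.
english_model = {("shoot", "obj", "deer"): 12, ("shoot", "obj", "hunter"): 1}

german_triple = ("schießen", "obj", "hirsch")
print(transfer_triple(german_triple, translations))  # ('shoot', 'obj', 'deer')
print(transferred_score(german_triple, translations, english_model))  # 12
```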

Cross-lingual similarity
[Figure: example words hunter, murderer, rabbit, peacock, deer, duck.]

Conclusions and Outlook
  Extension of semantic similarity to two corpora
  Application to lexical variation
    Identification of lectal markers
    Identification of cross-lectal synonyms
  Application to bilingual knowledge induction
    Basic translation of words between languages
    Generalizing knowledge from one language to the other

Conclusions and Outlook: Polysemy
  Addressing polysemy involves a movement towards individual contexts
  Looking for a clustering of contexts that corresponds to the sense structure of a word
[Figure: built up over two slides. Contexts of stelling (BE) fall into two groups, one with gebouw, renoveren and one with argument, verdedigen; the labels stelling (NL) and steiger (NL) are added on the second slide.]
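To make "clustering of contexts" concrete, the sketch below groups individual contexts of one word by lexical overlap. It is only an illustration of the idea, not a method proposed in the talk: the single-pass greedy procedure, the Jaccard threshold of 0.2, the invented contexts and the names jaccard and cluster_contexts are all assumptions.

```python
def jaccard(a, b):
    """Overlap between two sets of words."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_contexts(contexts, threshold=0.2):
    """Greedy single pass: join the first cluster that overlaps enough, else start a new one."""
    clusters = []
    for context in contexts:
        words = set(context)
        for cluster in clusters:
            if jaccard(words, cluster["words"]) >= threshold:
                cluster["members"].append(context)
                cluster["words"] |= words
                break
        else:
            clusters.append({"words": set(words), "members": [context]})
    return clusters


# Hypothetical individual contexts of Belgian Dutch "stelling".
contexts = [
    ["gebouw", "renoveren", "arbeider"],
    ["gebouw", "renoveren", "gevel"],
    ["argument", "verdedigen", "debat"],
    ["argument", "verdedigen", "professor"],
]
clusters = cluster_contexts(contexts)
for cluster in clusters:
    print(sorted(cluster["words"]))
```

Each resulting cluster can then be compared with Netherlandic words separately, so that a building-related cluster and an argument-related cluster need not share one nearest neighbour.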

For more information: http://wwwling.arts.kuleuven.be/qlvl yves.peirsman@arts.kuleuven.be