Tracking change in word meaning




A dynamic visualization of diachronic distributional semantics
Kris Heylen, Thomas Wielfaert & Dirk Speelman
KU Leuven, Quantitative Lexicology and Variational Linguistics

Purpose of the talk
A lexicological study of how a set of near-synonymous adjectives has changed meaning through time,
using a statistical, distributional approach for modelling lexical semantics in large corpora,
using a dynamic visualization to assist in interpreting these statistical patterns,
with the ultimate goal of creating an exploratory tool for lexical semantic analysis.

Overview
1. Background: Lexical variation
2. Distributional semantics
3. Previous Visualisations
4. Case Study: positive evaluative adjectives
5. Dynamic Visualisation of semantic change
6. Conclusion


Background: Lexical Variation
LEXICOLOGY:
  semasiological perspective
  onomasiological perspective
  finer-grained analysis of semantic features
  lectal variation
  chrono-lectal (diachronic) variation
  quantitative corpus analysis


Distributional models of lexical semantics
Linguistic origin: Distributional Hypothesis
  "You shall know a word by the company it keeps" (Firth)
  a word's meaning can be induced from its co-occurring words
  long tradition of collocation studies in corpus linguistics
Semantic Vector Spaces in Computational Linguistics
  standard technique in statistical NLP for the large-scale automatic modeling of (lexical) semantics
  aka Vector Space Models, Distributional Semantic Models, Word Spaces, ... (cf. Turney & Pantel 2010 for an overview)
  generalised, large-scale collocation analysis
  mainly used for automatic thesaurus extraction: words occurring in the same contexts have similar meanings

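Semantic Vector Spaces as models of word meaning
Practical: which two words out of a set of three have the same meaning?
  ongeval ('accident'), koffie ('coffee'), accident ('accident')
Occurrences in context from a corpus:
  Op de Brusselse ring deed zich een ongeval met een vrachtwagen voor
    (An accident involving a truck occurred on the Brussels ring road)
  's Morgens drinkt hij een kop koffie met melk en suiker
    (In the morning he drinks a cup of coffee with milk and sugar)
  2 bestuurders raakten gekwetst bij een ongeval met een vrachtwagen
    (2 drivers were injured in an accident involving a truck)
  in de avondspits veroorzaakte een accident een kilometerslange file
    (during the evening rush hour an accident caused a kilometres-long traffic jam)
  als vieruurtje serveert het hotel koffie en gebak voor de gasten
    (as an afternoon snack the hotel serves coffee and pastries for the guests)
  de auto was betrokken in een accident met een dodelijke afloop
    (the car was involved in an accident with a fatal outcome)
  Met winterbanden is het risico op een ongeval bij vriesweer veel kleiner
    (With winter tyres the risk of an accident in freezing weather is much smaller)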

Context vector for ongeval, built up one corpus occurrence at a time from an all-zero vector:
  ... vader raakte gekwetst bij een ongeval met een vrachtwagen op de ... (vrachtwagen +1, gekwetst +1)
  ... voor zeven uur veroorzaakte een ongeval een kilometerslange file richting Antwerpen (file +1)
  ... vrachtwagens waren betrokken bij het ongeval, dat meer dan tien slachtoffers ... (vrachtwagen +1, slachtoffer +1)

Counts after processing the whole corpus, with the vectors for accident and koffie built the same way:

           auto  slachtoffer  vrachtwagen  file  gekwetst  suiker  melk  kop
ongeval     120          424          388    82       270      11     3    1
accident    154          401          376    99       305      20     1    5
koffie        5            8           18     4         1      72   102   93

(context words: car, victim, truck, traffic jam, injured, sugar, milk, cup)

Which words are similar?
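The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it treats the whole sentence as the context (an assumption; the case study later uses a 5-word window), matches surface forms without lemmatisation, and uses only the seven example sentences from the slide, so the counts are tiny compared with the corpus totals in the table. The function name `count_contexts` is hypothetical.

```python
from collections import Counter

# Context words tracked in the slide's table (column order preserved).
CONTEXT_WORDS = ["auto", "slachtoffer", "vrachtwagen", "file",
                 "gekwetst", "suiker", "melk", "kop"]

def count_contexts(sentences, targets, context_words):
    """Count, per target word, how often each context word shares a sentence with it."""
    counts = {t: Counter() for t in targets}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for target in targets:
            if target in tokens:
                for tok in tokens:
                    if tok in context_words and tok != target:
                        counts[target][tok] += 1
    return counts

# The seven example sentences from the slide.
SENTENCES = [
    "Op de Brusselse ring deed zich een ongeval met een vrachtwagen voor",
    "'s Morgens drinkt hij een kop koffie met melk en suiker",
    "2 bestuurders raakten gekwetst bij een ongeval met een vrachtwagen",
    "in de avondspits veroorzaakte een accident een kilometerslange file",
    "als vieruurtje serveert het hotel koffie en gebak voor de gasten",
    "de auto was betrokken in een accident met een dodelijke afloop",
    "Met winterbanden is het risico op een ongeval bij vriesweer veel kleiner",
]

counts = count_contexts(SENTENCES, ["ongeval", "accident", "koffie"], CONTEXT_WORDS)
for word, ctx in counts.items():
    # One row per target word, in the table's column order.
    print(word, [ctx[c] for c in CONTEXT_WORDS])
```

Even on seven sentences, ongeval and accident only pick up traffic-related context words, while koffie only picks up suiker, melk and kop.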

Distributional models of lexical semantics
word by word similarity matrix

           ongeval  accident  koffie
ongeval          1       .91     .08
accident       .91         1     .17
koffie         .08       .17       1
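A word-by-word similarity matrix of this kind can be computed with cosine similarity, the measure the talk names for Word Spaces. The sketch below applies it to the raw co-occurrence counts from the table. The resulting values will not reproduce the slide's figures (.91, .08, .17), which presumably come from weighted vectors or are illustrative; only the ranking of the word pairs is expected to agree. The helper names are hypothetical.

```python
import math

# Final co-occurrence counts from the slide's table
# (columns: auto, slachtoffer, vrachtwagen, file, gekwetst, suiker, melk, kop).
VECTORS = {
    "ongeval":  [120, 424, 388, 82, 270, 11, 3, 1],
    "accident": [154, 401, 376, 99, 305, 20, 1, 5],
    "koffie":   [5, 8, 18, 4, 1, 72, 102, 93],
}

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Nested dict holding the cosine similarity for every pair of words."""
    words = list(vectors)
    return {w1: {w2: cosine(vectors[w1], vectors[w2]) for w2 in words}
            for w1 in words}

sims = similarity_matrix(VECTORS)
for w1, row in sims.items():
    print(f"{w1:<9}", " ".join(f"{row[w2]:.2f}" for w2 in sims))
```

On these raw counts ongeval and accident come out as near-identical, and koffie is far from both.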

Distributional models of lexical semantics
Geometrical metaphor: semantic distance
  frequencies weighted by collocational strength (PMI)
  vectors projected in context feature space: Word Space
  cosine of the angle between vectors as semantic similarity measure
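The weighting step can be illustrated as follows. Pointwise mutual information compares the observed co-occurrence frequency of a word w and a context c with what independence would predict: PMI(w, c) = log2( f(w, c) * N / (f(w) * f(c)) ), where N is the sum of all cells in the matrix. The sketch below is one common variant, not necessarily the one used in the talk: zero counts are mapped to 0 and negative values are clipped to 0 (positive PMI), both of which are assumptions. It reuses the counts from the example table.

```python
import math

# Co-occurrence counts from the example table
# (columns: auto, slachtoffer, vrachtwagen, file, gekwetst, suiker, melk, kop).
COUNTS = {
    "ongeval":  [120, 424, 388, 82, 270, 11, 3, 1],
    "accident": [154, 401, 376, 99, 305, 20, 1, 5],
    "koffie":   [5, 8, 18, 4, 1, 72, 102, 93],
}

def ppmi_weight(counts):
    """Replace raw counts by positive pointwise mutual information."""
    words = list(counts)
    n_cols = len(next(iter(counts.values())))
    row_sums = {w: sum(counts[w]) for w in words}
    col_sums = [sum(counts[w][j] for w in words) for j in range(n_cols)]
    total = sum(row_sums.values())
    weighted = {}
    for w in words:
        row = []
        for j, freq in enumerate(counts[w]):
            if freq == 0:
                row.append(0.0)  # PMI is undefined for zero counts; use 0 (assumption)
                continue
            pmi = math.log2(freq * total / (row_sums[w] * col_sums[j]))
            row.append(max(pmi, 0.0))  # clip negative associations (assumption)
        weighted[w] = row
    return weighted

weighted = ppmi_weight(COUNTS)
for word, row in weighted.items():
    print(f"{word:<9}", " ".join(f"{x:.2f}" for x in row))
```

After weighting, koffie keeps strong values only for suiker, melk and kop, while the rare co-occurrences of ongeval with melk or kop drop to zero, so frequent but uninformative contexts no longer dominate the cosine.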

Distributional semantics: lexical variation
Bilectal Word Spaces
  extend Word Space from one corpus to two corpora, representative of different lects/varieties
  2 context vectors for each word, one for each variety
  most words will have themselves as most similar word...
  BUT words with diverging semantic structure will not
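The bilectal comparison can be sketched as a nearest-neighbour check across the two varieties: for each word's vector in lect A, find the most similar vector in lect B and flag the word if that neighbour is not the word itself. The vectors and word labels below are made up purely for illustration, and `diverging_words` is a hypothetical helper, not a function from the authors' toolkit.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def diverging_words(space_a, space_b):
    """Return (word, nearest_neighbour) pairs where a word's closest
    vector in the other lect belongs to a different word."""
    diverging = []
    for word, vec in space_a.items():
        if word not in space_b:
            continue  # only words attested in both lects can be compared
        nearest = max(space_b, key=lambda other: cosine(vec, space_b[other]))
        if nearest != word:
            diverging.append((word, nearest))
    return diverging

# Hypothetical toy vectors over three shared context dimensions.
LECT_A = {"stable_1": [5, 1, 0], "shifted": [0, 4, 1], "stable_2": [1, 0, 6]}
LECT_B = {"stable_1": [4, 1, 0], "shifted": [3, 0, 2], "stable_2": [0, 1, 7]}

print(diverging_words(LECT_A, LECT_B))
```

Both lects must share the same context dimensions for the cosine to be meaningful, which is why the two spaces are built over one common set of context words.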

Previous Visualisations

Sagi, Kaufmann & Clark 2009

Rohrdantz, Hautli, Mayer et al. 2011

Hilpert 2011


Case study: positive evaluative adjectives

brilliant   cool      delightful  excellent  fabulous     fantastic
good        great     impressive  lovely     magnificent  marvelous
perfect     splendid  superb      terrific   wonderful

Table: positive evaluative adjectives

Case study
Corpus
  Corpus of Historical American English (COHA, Davies 2012)
  period from 1810 to 2009, 400M words, POS-tagged
Concept: positive evaluative adjectives
  1 vector per adjective, per decade (1860-2009)
  modelled by a window of 5 words left & right
  5000 most frequent context words (minus top 100)
  PMI weighting, cosine similarity
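The vector construction described above (one vector per adjective per decade, a window of 5 words on either side, the most frequent context words minus the very top of the frequency list) might look roughly like this. It is a sketch under stated assumptions: context words are selected over the whole corpus rather than per decade, sentence boundaries and POS tags are ignored, and the two toy "decades" are made-up sentences, not COHA data. Function and variable names are hypothetical.

```python
from collections import Counter

def select_context_words(tokens, n_context=5000, n_skip=100):
    """Keep the n_context most frequent words after skipping the n_skip most frequent."""
    ranked = [w for w, _ in Counter(tokens).most_common(n_skip + n_context)]
    return ranked[n_skip:]

def window_vectors(tokens_by_decade, targets, context_words, window=5):
    """Build one co-occurrence Counter per (target, decade) pair."""
    context_set = set(context_words)
    vectors = {}
    for decade, tokens in tokens_by_decade.items():
        for i, tok in enumerate(tokens):
            if tok not in targets:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]  # exclude the target itself
            vec = vectors.setdefault((tok, decade), Counter())
            vec.update(w for w in neighbours if w in context_set)
    return vectors

# Made-up miniature corpus, for illustration only.
TOKENS_BY_DECADE = {
    1860: "a terrific explosion shook the town".split(),
    2000: "what a terrific performance by the band".split(),
}
all_tokens = [t for toks in TOKENS_BY_DECADE.values() for t in toks]

# With a real corpus the defaults (5000 context words, skip top 100) would apply;
# the toy corpus only has ten word types, so the three most frequent are skipped.
context_words = select_context_words(all_tokens, n_context=50, n_skip=3)
vectors = window_vectors(TOKENS_BY_DECADE, {"terrific"}, context_words)

for key in sorted(vectors, key=lambda k: k[1]):
    print(key, dict(vectors[key]))
```

The resulting counters would then be PMI-weighted and compared with cosine similarity, as listed on the slide.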


HighD to 2D Visualisation
  word-decade by context matrix is high-dimensional
  first aim is NOT to find latent structure (as with LSA/LDA) but a general picture of distributional semantic structuring
  faithful rendering of similarity matrix in 2D: Kruskal's non-metric Multidimensional Scaling
  interpret dimensions with context-labeled clusters
Dynamic and interactive chart
  Motion Charts from Google Chart Tools
  panchronic view to interpret semantic space
  diachronic view to see meaning changes
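What "faithful rendering" means for non-metric MDS can be made concrete with Kruskal's stress-1: the 2D distances only need to preserve the rank order of the dissimilarities, and stress measures how far they are from doing so. The sketch below does not perform the scaling itself; it only evaluates stress-1 for a given configuration, using a pool-adjacent-violators fit for the monotone regression. The dissimilarities are taken as 1 minus the similarities in the earlier example matrix (an assumption), and both sets of coordinates are hypothetical.

```python
import math

def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while a block's mean exceeds the mean of the next block.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def stress1(dissim, coords):
    """Kruskal's stress-1 of a low-dimensional configuration."""
    n = len(coords)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda p: dissim[p[0]][p[1]])       # rank pairs by dissimilarity
    dists = [math.dist(coords[i], coords[j]) for i, j in pairs]
    disparities = pava(dists)                           # monotone target distances
    num = sum((d - h) ** 2 for d, h in zip(dists, disparities))
    den = sum(d ** 2 for d in dists)
    return math.sqrt(num / den)

# Order: ongeval, accident, koffie; dissimilarity = 1 - similarity (assumption).
DISSIM = [[0.00, 0.09, 0.92],
          [0.09, 0.00, 0.83],
          [0.92, 0.83, 0.00]]

good = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0)]  # hypothetical: synonyms placed together
bad = [(0.0, 0.0), (1.0, 0.0), (0.1, 0.0)]   # hypothetical: koffie next to ongeval

print(round(stress1(DISSIM, good), 3), round(stress1(DISSIM, bad), 3))
```

A non-metric MDS routine searches for the coordinates that minimise this quantity; a configuration that respects the rank order of all dissimilarities has a stress of zero.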

Panchronic view for interpretation of semantic space
Clusters with the most typical context words of the adjectives:
  cluster 2 (centre, light blue): positively evaluated things (colors, spectacle, performance); centre of the plot, expressing the core meaning of the adjectives
  cluster 8 (red, lower left): loud and frightening things (explosion, thunder, crash); periphery of the plot, expressing an unrelated meaning

Diachronic motion chart to see meaning change
Trajectory of terrific from 1860 to 2000, moving from the peripheral cluster of frightening things to the central cluster of positively evaluated things, indicative of its meaning change


Summary
Conclusion and future work
Lexicological perspective:
  tool for exploring lexical semantics and variation in large amounts of corpus data
  dynamic visualisation of evolving semantic structuring for a set of near-synonymous adjectives
Desiderata
  integrate with latent dimension finding techniques (cf. Rohrdantz et al.) for easier interpretation of semantic space
  show individual occurrences of lexemes (tokens) to explore semasiological structure of adjectives in each decade
  show interpretative beacons in the dynamic plot
  other types of context features (e.g. dependency relations)

For more information:
http://wwwling.arts.kuleuven.be/qlvl
kris.heylen@arts.kuleuven.be
thomas.wielfaert@arts.kuleuven.be
dirk.speelman@arts.kuleuven.be

References
Davies, Mark. 2010. Corpus of Historical American English (COHA): 400+ million words, 1810-2009.
Heylen, Kris, Speelman, Dirk, & Geeraerts, Dirk. 2012. Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets. Pages 16-24 of: Proceedings of the EACL-2012 joint workshop of LINGVIS & UNCLH: Visualization of Language Patterns and Uncovering Language History from Multilingual Resources.
Hilpert, Martin. 2011. Dynamic visualizations of language change: Motion charts on the basis of bivariate and multivariate data from diachronic corpora. International Journal of Corpus Linguistics, 16(4), 435-461.
Rohrdantz, Christian, Hautli, Annette, Mayer, Thomas, Butt, Miriam, Keim, Daniel A, & Plank, Frans. 2011. Towards Tracking Semantic Change by Visual Analytics. Pages 305-310 of: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics.
Sagi, Eyal, Kaufmann, Stefan, & Clark, Brady. 2009. Semantic Density Analysis: Comparing Word Meaning across Time and Phonetic Space. Pages 104-111 of: Proceedings of the Workshop on Geometrical Models of Natural Language Semantics. Athens, Greece: Association for Computational Linguistics.
Turney, Peter D., & Pantel, Patrick. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37(1), 141-188.