SI485i : NLP. Set 10 Lexical Relations. slides adapted from Dan Jurafsky and Bill MacCartney

Similar documents
Natural Language Processing. Part 4: lexical semantics

The study of words. Word Meaning. Lexical semantics. Synonymy. LING 130 Fall 2005 James Pustejovsky. ! What does a word mean?

Intro to Linguistics Semantics

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS

Building a Question Classifier for a TREC-Style Question Answering System

Visualizing WordNet Structure

The Lexicon. The Lexicon. The Lexicon. The Significance of the Lexicon. Australian? The Significance of the Lexicon 澳 大 利 亚 奥 地 利

2. SEMANTIC RELATIONS

A typology of ontology-based semantic measures

An Integrated Approach to Automatic Synonym Detection in Turkish Corpus

Comparing Ontology-based and Corpusbased Domain Annotations in WordNet.

An Efficient Database Design for IndoWordNet Development Using Hybrid Approach

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision

Package wordnet. January 6, 2016

The Open University s repository of research publications and other research outputs. PowerAqua: fishing the semantic web

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

Language Meaning and Use

Topics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

Semantic analysis of text and speech

Cross-lingual Synonymy Overlap

Words with Attitude. Jaap Kamps and Maarten Marx. Abstract. 1 Introduction

Performance Indicators-Language Arts Reading and Writing 3 rd Grade

SPELLING WORD #1: SENTENCE:

TERMINOGRAPHY and LEXICOGRAPHY What is the difference? Summary. Anja Drame TermNet

Word Sense Disambiguation as an Integer Linear Programming Problem

Words with Attitude. Abstract. 1 Introduction

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

Extracting user interests from search query logs: A clustering approach

COMPARISON OF PUBLIC SCHOOL ELEMENTARY CURRICULUM AND MONTESSORI ELEMENTARY CURRICULUM

Livingston Public Schools Scope and Sequence K 6 Grammar and Mechanics

ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU

Semantic Analysis of. Tag Similarity Measures in. Collaborative Tagging Systems

Clever Search: A WordNet Based Wrapper for Internet Search Engines

An empirical study of semantic similarity in WordNet and Word2Vec. A Thesis

Ling 201 Syntax 1. Jirka Hana April 10, 2006

PUSD High Frequency Word List

Semantic Similarity Match for Data Quality

Mr. Anker Tests Current Listing of Activities, December 2008

Academic Standards for Reading, Writing, Speaking, and Listening June 1, 2009 FINAL Elementary Standards Grades 3-8

English-Arabic Dictionary for Translators

Introduction to WordNet: An On-line Lexical Database. (Revised August 1993)

Going Faster: Testing The Web Application. By Adithya N. Analysis and Testing of Web Applications Filippo Ricca and Paolo Tonella

Proofreading and Editing:

AK + ASD Writing Grade Level Expectations For Grades 3-6

Clustering of Polysemic Words

Methods and Tools for Encoding the WordNet.Br Sentences, Concept Glosses, and Conceptual-Semantic Relations

Thesis Proposal Verb Semantics for Natural Language Understanding

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

This image cannot currently be displayed. Course Catalog. Language Arts Glynlyon, Inc.

Teaching Vocabulary to Young Learners (Linse, 2005, pp )

Interpreting areading Scaled Scores for Instruction

CS 6740 / INFO Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage

Natural Language Database Interface for the Community Based Monitoring System *

Click to edit Master title style

This image cannot currently be displayed. Course Catalog. Language Arts Glynlyon, Inc.

SPELLING DOES MATTER

Syntax: Phrases. 1. The phrase

Identifying Focus, Techniques and Domain of Scientific Papers

A. Schedule: Reading, problem set #2, midterm. B. Problem set #1: Aim to have this for you by Thursday (but it could be Tuesday)

Clustering Connectionist and Statistical Language Processing

1. Define and Know (D) 2. Recognize (R) 3. Apply automatically (A) Objectives What Students Need to Know. Standards (ACT Scoring Range) Resources

Focus: Reading Unit of Study: Research & Media Literary; Informational Text; Biographies and Autobiographies

Parent Help Booklet. Level 3

Fry Phrases Set 1. TeacherHelpForParents.com help for all areas of your child s education

Reading IV Grade Level 4

Identifying free text plagiarism based on semantic similarity

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang

Curriculum Catalog

McDougal Littell Bridges to Literature Level III. Alaska Reading and Writing Performance Standards Grade 8

DanNet From Dictionary to Wordnet

Universal. Event. Product. Computer. 1 warehouse.

CS 533: Natural Language. Word Prediction

Language Arts Literacy Areas of Focus: Grade 5

Opinion Mining Issues and Agreement Identification in Forum Texts

PREPOSITIONAL PHRASES

Department of Cognitive Sciences University of California, Irvine 1

Using Knowledge Extraction and Maintenance Techniques To Enhance Analytical Performance

King Midas & the Golden Touch

Language Arts Literacy Areas of Focus: Grade 6

Generating Query Substitutions

TESTING WHETHER THE TEMPERATURE OF A MAGNET WILL AFFECT HOW FAR ITS MAGNETIC FIELD IS

Self-Monitoring in Social Networks

NATIVE XML DATABASES vs. RELATIONAL DATABASES IN DEALING WITH XML DOCUMENTS

Free Construction of a Free Swedish Dictionary of Synonyms

Index. 344 Grammar and Language Workbook, Grade 8

Q1. (a) The graph shows the temperature inside a flat between 5 pm and 9 pm. The central heating was on at 5 pm.

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Building a Large Scale Lexical Ontology for Portuguese

Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries

Large-Scale PatternBased Information

A Framework for Ontology-Based Knowledge Management System

The Lois Project: Lexical Ontologies for Legal Information Sharing

Extraction of Hypernymy Information from Text

Describe the process of parallelization as it relates to problem solving.

Our Earth, Our Resources

SakkamMT White Paper

Prepositions. TLC/College of the Canyons. Prepared by Kim Haglund, M.Ed: TLC Coordinator

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

There And Back Again from Dictionary to Wordnet to Thesaurus and Vice Versa: How to Use and Reuse Dictionary Data in a Conceptual Dictionary

Problems with the current speling.org system

Transcription:

SI485i : NLP Set 10 Lexical Relations slides adapted from Dan Jurafsky and Bill MacCartney

Three levels of meaning 1. Lexical Semantics (words) 2. Sentential / Compositional / Formal Semantics 3. Discourse or Pragmatics meaning + context + world knowledge

The unit of meaning is a sense One word can have multiple meanings: Instead, a bank can hold the investments in a custodial account in the client s name. But as agriculture burgeons on the east bank, the river will shrink even more. A word sense is a representation of one aspect of the meaning of a word. bank here has two senses

Terminology Lexeme: a pairing of meaning and form Lemma: the word form that represents a lexeme Carpet is the lemma for carpets Dormir is the lemma for duermes The lemma bank has two senses: Financial insitution Soil wall next to water A sense is a discrete representation of one aspect of the meaning of a word

Relations between words/senses Homonymy Polysemy Synonymy Antonymy Hypernymy Hyponymy Meronymy

Homonymy Homonyms: lexemes that share a form, but unrelated meanings Examples: bat (wooden stick thing) vs bat (flying scary mammal) bank (financial institution) vs bank (riverside) Can be homophones, homographs, or both: Homophones: write and right, piece and peace Homographs: bass and bass

Homonymy, yikes! Homonymy causes problems for NLP applications: Text-to-Speech Information retrieval Machine Translation Speech recognition Why?

Polysemy Polysemy: when a single word has multiple related meanings (bank the building, bank the financial institution, bank the biological repository) Most non-rare words have multiple meanings

Polysemy 1. The bank was constructed in 1875 out of local red brick. 2. I withdrew the money from the bank. Are those the same meaning? We might define meaning 1 as: The building belonging to a financial institution And meaning 2: A financial institution

How do we know when a word has more than one sense? The zeugma test! Take two different uses of serve: Which flights serve breakfast? Does America West serve Philadelphia? Combine the two: Does United serve breakfast and San Jose? (BAD, TWO SENSES)

Synonyms Word that have the same meaning in some or all contexts. couch / sofa big / large automobile / car vomit / throw up water / H 2 0

Synonyms But there are few (or no) examples of perfect synonymy. Why should that be? Even if many aspects of meaning are identical Still may not preserve the acceptability based on notions of politeness, slang, register, genre, etc. Example: Big/large Brave/courageous Water and H 2 0

Antonyms Senses that are opposites with respect to one feature of their meaning Otherwise, they are very similar! dark / light short / long hot / cold up / down in / out

Hyponyms and Hypernyms Hyponym: the sense is a subclass of another sense car is a hyponym of vehicle dog is a hyponym of animal mango is a hyponym of fruit Hypernym: the sense is a superclass vehicle is a hypernym of car animal is a hypernym of dog fruit is a hypernym of mango hypernym vehicle fruit furniture mammal hyponym car mango chair dog

WordNet A hierarchically organized lexical database On-line thesaurus + aspects of a dictionary Versions for other languages are under development http://wordnetweb.princeton.edu/perl/webwn Category Unique Forms Noun 117,097 Verb 11,488 Adjective 22,141 Adverb 4,601

WordNet senses The set of near-synonyms for a WordNet sense is called a synset (synonym set) Example: chump as a noun to mean a person who is gullible and easy to take advantage of gloss: (a person who is gullible and easy to take advantage of) Each of these senses share this same gloss

WordNet Hypernym Chains

Word Similarity Synonymy is binary, on/off, they are synonyms or not We want a looser metric: word similarity Two words are more similar if they share more features of meaning

Why word similarity? Information retrieval Question answering Machine translation Natural language generation Language modeling Automatic essay grading Document clustering

Two classes of algorithms Thesaurus-based algorithms Based on whether words are nearby in Wordnet Distributional algorithms By comparing words based on their distributional context in corpora

Thesaurus-based word similarity Find words that are connected in the thesaurus Synonymy, hyponymy, etc. Glosses and example sentences Derivational relations and sentence frames Similarity vs Relatedness Related words could be related any way car, gasoline: related, but not similar car, bicycle: similar

Path-based similarity Idea: two words are similar if they re nearby in the thesaurus hierarchy (i.e., short path between them)

Tweaks to path-based similarity pathlen(c 1, c 2 ) = number of edges in the shortest path in the thesaurus graph between the sense nodes c 1 and c 2 sim path (c 1, c 2 ) = log pathlen(c 1, c 2 ) wordsim(w 1, w 2 ) = max c 1 senses(w 1 ), c 2 senses(w 2 ) sim(c 1, c 2 )

Problems with path-based similarity Assumes each link represents a uniform distance nickel to money seems closer than nickel to standard Seems like we want a metric which lets us assign different lengths to different edges but how?

From paths to probabilities Don t measure paths. Measure probability? Define P(c) as the probability that a randomly selected word is an instance of concept (synset) c P(ROOT) = 1 The lower a node in the hierarchy, the lower its probability

Estimating concept probabilities Train by counting concept activations in a corpus Each occurence of dime also increments counts for coin, currency, standard, etc. More formally:

Concept probability examples WordNet hierarchy augmented with probabilities P(c):

Information content: definitions Information content: IC(c)= log P(c) Lowest common subsumer LCS(c 1, c 2 ) = the lowest common subsumer I.e., the lowest node in the hierarchy that subsumes (is a hypernym of) both c 1 and c 2 We are now ready to see how to use information content IC as a similarity metric

Information content examples WordNet hierarchy augmented with information content IC(c): 0.403 0.777 1.788 2.754 3.947 4.724 4.078 4.666

Resnik method The similarity between two words is related to their common information The more two words have in common, the more similar they are Resnik: measure the common information as: The information content of the lowest common subsumer of the two nodes sim resnik (c 1, c 2 ) = log P(LCS(c 1, c 2 ))

Resnik example sim resnik (hill, coast) =? 0.403 0.777 1.788 2.754 3.947 4.724 4.078 4.666

Some Numbers Let s examine how the various measures compute the similarity between gun and a selection of other words: w2 IC(w2) lso IC(lso) Resnik ----------- --------- -------- ------- ----- -- ------- ------- gun 10.9828 gun 10.9828 10.9828 weapon 8.6121 weapon 8.6121 8.6121 animal 5.8775 object 1.2161 1.2161 cat 12.5305 object 1.2161 1.2161 water 11.2821 entity 0.9447 0.9447 evaporation 13.2252 [ROOT] 0.0000 0.0000 IC(w2): information content (negative log prob) of (the first synset for) word w2 lso: least superordinate (most specific hypernym) for "gun" and word w2. IC(lso): information content for the lso.

The (extended) Lesk Algorithm Two concepts are similar if their glosses contain similar words Drawing paper: paper that is specially prepared for use in drafting Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface For each n-word phrase that occurs in both glosses Add a score of n 2 Paper and specially prepared for 1 + 4 = 5

Recap: thesaurus-based similarity

Problems with thesaurus-based methods We don t have a thesaurus for every language Even if we do, many words are missing Neologisms: retweet, ipad, blog, unfriend, Jargon: poset, LIBOR, hypervisor, Typically only nouns have coverage What to do?? Distributional methods.