An Integrated Approach to Automatic Synonym Detection in Turkish Corpus




Dr. Tuğba YILDIZ
Assist. Prof. Dr. Savaş YILDIRIM
Assoc. Prof. Dr. Banu DİRİ
İSTANBUL BİLGİ UNIVERSITY
YILDIZ TECHNICAL UNIVERSITY
Department of Computer Engineering
September 17, 2014

1 Semantic Relation
2 Experimental Setup
3 Methodology
4 Results and Evaluation
5 Future Work
6 Conclusion

Semantic Relation
Semantic relations are underlying relations between two concepts expressed by words or phrases (Moldovan 2004).
There are contradictory views about the number of semantic relations:
    Uncountability (Murphy 2003)
    13 classes (Rosario 2001)
    17 classes (Stephens 2001)
    5 classes at the top and 30 classes at the bottom (Nastase 2003)
    35 classes (Moldovan 2004)
    7 classes (SemEval 2007 - Task 4)
    etc.

Semantic Relation
Five main families of relations:
    Hyponym/Hypernym (Class inclusion)
    Meronym/Holonym (Part-Whole)
    Synonym (Similars)
    Antonym (Contrast)
    Case


Semantic Relation
In this study:
    Synonym (Similars)
        Synonyms are words with identical or similar meanings.
        Example: car is synonymous with automobile.
    Hyponym/Hypernym (Class inclusion)
        A relation of inclusion (IS-A / kind of).
        Example: a dog is an animal (dog: hyponym / animal: hypernym).
    Meronym/Holonym (Part-Whole)
        A relation between terms with respect to the significant parts of a whole.
        Example: the eye is part of the face (eye: part / face: whole).

Related Works
Methods are mostly based on distributional similarity.
The distributional hypothesis holds that semantically similar words share similar contexts (Harris 1954).
However, distributional similarity alone is insufficient for synonymy:
    it also covers near-synonyms (e.g. the top-5 for orange is: yellow, lemon, peach, pink, lime) (Lin et al. 2003)
    it does not distinguish between synonyms and other relations
Different strategies:
    integrating two independent approaches, such as distributional similarity and pattern-based approaches
    utilizing external features such as dependency relations

Related Works
In Turkish, recent studies on synonym relations are based on dictionary definitions from TDK (Turkish Language Association) and Wiktionary.
These studies define rules and phrasal patterns that occur in the dictionary definitions.
Objective of this study: to determine synonymy in a Turkish corpus.
The main contributions of our work:
    it is corpus-driven
    it relies on both dependency and semantic relations

Language Resources
BOUN Web corpus with 500M tokens (Sak et al. 2008)
    four sub-corpora:
        three of them, named NewsCor, are from three major Turkish news portals
        GenCor is a general sampling of web pages in the Turkish language
    a morphological parser based on two-level morphology
    an averaged perceptron-based morphological disambiguator
Preprocessing
    Output format: surface+root+pos-tag+[and all other markers]
    Turkish: araba    English: car
    Example: arabanın+araba+noun+a3sg+pnon+gen
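The preprocessing format above can be read back into its fields with a few lines of Python. This is a minimal sketch that assumes each analysis is a single `+`-separated string exactly as in the example; the `Analysis` type and `parse_analysis` function are hypothetical names, not part of the BOUN tools.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    surface: str        # word form as it appears in the corpus
    root: str           # root (lemma) after disambiguation
    pos: str            # part-of-speech tag
    markers: List[str]  # all remaining morphological markers


def parse_analysis(line: str) -> Analysis:
    """Split 'surface+root+pos-tag+marker1+...' into its fields."""
    parts = line.strip().split("+")
    if len(parts) < 3:
        raise ValueError(f"expected at least surface+root+pos, got: {line!r}")
    surface, root, pos, *markers = parts
    return Analysis(surface, root, pos, markers)


analysis = parse_analysis("arabanın+araba+noun+a3sg+pnon+gen")
print(analysis.root, analysis.pos, analysis.markers)
```

Working on the `root` field rather than the surface form lets inflected forms such as arabanın be counted together with araba.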

Methodology
There are no particular lexico-syntactic patterns (LSPs) for extracting synonymy.
Our main assumption: synonym pairs mostly show similar dependency and semantic characteristics in the corpus.
They share the same meronym/holonym relations, the same list of governing verbs, the same adjective modification profile, and so on.
Examples:
    Automobile is a vehicle / Car is a vehicle
    Wheel is part of automobile / Wheel is part of car

Methodology
Model: determine whether a given word pair is a synonym pair or not.
Data: 200 synonym pairs (syn) and 200 non-synonym pairs (nonsyn) were manually and randomly selected to build a training data set.
Non-synonym pairs were deliberately selected from associated (relevant) pairs such as tree-leaf, student-school, computer-game, etc.
Features from Corpus: 15 different features for each word pair:
    Co-occurrence (1)
    Semantic relations based on LSPs (4)
    Dependency relations based on syntactic patterns (10)
Target Class: SYN and NONSYN

Features from Corpus
Feature 1 - Co-occurrence: the first feature is the co-occurrence of word pairs within a broad context.
Features 2/3 - Meronym/Holonym: a big matrix in which rows depict Whole candidates and columns depict Part candidates.
Example:

Table 1. Whole X Part Matrix
Whole/Part   Part1   Part2   ...   Partm
Whole1       val.    val.    ...   val.
Whole2       val.    val.    ...   val.
...          val.    val.    ...   val.
Wholen       val.    val.    ...   val.

Each cell represents how likely the corresponding whole and part are to be in a meronymy relation.
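Feature 1 could be computed roughly as follows. The slides do not say how broad the context is or how the count is normalized, so the window size and the counting rule in this sketch are assumptions, and `cooccurrence_count` is a hypothetical helper.

```python
from typing import Sequence


def cooccurrence_count(tokens: Sequence[str], w1: str, w2: str, window: int = 5) -> int:
    """Count occurrences of w1 that have w2 within `window` tokens on either side.

    The window size is an assumption; the slides only mention a broad context.
    """
    count = 0
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Context excludes the position of w1 itself.
        context = list(tokens[lo:i]) + list(tokens[i + 1:hi])
        if w2 in context:
            count += 1
    return count


# Toy sentence of root forms (hypothetical data).
tokens = "araba yol boz ve otomobil servis git".split()
print(cooccurrence_count(tokens, "araba", "otomobil", window=5))
```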

Features from Corpus

Table 2. General Patterns (GP), Dictionary-based Patterns (TDK-P) and Bootstrapped Patterns (BP)

GP:
    NPx part of NPy
    NPx member of NPy
    NPy constituted of NPx
    NPy made of NPx
    NPy consist of NPx
    NPy has/have NPx
    NPy with NPx

TDK-P:
    Group-of (whole group all) (set flock union)
    Member-of (class member team) (from the family of Y)
    Amount-of (amount measure unit)
    Has/Have (Y has l(h))
    Consist-of
    Made-of

BP:
    NPy-gen NPx-pos
    NPy-nom NPx-pos
    NPy-Gen (N-ADJ)+ NPx-Pos
    NPy of one-of NPx
    NPx whose NPy
    NPxs with NPy

Features from Corpus
To measure the similarity of the meronymy/holonymy profiles of two given words, the cosine function is applied to the two rows/columns indexed by those words.

Table 3. Example of Whole X Part Matrix
Whole/Part   Wheel   ...   Car   ...   Wiper   Automobile   ...
Car          X       ...   ...   ...   X       ...          ...
School       ...     ...   ...   ...   ...     ...          ...
Person       ...     ...   ...   ...   ...     ...          ...
Automobile   X       ...   ...   ...   X       ...          ...
Parking      ...     ...   X     ...   ...     X            ...
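The row/column comparison can be sketched in a few lines. The matrix below encodes only the X cells of Table 3 as 1.0, which is an assumption, since the real cells hold pattern-based scores. The sparse dict-of-dicts layout and the function names are hypothetical.

```python
import math
from typing import Dict


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# whole -> {part: score}; 1.0 stands in for each X cell of Table 3 (assumed values).
whole_part = {
    "car": {"wheel": 1.0, "wiper": 1.0},
    "automobile": {"wheel": 1.0, "wiper": 1.0},
    "parking": {"car": 1.0, "automobile": 1.0},
    "school": {},
}


def column(matrix: Dict[str, Dict[str, float]], part: str) -> Dict[str, float]:
    """Extract a column: the wholes in which `part` occurs as a part."""
    return {whole: row[part] for whole, row in matrix.items() if part in row}


# Row comparison: do the two words have the same parts?
row_sim = cosine(whole_part["car"], whole_part["automobile"])
# Column comparison: are the two words parts of the same wholes?
col_sim = cosine(column(whole_part, "car"), column(whole_part, "automobile"))
print(round(row_sim, 3), round(col_sim, 3))
```

The row score and the column score then serve as two separate features for the pair, one for each direction of the part-whole relation.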

Features from Corpus
Features 4/5 - Hyponym/Hypernym: another matrix is built for hypernymy/hyponymy by applying LSPs.
The same procedure is carried out as for Meronym/Holonym.
LSPs used:
    NPs gibi CLASS (CLASS such as NPs)
    NPs ve diğer CLASS (NPs and other CLASS)
    CLASS lardan NPs (NPs from CLASS)
    NPs ve benzeri CLASS (NPs and similar CLASS)
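As a rough illustration of how such patterns pick out hyponym/hypernym candidates, the sketch below matches three of the four LSPs with regular expressions over raw text. It is heavily simplified: it assumes single-token noun phrases and ignores the morphological analysis the study relies on, and it leaves out the suffix-based "CLASS lardan NPs" pattern. The function name and example sentences are hypothetical.

```python
import re
from typing import List, Tuple

# Each pattern captures (hyponym candidate, CLASS candidate) as single tokens.
PATTERNS = [
    re.compile(r"(\w+) gibi (\w+)"),        # NPs gibi CLASS
    re.compile(r"(\w+) ve diğer (\w+)"),    # NPs ve diğer CLASS
    re.compile(r"(\w+) ve benzeri (\w+)"),  # NPs ve benzeri CLASS
]


def extract_hypernym_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) candidates matched by any pattern."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group(1), match.group(2)))
    return pairs


print(extract_hypernym_pairs("elma ve diğer meyveler pazarda satılıyor"))
print(extract_hypernym_pairs("kedi gibi hayvanlar"))
```

Counts of such matches would fill the cells of the hyponym/hypernym matrix in the same way as the Whole X Part matrix.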

Features from Corpus
Features 6-15 - Dependency Relations: these are obtained by syntactic patterns.
The more two words are modified by the same adjectives, the more likely they are synonyms.
Examples: Fast Car (+) / Fast Automobile (+) / Handsome Car (X) / Handsome Automobile (X)
36 different patterns were extracted; 8 were eliminated because of poor results.
The remaining 28 patterns were then grouped according to their syntactic structures.

Features from Corpus

Table 4. Dependency Features
Feature   Dependency relation                     # of Patterns   Example (English)        Example (Turkish)
G1        direct object of verb                   13              I drive a car            araba sürüyorum
G2        subject of verb                         3               waiting car              bekleyen araba
G3        direct object/subject of verb           3               -                        -
G4        modified by adjective+(with/without)    2               car with gasoline        benzinli araba
G5        modified by inf                         1               swimming pool            yüzme havuzu
G6        modified by noun                        1               toy car                  oyuncak araba
G7        modified by adjective                   1               red car                  kırmızı araba
G8        modified by acronym locations           1               the cars in ABD          ABD'deki arabalar
G9        modified by proper noun locations       1               the cars in Istanbul     Istanbul'daki arabalar
G10       modified by locations                   2               the car at parking lot   otoparktaki araba

Results and Evaluation
All features explained before are computed for all pairs/instances.
The computed scores of the pairs are kept as a training set.
Machine learning algorithms are applied with all features taken into account.
The most suitable algorithm was logistic regression.
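The classification step can be pictured with a small logistic regression trained by batch gradient descent. This is only a sketch: the slides do not name the toolkit or the settings used, the two-feature toy data is invented for illustration (the study uses 15 features and 400 pairs), and the learning rate and epoch count are arbitrary.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> str:
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return "SYN" if p >= 0.5 else "NONSYN"


# Hypothetical feature vectors: [holonym cosine, adjective-profile cosine].
X = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9],   # synonym pairs
     [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]   # associated non-synonym pairs
y = [1, 1, 1, 0, 0, 0]

w, b = train_logreg(X, y)
print(predict(w, b, [0.85, 0.75]), predict(w, b, [0.15, 0.15]))
```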

Results and Evaluation
The first aim is to find out which feature is the most informative for detecting synonymy and contributes most to the overall success of the model.

Table 5. F-Measure of Semantic Relations (SRs)
Features    co-occurrence   hyponym   hypernym   meronym   holonym
F-Measure   62.5            60.5      60         68.7      73.7

The semantic features are notably better than the dependency features.
Among semantic relations, the most powerful attributes are the meronymy and holonymy features.
The possible reason: a sufficient number of cases are matched by the lexico-syntactic patterns.

Results and Evaluation
Among dependency relations, G1, G4 and G7 perform better than the others.

Table 6. F-Measure of Dependency Relations
Features    G1     G2   G3     G4   G5     G6     G7   G8     G9     G10
F-Measure   64.7   58   60.5   65   61.6   58.8   63   49.4   48.3   62.6

The poorest groups, G8 and G9, have low production capacity.
The successful features are linearly dependent on the target class.
On the aggregated data, where all useful features are considered, logistic regression achieves an F-measure of 80.3%.
This score is better than the individual performance of any single feature.
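For reference, the F-measure reported in Tables 5 and 6 combines precision and recall. The sketch below assumes the balanced F1 variant, which the slides do not state explicitly, and uses made-up confusion counts rather than the study's results.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the SYN class, for illustration only.
score = f_measure(tp=80, fp=20, fn=20)
print(f"{100 * score:.1f}")
```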

Future Work
In our paper: dictionary definitions, WordNet, and other useful resources could be used and evaluated in future work.
In our review: WordNet seems a good resource that could improve the results even more.


Conclusion
Identifying semantic relations is one of the problems in NLP applications.
Automatic acquisition of semantic relations has become more popular in recent years.
Synonym pairs cannot be discovered by applying LSPs directly; they are detected here thanks to semantic and syntactic relations.
The most powerful semantic relations: Meronym and Holonym.
The most powerful syntactic relation: G4, modified by adjective+(with/without).
Aggregating the useful features gives a promising result.

Thank You For Listening...