High-Performance, Language-Independent Morphological Segmentation



Transcription:

High-Performance, Language-Independent Morphological Segmentation
Sajib Dasgupta and Vincent Ng
Human Language Technology Research Institute, University of Texas at Dallas

Morphology
Segmentation of words into prefixes, suffixes and roots. Example: unfriendliness = un + friend + ly + ness

Goal: Unsupervised Morphology
Input: an unannotated text corpus (words with frequencies). Output: a segmentation for each word.

Word           Frequency   Segmentation
aback               157    aback
abacus                6    abacus
abacuses              3    abacus+es
abalone              77    abalone
abandon            2781    abandon
abandoned          4696    abandon+ed
abandoning         1082    abandon+ing
abandonment         378    abandon+ment
abandonments         23    abandon+ment+s
abandons            117    abandon+s
...

Why Unsupervised Morphology?
Advantages: no linguistic knowledge as input; language independent; suitable for resource-scarce languages. The only thing we need is a text corpus! We tested on 4 different languages: English, Finnish, Turkish, Bengali.

Plan for the Talk
Basic system: motivated by Keshava and Pitler, the best-performing algorithm in the PASCAL Morpho-Challenge, 2005. Extensions: address two problems with the basic system.

Problem 1: Over-segmentation
validate = valid + ate    candidate = candid + ate
devalue = de + value      decrease = de + crease
Biggest source of errors. Tough to solve!

Problem 2: Orthographic Change
Consider denial = deny (deni) + al, stability = stable (stabil) + ity. How to handle spelling changes?

Basic System
1. Induce prefixes, suffixes and roots
2. Segment the words using the induced prefixes, suffixes and roots

Inducing Prefixes and Suffixes
Assumptions (Keshava and Pitler, 2006): if xy and x are in the vocabulary, y is a suffix; if xy and y are in the vocabulary, x is a prefix. Too simplistic! From <diverge, diver>, "ge" is induced as a suffix (wrong!). Solution: score the affixes and remove low-scoring affixes.

Scoring the Affixes
Scoring metric: score(x) = frequency(x) * length(x), where frequency(x) is the number of different words x attaches to and length(x) is the number of characters in x. Retain only affixes whose score > k, where k is the threshold.
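The induction-plus-scoring step above can be sketched in a few lines of Python. This is a hypothetical helper (the slides do not show the authors' implementation): a substring y is a candidate suffix whenever both "xy" and "x" occur in the vocabulary, and candidates are kept only if frequency * length exceeds k.

```python
# Sketch of suffix induction and scoring; a hypothetical helper, not the
# authors' code. frequency(y) counts the different words y attaches to.
from collections import Counter

def induce_suffixes(vocab, k):
    """Return candidate suffixes whose score = frequency * length > k."""
    words = set(vocab)
    freq = Counter()
    for w in words:
        for i in range(1, len(w)):
            stem, suffix = w[:i], w[i:]
            if stem in words:        # "xy" and "x" both in the vocabulary
                freq[suffix] += 1
    return {s for s, f in freq.items() if f * len(s) > k}

vocab = ["walk", "walks", "walked", "talk", "talks", "talked",
         "jump", "jumps", "jumped", "diver", "diverge"]
print(induce_suffixes(vocab, 2))   # "ge" (from <diverge, diver>) is filtered out
```

With k = 2, the spurious suffix "ge" (proposed only by the diverge/diver pair) falls below threshold, while "s" and "ed" survive.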

Setting the threshold k
k depends on vocabulary size: in a larger vocabulary, the same affix attaches to a larger number of words, so we need a larger k. For example, English: 50; Finnish: 300 (the Finnish vocabulary is almost 6 times larger than the English one).

Basic System
Induce prefixes, suffixes and roots. Segment the words using the induced morphemes.

Inducing Roots
For each word w in the vocabulary, we consider w a root if it is not divisible, i.e., w cannot be segmented as p+r or r+x, where p is a prefix, x is a suffix, and r is a word in the vocabulary.
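The divisibility check above translates directly into code. Again a minimal sketch under the same assumptions, not the authors' implementation:

```python
def induce_roots(vocab, prefixes, suffixes):
    """A word w is a root if it is not divisible, i.e. it cannot be
    segmented as p+r (p an induced prefix) or r+x (x an induced suffix)
    with r itself a word in the vocabulary."""
    words = set(vocab)
    roots = set()
    for w in words:
        divisible = (
            any(w.startswith(p) and w[len(p):] in words for p in prefixes)
            or any(w.endswith(x) and w[:-len(x)] in words for x in suffixes)
        )
        if not divisible:
            roots.add(w)
    return roots

print(induce_roots({"open", "reopen", "opens", "opened"}, {"re"}, {"s", "ed"}))
# only "open" survives: the other words are divisible
```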

Plan for the Talk
Basic system: segmentation using automatically induced morphemes. Two extensions to the basic system: handling over-segmentation (addresses Problem 1); handling orthographic changes (addresses Problem 2).

Over-Segmentation
affectionate = affection + ate? Correct attachment. candidate = candid + ate? Incorrect attachment. We propose 2 methods: Relative Word-Root Frequency; Suffix Level Similarity.

Relative Frequency
Our hypothesis: if a word w can be segmented as r+a or a+r, where r is a root and a is an affix, then correct-attachment(w, r, a) => freq(w) < freq(r). An inflectional or derivational form of a root word occurs less frequently than the root itself.
Examples:
freq(reopen) < freq(open) => reopen = re + open
freq(candidate) = 6380 > freq(candid) = 119 => candidate != candid + ate

How correct is this hypothesis?
In other words, do all the inflectional or derivational forms of root words occur less frequently than the root? We randomly chose 387 English words which can be segmented into Prefix+Root or Root+Suffix. For each word we check the Word-Root Frequency Ratio (WRFR).

              # of words   WRFR < 1
Root+Suffix      344        70.1%
Prefix+Root       34        88.2%
Overall          378        71.7%

Relaxing the Hypothesis
To increase the accuracy of the hypothesis we relax it as follows:
Original hypothesis: correct-attachment(w, r, a) => WRFR(w, r) < 1
Relaxed hypothesis: correct-attachment(w, r, a) => WRFR(w, r) < t
We set the threshold t to 10 for suffixes and 2 for prefixes.
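A minimal sketch of the relaxed WRFR test. The counts 6380 and 119 are from the slides; the reopen/open counts are made up for illustration, and the helper name is hypothetical:

```python
def keep_attachment(word, root, freq, is_suffix, t_suffix=10, t_prefix=2):
    """Relaxed hypothesis: keep the attachment if the Word-Root Frequency
    Ratio WRFR(w, r) = freq(w) / freq(r) is below the threshold t
    (t = 10 for suffix attachments, t = 2 for prefix attachments)."""
    wrfr = freq[word] / freq[root]
    return wrfr < (t_suffix if is_suffix else t_prefix)

freq = {"candidate": 6380, "candid": 119,   # counts from the slides
        "reopen": 150, "open": 2000}        # made-up counts for illustration
print(keep_attachment("reopen", "open", freq, is_suffix=False))      # True
print(keep_attachment("candidate", "candid", freq, is_suffix=True))  # False
```

With the relaxed threshold, reopen = re + open is kept (WRFR = 0.075 < 2), while candidate = candid + ate is rejected (WRFR ≈ 53.6 > 10).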

Over-Segmentation
We propose 2 methods: Relative Word-Root Frequency; Suffix Level Similarity.

Suffix Level Similarity
Our hypothesis: if w attaches to a suffix x, then w should attach to suffixes similar to x, i.e., w + x => w + Similar(x). candid + ate = candidate?

Computing similarity between two suffixes
How to give more weight to "ated" than "s"? Similarity metric:
Sim(x, y) = P(y|x) * P(x|y) = (n / n1) * (n / n2),
where n1 is the number of words that combine with x, n2 is the number of words that combine with y, and n is the number of words that combine with both x and y.

Suffix Level Similarity
How do we use suffix level similarity to check whether r + x is correct? We take the 10 most similar suffixes of x according to Sim and scale each suffix's Sim value linearly between 1 and 10. Given the 10 similar suffixes x1, x2, ..., x10, we check whether
sum over i = 1..10 of (f_i * w_i) > t,
where f_i is 1 if x_i attaches to r (0 otherwise), w_i is the scaled similarity between suffix x and x_i, and t is a predefined threshold (t > 0).
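The similarity metric and the weighted vote above can be sketched as follows. These are hypothetical helpers: `attach` maps each suffix to the set of words it attaches to, and the 1-10 scaling follows the slide.

```python
def sim(x, y, attach):
    """Sim(x, y) = P(y|x) * P(x|y) = (n/n1) * (n/n2)."""
    n1, n2 = len(attach[x]), len(attach[y])
    n = len(attach[x] & attach[y])
    return (n / n1) * (n / n2)

def suffix_level_score(root, x, attach, top=10):
    """Sum of scaled similarities w_i over the `top` suffixes most similar
    to x that actually attach to `root` (the sum of f_i * w_i)."""
    others = sorted((s for s in attach if s != x),
                    key=lambda s: sim(x, s, attach), reverse=True)[:top]
    sims = [sim(x, s, attach) for s in others]
    lo, hi = min(sims), max(sims)
    def scale(v):                      # linear scaling into [1, 10]
        return 1.0 if hi == lo else 1 + 9 * (v - lo) / (hi - lo)
    return sum(scale(v) for s, v in zip(others, sims) if root in attach[s])

attach = {"ate": {"affection", "deliber"},
          "ated": {"affection", "deliber"},
          "s": {"affection", "walk", "candid"}}
# "affection" attaches to suffixes similar to "ate"; "candid" does not
print(suffix_level_score("affection", "ate", attach))
print(suffix_level_score("candid", "ate", attach))
```

In this toy vocabulary, affection + ate is supported by the highly similar suffix "ated", while candid + ate gets support only from the dissimilar suffix "s" and scores far lower.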

Suffix Level Similarity
It's too strict for words that do not attach to many suffixes. We decided to combine WRFR and suffix level similarity to detect incorrect attachments:
-WRFR + β * (suffix level similarity) < 0,
where β is set to 0.15 for all 4 languages we considered.

Plan for the Talk
Basic system: segmentation using automatically induced morphemes. Two extensions to the basic system: handling over-segmentation (addresses Problem 1); handling orthographic changes (addresses Problem 2).

Handling Orthographic Changes
Goal: segment words like denial into deny + al. Challenge: the system does not have any knowledge of language-specific orthographic rules like y:i / _ + al.

Handling Orthographic Changes
Can we generate the orthographic change rules automatically?

Handling Orthographic Changes
Can we generate the orthographic change rules automatically? Yes, but... considering the complexity of the task, we focus on: change at the morpheme boundary only; change of edit distance 1 (i.e., one character insertion, deletion or replacement). Our orthographic induction algorithm consists of 3 steps.

Step 1: Inducing Candidate Allomorphs
If αaβ is a word in the vocabulary (e.g. "denial"), β is an induced suffix (e.g. "al"), αb is an induced root (e.g. "deny"), and the attachment of β to αb is correct according to relative frequency, then αa (e.g. "deni") is an allomorph of αb (e.g. "deny"). The allomorphs generated from at least 2 different suffixes are called candidate allomorphs. Now we have a list of <candidate allomorph, root, suffix> tuples (e.g. <deni, deny, al>).
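Step 1 can be sketched as below. This is a simplification: only final-character replacements are matched here, whereas the slides allow any edit-distance-1 change at the morpheme boundary, and `freq_ok` is a hypothetical stand-in for the relative-frequency test.

```python
from collections import defaultdict

def candidate_allomorphs(vocab, suffixes, roots, freq_ok):
    """Collect <allomorph, root, suffix> proposals; keep pairs proposed
    by at least 2 different suffixes as candidate allomorphs."""
    words = set(vocab)
    proposals = defaultdict(set)           # (allomorph, root) -> suffixes
    for w in words:
        for beta in suffixes:
            if w.endswith(beta) and len(w) > len(beta):
                alpha_a = w[:-len(beta)]   # e.g. "deni" from "denial"
                if alpha_a in roots:
                    continue               # no orthographic change needed
                for alpha_b in roots:      # e.g. "deny"
                    if (len(alpha_b) == len(alpha_a)
                            and alpha_b[:-1] == alpha_a[:-1]
                            and freq_ok(w, alpha_b)):
                        proposals[(alpha_a, alpha_b)].add(beta)
    return {k: v for k, v in proposals.items() if len(v) >= 2}

result = candidate_allomorphs(
    ["denial", "denies", "deny"], {"al", "es"}, {"deny"}, lambda w, r: True)
print(result)   # ("deni", "deny") is proposed by both "al" and "es"
```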

Step 2: Inducing Orthographic Rules
Each <candidate allomorph, root, suffix> tuple is associated with an orthographic rule:
From <deni, deny, al> we learn y:i / _ + al
From <soci, sock, al> we learn k:i / _ + al (wrong! So, social = sock + al?)
We have to filter out erroneous rules.

Step 3: Filtering the Rules
Goal: for each suffix x, remove the low-frequency rules. The frequency of a rule r is the number of different <allomorph, root, x> tuples that generate r. Remove r if it is generated by fewer than 15% of the tuples.

Step 3: Filtering the Rules
Goal: filter morphologically undesirable rules. If there are two rules like A:B / _ + x and A:C / _ + x, then the rules are morphologically undesirable, because A changes to both B and C in the same environment x. To filter morphologically undesirable rules:
1. We define the strength of a rule as strength(A:B) = frequency(A:B) / frequency(A:@), where @ stands for any character.
2. We then keep only those rules r for which frequency(r) * strength(r) > t.
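The strength-based filter can be sketched as below, with each rule reduced to its final-character change A:B in the environment of suffix x (a hypothetical simplification; the example tuples for burial = bury + al and trial = try + al are illustrative):

```python
from collections import Counter

def filter_rules(tuples, t):
    """Keep rules A:B / _ + x with frequency(A:B) * strength(A:B) > t,
    where strength(A:B) = frequency(A:B) / frequency(A:@) and @ is any
    character A changes to in the same suffix environment."""
    freq = Counter()        # frequency of each rule (A, B, x)
    total = Counter()       # frequency(A:@) per environment (A, x)
    for allomorph, root, suffix in tuples:
        a, b = root[-1], allomorph[-1]
        freq[(a, b, suffix)] += 1
        total[(a, suffix)] += 1
    return {rule: f for rule, f in freq.items()
            if f * (f / total[(rule[0], rule[2])]) > t}

tuples = [("deni", "deny", "al"), ("buri", "bury", "al"),
          ("tri", "try", "al"), ("soci", "sock", "al")]
rules = filter_rules(tuples, 2)
print(rules)   # y:i / _ + al survives; the spurious k:i / _ + al is dropped
```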

Setting the threshold t
t depends on vocabulary size. For example, English: 4; Finnish: 25 (the Finnish vocabulary is almost 6 times larger than the English one).

Evaluation
Results for English and Bengali; PASCAL Morpho-Challenge results for English, Finnish, Turkish.

Experimental Setup: Corpora
English: WSJ and BLLIP. Bengali: 5 years of news articles from Prothom Alo.

Experimental Setup: Test Set Creation
English: 5000 words from our corpus; correct segmentation given by CELEX. Bengali: 4191 words from our corpus; correct segmentation provided by two linguists.

Experimental Setup: Evaluation Metrics
Exact accuracy; F-score.

Results

                              English                    Bengali
                        Acc    P     R     F       Acc    P     R     F
Linguistica             68.9  84.8  75.7  80.0     36.3  58.2  63.3  60.6
Morfessor               64.9  69.6  85.3  76.6     56.5  89.7  67.4  76.9
Basic system            68.1  79.4  82.8  81.1     57.7  79.6  81.2  80.4
Relative frequency      74.0  86.4  82.5  84.4     63.2  85.6  79.9  82.7
Suffix level similarity 74.9  88.6  82.3  85.3     66.1  89.7  78.8  83.9
Allomorph detection     78.3  88.3  86.4  87.4     68.3  89.3  81.3  85.1


PASCAL Morpho-Challenge Results
Three languages: English, Finnish, Turkish.

Results (F-scores)

             English  Finnish  Turkish
Winner          76.8     64.7     65.3
Our System      79.4     65.2     66.2
Morfessor       66.2     66.4     70.1

Winner for English: Keshava and Pitler's (2006) algorithm. Winner for Finnish and Turkish: Bernhard's (2006) algorithm. Creutz's remark on the PASCAL Challenge: none of the participants performs well on all 3 languages.

Our system outperforms the winners! Robustness across different languages.

Morfessor slightly outperforms our system for Finnish and Turkish, but what about English?

Morfessor performs poorly for English.

Conclusion
Our system shows robust performance across different languages: it outperforms Linguistica and Morfessor for English and Bengali, and compares favorably to the winners on the PASCAL datasets.

Thank you!