Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction



Similar documents
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

POS Tagging 1. POS Tagging. Rule-based taggers Statistical taggers Hybrid approaches

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Tagging with Hidden Markov Models

Word Completion and Prediction in Hebrew

Micro blogs Oriented Word Segmentation System

Building A Vocabulary Self-Learning Speech Recognition System

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

A POS-based Word Prediction System for the Persian Language

NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models

Course: Model, Learning, and Inference: Lecture 5

Log-Linear Models. Michael Collins

Estonian Large Vocabulary Speech Recognition System for Radiology

Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke

Special Topics in Computer Science

PoS-tagging Italian texts with CORISTagger

Development of dynamically evolving and self-adaptive software. 1. Background

Training and evaluation of POS taggers on the French MULTITAG corpus

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

A Mixed Trigrams Approach for Context Sensitive Spell Checking

A Systematic Cross-Comparison of Sequence Classifiers

Learning Morphological Disambiguation Rules for Turkish

Collecting Polish German Parallel Corpora in the Internet

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Turkish Radiology Dictation System

An Introduction to Markov Decision Processes. MDP Tutorial - 1

Introduction to Algorithmic Trading Strategies Lecture 2

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Automatic Identification of Arabic Language Varieties and Dialects in Social Media

Chapter 8. Final Results on Dutch Senseval-2 Test Data

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

Improving statistical POS tagging using Linguistic feature for Hindi and Telugu

Language Modeling. Chapter Introduction

QUT Digital Repository:

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Spam Filtering with Naive Bayesian Classification

OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH. Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane

SENTIMENT ANALYSIS: A STUDY ON PRODUCT FEATURES

How to make Ontologies self-building from Wiki-Texts

Context Grammar and POS Tagging

Chapter 7. Language models. Statistical Machine Translation

Language and Computation

Natural Language Database Interface for the Community Based Monitoring System *

Activities. but I will require that groups present research papers

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Understanding Video Lectures in a Flipped Classroom Setting. A Major Qualifying Project Report. Submitted to the Faculty

Why language is hard. And what Linguistics has to say about it. Natalia Silveira Participation code: eagles

Simple Type-Level Unsupervised POS Tagging

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Multilingual Sentiment Analysis on Social Media. Erik Tromp. July 2011

Segmentation and Classification of Online Chats

CS 533: Natural Language. Word Prediction

Factored bilingual n-gram language models for statistical machine translation

DEPENDENCY PARSING JOAKIM NIVRE

Performance Analysis of a Telephone System with both Patient and Impatient Customers

Overview of the EVALITA 2009 PoS Tagging Task

Morphological Analysis and Named Entity Recognition for your Lucene / Solr Search Applications

Fast string matching

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Automated Content Analysis of Discussion Transcripts

Terminology Extraction from Log Files

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms

Simple Language Models for Spam Detection

Customizing an English-Korean Machine Translation System for Patent Translation *

Structure of Clauses. March 9, 2004

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

A Method for Automatic De-identification of Medical Records

Log-Linear Models a.k.a. Logistic Regression, Maximum Entropy Models

Chapter 5. Phrase-based models. Statistical Machine Translation

Towards a Visually Enhanced Medical Search Engine

Model Checking II Temporal Logic Model Checking

Spam Detection A Machine Learning Approach

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Statistical Machine Translation

Hidden Markov Models

Computer Aided Document Indexing System

Lecture 2 Introduction to Data Flow Analysis

Less naive Bayes spam detection

Twitter sentiment vs. Stock price!

Applying Co-Training Methods to Statistical Parsing. Anoop Sarkar anoop/

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye MSRC

Transcription:

Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Uwe D. Reichel Department of Phonetics and Speech Communication University of Munich reichelu@phonetik.uni-muenchen.de

Abstract Generalization of a Markov model part-of-speech (POS) tagger: replacing the P (w t) emission probabilities of word w given tag t by a linear interpolation of tag emission probabilities given a list of representations of w Word Representation: string suffix of word cut off at a local maximum of backward successor variety What for? retrieval of linguistically meaningful string suffixes, that may relate to certain POS labels, without the need of linguistic knowledge (language independence, addressing out of vocabulary (OOV) problem) Results: Basic Markov model POS taggers are significantly outperformed. Abstract 1

Basic Form of a Markov POS Tagger (Jelinek, 1985) Estimate for most probable tag sequence ˆT given word sequence W [ ] ˆT = arg max P (T W ) (1) T [ ] = arg max P (T )P (W T ) (Bayes, P(W) constant) (2) T Simplifying Assumptions Probability of word w i depends only on its tag t i Probability of tag t i depends only on a limited tag history t-history i [ n ˆT = arg max t 1...t n i=1 ] P (t i t-history i )P (w i t i ) (3) Retrieval of ˆT using the Viterbi algorithm Markov Tagger 2

Generalisations of the Basic Model by linear interpolation replacing P (t i t-history i ) by j u jp (t i t-history ij ) replacing P (w i t i ) by P (w i) P (t i ) k v kp (t i w-representation ik ) (reapplication of Bayes formula) [ n ˆT = arg max t 1...t n i=1 1 P (t i ) u j P (t i t-history ij ) j k ] v k P (t i w-representation ik ) calculation of interpolation weights u j and v k via the EM algorithm (4) Markov Tagger 3

Word Representation: String Suffixes Motivation in many languages (e.g. German, English) POS information is stored in suffix morphemes, inflectional endings, back parts of compounds (Gelegenheit/NN, Umgehungsstraße/NN, partly/adv) the usage of these entities next to the whole word reduces the OOV problem (see also Suendermann and Ney, 2003) Desideratum: retrieve linguistically meaningful string suffixes without prior linguistic knowledge language independence Word Representation 4

Word Representation: Retrieval I suffixes are determined by Weighted Backward Successor Variety (SV) SV of a string: number of different characters that follow it in given lexicon Backward SV: SV s are calculated from reversed strings in order to separate linguistically meaningful suffixes Weighting: SV s are weighted w.r.t. mean SV at the corresponding string position to eliminate positional effects lexicon of reversed words represented in form of a trie (see next sheet) SV at given state: number of transitions to other states Usage: treat SV peaks as morpheme boundaries (cf. Peak and Plateau algorithm (Nascimento and da Cunha, 1998)) Word Representation 5

Word Representation: Retrieval II Lexicon Trie (reversely) storing the entries Einigung, Kreuzigung and Eignung The SV maxima at nodes 3 and 5 correspond to the boundaries of the morphemes ung and ig respectively Word Representation 6

Training and Application Training build lexicon trie for training material get word representations: word form and two string suffixes derived by SV (e.g. kommenden [ :kommenden, en, enden ]) calculate for all word representations w rep, tags t and tag histories t hist: P (t w rep) and P (t t hist) as well as the interpolation weights for equation 4 Application transform each word into list of representations (as in training) get most probable tag sequence via equation 4 applying Viterbi Training and Application 7

Data and Results Data: 382402 tokens tagged by the IMS Tree Tagger (Schmidt, 1995) and partially hand corrected; 85 % used for training, 15 % for testing (OOV: 38.12 % of types, 11.52 % of tokens) Classes: 54 different POS tags (Tree Tagger inventory) Results: accuracy κ relative entropy Baseline Taggers: Unigram 89.61 % 0.89 1.33 lin. interpolated Trigram 93.22 % 0.93 0.61 New Tagger: Trigram, word repr. 95.36 % 0.95 0.45 This study s tagger significantly outperforms the baseline taggers (two tailed McNemar test, p = 0.001) erroneous data probably affects accuracy (e.g. finite vs. infinite verbs) Data and Results 8

Literature Jelinek, F. (1985). Markov source modeling of text generation. In J.K. Skwirzynski (ed.), The Impact of Processing Techniques on Communications, vol. E91 of NATO ASI series. Dordrecht: M. Nijhoff. Nascimento, M.A., da Cunha, A.C.R. (1998). An Experiment Stemming Non-Traditional Text. SPIRE 98 Proceedings, Santa Cruz de La Sierra, Bolivia. Schmid, H. (1995). Improvements in Part-of-Speech Tagging with an Application to German. In: EACL, SIGDAT, Dublin. Suendermann, D., Ney, H. (2003). Synther a new m-gram POS tagger. In: Proc. NLP-KE, Bejing, China. Literature 9