A probabilistic part-of-speech tagger for Swedish



Similar documents
Abraham Zaks. Technion I.I.T. Haifa ISRAEL. and. University of Haifa, Haifa ISRAEL. Abstract

Numerical Methods with MS Excel

1. The Time Value of Money

IDENTIFICATION OF THE DYNAMICS OF THE GOOGLE S RANKING ALGORITHM. A. Khaki Sedigh, Mehdi Roudaki

The Gompertz-Makeham distribution. Fredrik Norström. Supervisor: Yuri Belyaev

Statistical Pattern Recognition (CE-725) Department of Computer Engineering Sharif University of Technology

CHAPTER 2. Time Value of Money 6-1

ECONOMIC CHOICE OF OPTIMUM FEEDER CABLE CONSIDERING RISK ANALYSIS. University of Brasilia (UnB) and The Brazilian Regulatory Agency (ANEEL), Brazil

Classic Problems at a Glance using the TVM Solver

APPENDIX III THE ENVELOPE PROPERTY

Preprocess a planar map S. Given a query point p, report the face of S containing p. Goal: O(n)-size data structure that enables O(log n) query time.

ANOVA Notes Page 1. Analysis of Variance for a One-Way Classification of Data

Fractal-Structured Karatsuba`s Algorithm for Binary Field Multiplication: FK

A COMPARATIVE STUDY BETWEEN POLYCLASS AND MULTICLASS LANGUAGE MODELS

Chapter Eight. f : R R

Average Price Ratios

10.5 Future Value and Present Value of a General Annuity Due

The analysis of annuities relies on the formula for geometric sums: r k = rn+1 1 r 1. (2.1) k=0

A New Bayesian Network Method for Computing Bottom Event's Structural Importance Degree using Jointree

Credibility Premium Calculation in Motor Third-Party Liability Insurance

Chapter = 3000 ( ( 1 ) Present Value of an Annuity. Section 4 Present Value of an Annuity; Amortization

SHAPIRO-WILK TEST FOR NORMALITY WITH KNOWN MEAN

Simple Linear Regression

Chapter 3. AMORTIZATION OF LOAN. SINKING FUNDS R =

STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. x, where. = y - ˆ " 1

How To Value An Annuity

The Digital Signature Scheme MQQ-SIG

Banking (Early Repayment of Housing Loans) Order,

Optimal multi-degree reduction of Bézier curves with constraints of endpoints continuity

MDM 4U PRACTICE EXAMINATION

Maintenance Scheduling of Distribution System with Optimal Economy and Reliability

T = 1/freq, T = 2/freq, T = i/freq, T = n (number of cash flows = freq n) are :

n. We know that the sum of squares of p independent standard normal variables has a chi square distribution with p degrees of freedom.

The simple linear Regression Model

Applications of Support Vector Machine Based on Boolean Kernel to Spam Filtering

A Study of Unrelated Parallel-Machine Scheduling with Deteriorating Maintenance Activities to Minimize the Total Completion Time

Speeding up k-means Clustering by Bootstrap Averaging

On Error Detection with Block Codes

ADAPTATION OF SHAPIRO-WILK TEST TO THE CASE OF KNOWN MEAN

Security Analysis of RAPP: An RFID Authentication Protocol based on Permutation

Curve Fitting and Solution of Equation

of the relationship between time and the value of money.

Optimal Packetization Interval for VoIP Applications Over IEEE Networks

An Approach to Evaluating the Computer Network Security with Hesitant Fuzzy Information

Integrating Production Scheduling and Maintenance: Practical Implications

Constrained Cubic Spline Interpolation for Chemical Engineering Applications

Application of Grey Relational Analysis in Computer Communication

The Time Value of Money

Fast, Secure Encryption for Indexing in a Column-Oriented DBMS

ANNEX 77 FINANCE MANAGEMENT. (Working material) Chief Actuary Prof. Gaida Pettere BTA INSURANCE COMPANY SE

On formula to compute primes and the n th prime

Capacitated Production Planning and Inventory Control when Demand is Unpredictable for Most Items: The No B/C Strategy

A DISTRIBUTED REPUTATION BROKER FRAMEWORK FOR WEB SERVICE APPLICATIONS

6.7 Network analysis Introduction. References - Network analysis. Topological analysis

Reinsurance and the distribution of term insurance claims

IP Network Topology Link Prediction Based on Improved Local Information Similarity Algorithm

Projection model for Computer Network Security Evaluation with interval-valued intuitionistic fuzzy information. Qingxiang Li

Statistical Intrusion Detector with Instance-Based Learning

Green Master based on MapReduce Cluster

Settlement Prediction by Spatial-temporal Random Process

DECISION MAKING WITH THE OWA OPERATOR IN SPORT MANAGEMENT

Automated Event Registration System in Corporation

Performance Attribution. Methodology Overview

Common p-belief: The General Case

A Parallel Transmission Remote Backup System

AP Statistics 2006 Free-Response Questions Form B

Optimal replacement and overhaul decisions with imperfect maintenance and warranty contracts

Three Dimensional Interpolation of Video Signals

Dynamic Two-phase Truncated Rayleigh Model for Release Date Prediction of Software

Learning to Filter Spam A Comparison of a Naive Bayesian and a Memory-Based Approach 1

Load Balancing Control for Parallel Systems

FINANCIAL MATHEMATICS 12 MARCH 2014

10/19/2011. Financial Mathematics. Lecture 24 Annuities. Ana NoraEvans 403 Kerchof

ANALYTICAL MODEL FOR TCP FILE TRANSFERS OVER UMTS. Janne Peisa Ericsson Research Jorvas, Finland. Michael Meyer Ericsson Research, Germany

Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems

ISyE 512 Chapter 7. Control Charts for Attributes. Instructor: Prof. Kaibo Liu. Department of Industrial and Systems Engineering UW-Madison

Beta. A Statistical Analysis of a Stock s Volatility. Courtney Wahlstrom. Iowa State University, Master of School Mathematics. Creative Component

Report 52 Fixed Maturity EUR Industrial Bond Funds

The impact of service-oriented architecture on the scheduling algorithm in cloud computing

Fault Tree Analysis of Software Reliability Allocation

The Popularity Parameter in Unstructured P2P File Sharing Networks

Software Reliability Index Reasonable Allocation Based on UML

Raport końcowy Zadanie nr 8:

Sequences and Series

Study on prediction of network security situation based on fuzzy neutral network

Transcription:

A probablstc part-of-speech tagger for Swedsh eter Nlsso Departmet of Computer Scece Uversty of Lud Lud, Swede dat00pe@ludat.lth.se Abstract Ths paper presets a project for mplemetg ad evaluatg a probablstc part-of-speech tagger for Swedsh. The resultg program accurately tagged 95.9% of text cotag oly kow words, ad 89.7% of radomly selected text. 1 Itroducto art-of-speech taggg s the method to assg each word a text wth a tag descrbg ts proper part of speech the cotext. Two major problems the process s resolvg ambguous words ad classfyg ukow words. To solve ths task a few dfferet methods has bee developed. The most commo are taggg based o rules, ad probablstc, or stochastc, taggg based o statstcs. Ths paper presets a project for mplemetg a probablstc part-of-speech tagger for Swedsh. 2 robablstc OS taggg I probablstc part-of-speech taggg the tagger program s traed o a corpus aotated wth a certa tagset. Statstcs for the sequece of words ad tags s collected, ad s the used for estmatg the most probable taggg of utagged text. The backgroud theory s gve (Charak et al., 1993. A structve descrpto of the steps take ad decsos made mplemetg a effcet stochastc part-of-speech tagger s foud (Carlberger ad Ka, 1999, a paper whch has served as a major gude for the tal steps of ths project. 2.1 The basc model Ths secto cotas a bref overvew of the basc theory for developg the model. A text ca be cosdered as a sequece of radom varables W where each W, W, 1.. = 1 2... W varable ca have a value from a fxed set of words { w, w,.... For each sequece of word 1 2 w } varables there s a correspodg sequece of radom varables T, that ca take ther values from a fxed set of tags { t 1 1.., t 2,... t }. The taggg problem ca the be descrbed as fdg the most probable sequece of tags,, kowg the sequece of words, ths s: W 1.. w 1, T ( w = arg max ( t w 1, 1, t 1, t 1,. A formal defto of 1, From Bayes rule we kow that: ( A B ( B A ( A ( B =. Applyg ths o the rght had sde we get:

T ( w 1, = arg max t ( w1, t1, ( t1, ( w 1, 1, ad sce the deomator wll be costat for a gve problem the expresso s smplfed to: T ( w arg max ( w t ( t = (1 1, t1, 1, 1, 1, 2.2 Approxmatos for the mplemetato The probablty values eeded for solvg ths problem s acheved from had-aotated corpora. Eve a large corpus wll ot be large eough for collectg the statstcs eeded Equato 1. For ths reaso the problem s approxmated usg the followg two assumptos: ( t t, w ( t t t 1, 1 1, 1 = 2, 1 ( w t, w ( w t 1, 1, 1 = The frst assumpto states that the curret tag s oly depedet o the two prevous tags. The secod assumpto states that the curret word s oly depedet o the curret tag. The expresso for the problem ow becomes: T ( w, = arg max ( t t 2, t 1 ( 1 w t 1, = 1 ( t t 2, t 1 ( t t 1 t (2 Equato 2 allows us to use statstcs from every sequece of three tagged words, called trgrams, a corpus. Eve wth ths smplfcato the data wll usually be too sparse to produce a relable result. There are smply too may trgrams that wll ot appear or wll oly appear a small umber of tmes a corpus. We wll have to approxmate further by usg the probabltes for the sequece of two tagged words, bgrams: ad the case the bgram data wll be too sparse we wll also use the ugrams: ( t t ( t 1 Now we apply a useful strategy by terpolatg betwee these: ( t t t 1 2, ( t t 2, t 1 + λ2( t t 1 λ ( t λ + 1 3 where λ + λ + λ 1. 1 2 3 = uttg ths to Equato 2, we get: T ( w 1, = [ λ1 ( t t 2, t 1 + λ2( t t 1 + λ3( t ] ( w t = 1 (3 Fdg good values for the three lambdas s part of ths project. 2.3 Estmatg the probabltes For estmatg the probabltes Equato 3 the followg statstcs s eeded: C : the cout of all words C ( w : the cout of word w C ( t : the cout of tag t C( t 1,t 2 t 1 : the cout of bgrams where tag s followed by tag t 2 C( t, t t tag tag 1 2, t 1 t 3 ( w t tag t 3 : the cout of trgrams where s followed by tag t ad the by C, : the cout of word w tagged wth 2

T ( w 1, = arg max t Fgure 1: Equato 4 C λ ( t 2, t 1, t ( t, t C + λ ( t 1, t ( t 1 2 = C C 1, 1 2 1 1 C + λ3 C ( t C( w, t C( t The statstcs s the used to estmate the probabltes as follows: C = C ( t ( t t 1 = ( t t, t ( t ( 1, t ( C t C t 2 1 = ( w t C = 1 ( 2, t 1, t (, t C t C t 2 ( w, t C ( t 1 Substtutg ths Equato 3, we get the fal equato Fgure 1. Ths expresso ca easly be traslated to a computer program. Tables for the varous statstcs ca be bult from a aotated corpus, ad a sutable algorthm ca terate over the dfferet combatos of probabltes to fd the maxmum. The aïve mplemetato, makg a exhaustve search, wll ot suffce for seteces loger tha a few words though, sce the algorthms expaso s polyomal. The Vterb algorthm s usually appled ths stuato, as t s qute sestve to ay legth of a setece. 3 The SUC corpus The corpus used s the SUC 1.0 corpus, created betwee 1989-1996 as part of a research project by the Departmets of Lgustcs at Stockholm Uversty ad Umeå Uversty. The selecto of texts s a collecto of moder Swedsh prose frst publshed 1990 or later. Each of the more tha oe mllo words s aotated by the SUC tagset, whch s lsted appedx A. The corpus s dvded to 500 fles cotag roughly 2000 aotated words each. 4 The mplemetato The program was developed C++ usg the Stadard Template Lbrary for the tables. It s dvded to two major parts. The frst part bulds the statstcs tables by parsg a set of aotated fles. The secod part mplemets the algorthm Fgure 1 ad produces a suggesto of taggg for a text, oe setece at a tme. 4.1 Trag ad evaluato The goal for ths project was to mplemet the basc model for probablstc taggg a computer program. The program was to be traed o the SUC corpus, ad part of the corpus should be set asde for test ad evaluato. Ukow words should be hadled a smple maer. The corpus was dvded to 50 radomly selected fles each for the optmzg ad test sets, ad the remag 400 for the tra set. A complete evaluato of the program was performed as follows: The trag phase cosst of collectg the statstcs from the tra set. Ths meas coutg words, bgrams, trgrams etc ad store the result tables. I the optmzg phase each text the optmzg set s read ad tagged by the program usg the equato Fgure 1 wth dfferet sets of values for the three lambdas. The values used are betwee 0 ad 1 steps of 0.1. The taggg s the compared wth the correct oe, ad of course the lambda values gvg the best

result are the optmal values for that partcular trag set. A sample output ca be see appedx B. Durg the test phase the test set s tagged the same way, usg the optmal lambda values, ad the result s compared wth the correct tags. I the frst evaluato sesso the program oly hadled words kow from the test set. Ths was mplemeted the optmzg ad test phases by smply lettg the program skp seteces that cotaed ukow words. Two dfferet sets of lambda values produced the same best result for the optmzg set. The values [0.5 0.4 0.1] ad [0.7 0.2 0.1] for λ, 1 λ ad 2 λ 3 respectvely each resulted 95.90% correct taggg of the put. Ivestgatg both these lambda settgs for the test set resulted 95.98% ad 95.94% correct. For the secod evaluato sesso hadlg of ukow words was mplemeted as always taggg t as a ou sgular. The tag selected from the SUC tagset was NN UTR SIN IND NOM. The optmal lambda values were [0.6 0.3 0.1] wth a 89.85% correct taggg. The correspodg value for the test set was 89.24%. To mprove the result, the hadlg of ukow words was modfed to use formato from the bgram table bult durg the trag. For each of the possble tags the tagset the bgram table was searched for bgrams startg wth that tag. From the collecto of bgrams foud, the oe wth the hghest cout was selected. I ths bgram, the secod tag was selected as the best caddate for taggg a ukow word followg the frst tag. Ths was put to a best-from-bgram table, show appedx C. Ths tme the optmal lambda values were [0.5 0.4 0.1] wth 90.25% correct taggg. The correspodg value for the test set vas 89.73%. The output from the latter sesso s the cotet of appedx B. 4.2 Iteractve sesso Ths secto wll show a few examples of the program teractve mode taggg text from the cosole. > Jag ser e hök flyga. Jag N UTR SIN DEF SUB ser VB RS AKT e DT UTR SIN IND hök NN UTR SIN IND NOM flyga VB INF AKT. DL MAD The put le s Jag ser e hök flyga ( I see a hawk fly ad the tagger outputs oe le for each toke wth a suggested tag from the SUC tagset. All these words are kow from the trag, ad the taggg s correct. > Jag ser e falk flyga. Jag N UTR SIN DEF SUB ser VB RS AKT e DT UTR SIN IND falk NN UTR SIN IND NOM (? flyga VB INF AKT. DL MAD If we exchage hök ( hawk for falk ( falco the tagger ecouters a ukow word, as deoted by the questo mark. I ths case the best-from-bgram table s cosulted, ad a ou s suggested. > Jag ser e glada flyga. Jag N UTR SIN DEF SUB ser VB RS AKT e DT UTR SIN IND glada JJ OS UTR/NEU LU 1 flyga VB INF AKT. DL MAD Next, we chage the brd to glada ( kte. Ths word s also a adjectve Swedsh ( happy/merry. The trag set cotas the adjectve form, but ot the ou form. Ths results the tagger fdg the sequece Determer/Sgular Adjectve/lural the most probable, eve though t s a rare combato. I the test setece the sequece s part or two trgrams, [VB DT JJ] 2 ad [DT JJ VB]. A 1 Ths le s trucated. The tag s JJ OS UTR/NEU LU IND/DEF NOM. 2 These are abbrevatos for brevty. The fgures refer to the complete tags as show the example.

vestgato of the tables shows that the cout for those trgrams s 0 ad 1. The bgram cout s 1. If we vestgate the correspodg values for the prevous example, Determer/Sgular Nou/Sgular, the umber of trgrams are 850 for [VB DT NN] ad 49 for [DT NN VB]. The bgram cout s 8189. I fact, the word glada does appear oce as a ou the SUC corpus. If the tagger s traed o the full corpus t wll the tag the test setece above correctly. Ths of course a cocdet, ad the basc problem of complete formato of the possble taggg for a word remas. Usg the cosole put t s terestg to test how well the tagger hadles ambgutes. > Ser du m m? Ser VB RS AKT du N UTR SIN DEF SUB m S UTR SIN DEF m NN UTR SIN IND NOM? DL MAD Ser du m m? ( Do you see the expresso o my face?. The ambguous m s resolved cotext. A setece that s too dffcult s ot hard to costruct: > Var var var och e e gåg? Var N UTR SIN IND SUB/OBJ var VB RT AKT var VB RT AKT och KN e DT UTR SIN IND e DT UTR SIN IND gåg NN UTR SIN IND NOM? DL MAD Var var var och e e gåg? ( Where were each ad everyoe oce?. Several words ths setece have multple taggg possbltes. Table 1 cotas a lst of the words, ther possble tags ad the umber of occurreces of the combato word/tag. The tagger fals to catch the multword expresso var och e ( each ad everyoe. After trag the tagger o the full corpus the setece was tested aga: > Var var var och e e gåg? Var HA var VB RT AKT var N UTR SIN IND SUB/OBJ och KN e N UTR SIN IND SUB/OBJ e DT UTR SIN IND gåg NN UTR SIN IND NOM? DL MAD Ths tme the expresso was otced. Oce aga ths s a cocdetal, ad t s easy to fd smlar cases where the tagger wll fal. The dfferece betwee trag from the tra set ad full corpus s ear a 25% crease of parsed tokes (933629 vs. 1166896, ad the resultg mprovemets are a smple verfcato of the fact that, geeral, the larger the corpus the better. Var 35 HA 44 VB RT AKT 9 N UTR SIN IND SUB/OBJ 3 DT UTR SIN IND 10 VB IM AKT var 5070 VB RT AKT 58 N UTR SIN IND SUB/OBJ 132 HA 13 VB IM AKT 49 DT UTR SIN IND 1 AB 3 VB INF AKT 3 NN NEU SIN IND NOM och 25562 KN e 13139 DT UTR SIN IND 336 N UTR SIN IND SUB/OBJ 315 RG UTR SIN IND NOM 3 UO 8 AB 2 NN UTR SIN IND NOM gåg 433 NN UTR SIN IND NOM Table 1: ossble tags for words the example setece. 5 Cocluso The goal for ths project was to mplemet a model for probablstc part-of-speech taggg a computer program ad evaluate t usg the SUC corpus. The results of the trag are

very ecouragg, cosderg that the taggg algorthm s more or less a straghtforward mplemetato of the equato Fgure 1. I the troducto the two ma problems for part-of-speech taggg were metoed, ambguty ad ukow words. Resolvg ambguty s part of the desg of the probablstc model, ad as has bee brefly demostrated t geerally work better the larger the trag data s. Hadlg of ukow words s a part that has to be added to the model. I the curret mplemetato t s very rudmetary. Judgg from the dfferece result betwee the three evaluato sessos there s much to ga mprovg ths part of the program. Aother problem that appeared durg testg s the problem of the program havg a complete set of taggg optos for a toke. The tagger wll produce a taggg based o avalable data, of course, but as show the example the result would sometmes have bee better hadled by the ukow word route. Ths s a hard problem to solve, ad would requre some heurstcs brdgg taggg based o kow words ad taggg suggested by the hadler of ukow words. A dfferet, ad complemetary, approach s of course tryg to avod the problem by gvg the tagger as complete trag data as possble. Refereces Carlberger, J. ad V. Ka. Implemetg a effcet part-of-speech tagger. Software ractce ad Experece, 29(9:815 832, 1999. E. Charak, C. Hedrckso, N. Jacobso, ad M. erkowtz. Equatos for part-of-speech taggg. I 11th Natoal Cof. Artfcal Itellgece, :784 789,1993.

A. The SUC tagset. Category Code AB DL DT HA HD H HS IE IN JJ KN NN C L M N S RG RO SN UO VB Feature Code Category Adverb Delmter (uctuato Determer Iterrogatve/Relatve Adverb Iterrogatve/Relatve Determer Iterrogatve/Relatve roou Iterrogatve/Relatve ossessve Iftve Marker Iterjecto Adjectve Cojucto Nou artcple artcle roper Nou roou reposto ossessve Cardal Number Ordal Number Subjucto Foreg Word Verb Feature UTR Commo (Utrum Geder NEU Neutre Geder MAS Mascule Geder UTR/NEU Uderspecfed Geder - Uspecfed Geder SIN Sgular Number LU lural Number SIN/LU Uderspecfed Number - Uspecfed Number IND Idefte Defteess DEF Defte Defteess IND/DEF Uderspecfed Defteess - Uspecfed Defteess NOM Nomatve Case GEN Getve Case SMS Compoud Case - Uspecfed Case OS ostve Degree KOM Comparatve Degree SUV Superlatve Degree SUB Subject roou Form OBJ Object roou Form

SUB/OBJ Uderspecfed roou Form RS reset Verb Form RT reterte Verb Form INF Iftve Verb Form SU Supum Verb Form IM Imperatve Verb Form AKT Actve Voce SFO S-form Voce KON Subjuctve Mood RF erfect erfect AN Abbrevato Form

B. Output from test sesso, ukow word = best-from-bgram [ 0.0 0.0 1.0 ] total: 116190, correct:98263, = 84.5710 % [ 0.0 0.1 0.9 ] total: 116190, correct:101528, = 87.3810 % [ 0.0 0.2 0.8 ] total: 116190, correct:102307, = 88.0515 % [ 0.0 0.3 0.7 ] total: 116190, correct:102820, = 88.4930 % [ 0.0 0.4 0.6 ] total: 116190, correct:103198, = 88.8183 % [ 0.0 0.5 0.5 ] total: 116190, correct:103441, = 89.0275 % [ 0.0 0.6 0.4 ] total: 116190, correct:103594, = 89.1591 % [ 0.0 0.7 0.3 ] total: 116190, correct:103741, = 89.2857 % [ 0.0 0.8 0.2 ] total: 116190, correct:103849, = 89.3786 % [ 0.0 0.9 0.1 ] total: 116190, correct:103920, = 89.4397 % [ 0.0 1.0 0.0 ] total: 116190, correct:103698, = 89.2486 % [ 0.1 0.0 0.9 ] total: 116190, correct:102115, = 87.8862 % [ 0.1 0.1 0.8 ] total: 116190, correct:102752, = 88.4345 % [ 0.1 0.2 0.7 ] total: 116190, correct:103157, = 88.7830 % [ 0.1 0.3 0.6 ] total: 116190, correct:103445, = 89.0309 % [ 0.1 0.4 0.5 ] total: 116190, correct:103641, = 89.1996 % [ 0.1 0.5 0.4 ] total: 116190, correct:103842, = 89.3726 % [ 0.1 0.6 0.3 ] total: 116190, correct:103937, = 89.4543 % [ 0.1 0.7 0.2 ] total: 116190, correct:104050, = 89.5516 % [ 0.1 0.8 0.1 ] total: 116190, correct:104106, = 89.5998 % [ 0.1 0.9 0.0 ] total: 116190, correct:104023, = 89.5284 % [ 0.2 0.0 0.8 ] total: 116190, correct:102868, = 88.5343 % [ 0.2 0.1 0.7 ] total: 116190, correct:103289, = 88.8966 % [ 0.2 0.2 0.6 ] total: 116190, correct:103576, = 89.1436 % [ 0.2 0.3 0.5 ] total: 116190, correct:103758, = 89.3003 % [ 0.2 0.4 0.4 ] total: 116190, correct:103927, = 89.4457 % [ 0.2 0.5 0.3 ] total: 116190, correct:104021, = 89.5266 % [ 0.2 0.6 0.2 ] total: 116190, correct:104103, = 89.5972 % [ 0.2 0.7 0.1 ] total: 116190, correct:104152, = 89.6394 % [ 0.2 0.8 0.0 ] total: 116190, correct:104023, = 89.5284 % [ 0.3 0.0 0.7 ] total: 116190, correct:103266, = 88.8768 % [ 0.3 0.1 0.6 ] total: 116190, correct:103612, = 89.1746 % [ 0.3 0.2 0.5 ] total: 116190, correct:103800, = 89.3364 % [ 0.3 0.3 0.4 ] total: 116190, correct:103941, = 89.4578 % [ 0.3 0.4 0.3 ] total: 116190, correct:104076, = 89.5740 % [ 0.3 0.5 0.2 ] total: 116190, correct:104166, = 89.6514 % [ 0.3 0.6 0.1 ] total: 116190, correct:104200, = 89.6807 % [ 0.3 0.7 0.0 ] total: 116190, correct:104046, = 89.5482 % [ 0.4 0.0 0.6 ] total: 116190, correct:103517, = 89.0929 % [ 0.4 0.1 0.5 ] total: 116190, correct:103787, = 89.3252 % [ 0.4 0.2 0.4 ] total: 116190, correct:103955, = 89.4698 % [ 0.4 0.3 0.3 ] total: 116190, correct:104070, = 89.5688 % [ 0.4 0.4 0.2 ] total: 116190, correct:104194, = 89.6755 % [ 0.4 0.5 0.1 ] total: 116190, correct:104244, = 89.7186 % [ 0.4 0.6 0.0 ] total: 116190, correct:104071, = 89.5697 % [ 0.5 0.0 0.5 ] total: 116190, correct:103704, = 89.2538 % [ 0.5 0.1 0.4 ] total: 116190, correct:103927, = 89.4457 % [ 0.5 0.2 0.3 ] total: 116190, correct:104051, = 89.5525 % [ 0.5 0.3 0.2 ] total: 116190, correct:104192, = 89.6738 % [ 0.5 0.4 0.1 ] total: 116190, correct:104258, = 89.7306 % [ 0.5 0.5 0.0 ] total: 116190, correct:104103, = 89.5972 % [ 0.6 0.0 0.4 ] total: 116190, correct:103817, = 89.3511 % [ 0.6 0.1 0.3 ] total: 116190, correct:104026, = 89.5309 % [ 0.6 0.2 0.2 ] total: 116190, correct:104157, = 89.6437 %

[ 0.6 0.3 0.1 ] total: 116190, correct:104247, = 89.7211 % [ 0.6 0.4 0.0 ] total: 116190, correct:104089, = 89.5852 % [ 0.7 0.0 0.3 ] total: 116190, correct:103917, = 89.4371 % [ 0.7 0.1 0.2 ] total: 116190, correct:104118, = 89.6101 % [ 0.7 0.2 0.1 ] total: 116190, correct:104207, = 89.6867 % [ 0.7 0.3 0.0 ] total: 116190, correct:104086, = 89.5826 % [ 0.8 0.0 0.2 ] total: 116190, correct:103957, = 89.4716 % [ 0.8 0.1 0.1 ] total: 116190, correct:104148, = 89.6359 % [ 0.8 0.2 0.0 ] total: 116190, correct:104065, = 89.5645 % [ 0.9 0.0 0.1 ] total: 116190, correct:103966, = 89.4793 % [ 0.9 0.1 0.0 ] total: 116190, correct:104023, = 89.5284 % [ 1.0 0.0 0.0 ] total: 116190, correct:103242, = 88.8562 %

C. best-from-bgram table Collected from the bgram table, bult durg trag.the kow tag pots to the most commo tag followg t. Sorted by the umber of occurreces the tra set. 14442: NN UTR SIN IND NOM -->> 12567: -->> NN UTR SIN DEF NOM 8853: JJ OS UTR SIN IND NOM -->> NN UTR SIN IND NOM 8830: IE -->> VB INF AKT 8189: DT UTR SIN IND -->> NN UTR SIN IND NOM 7524: NN UTR SIN DEF NOM -->> 7235: VB RS AKT -->> AB 6547: <s> -->> 6227: NN UTR LU IND NOM -->> 6160: JJ OS UTR/NEU LU IND/DEF NOM -->> NN UTR LU IND NOM 5912: AB -->> 5876: M NOM -->> M NOM 5761: NN NEU SIN IND NOM -->> 5052: N UTR SIN DEF SUB -->> VB RT AKT 4524: DL MID -->> KN 4377: N NEU SIN DEF SUB/OBJ -->> VB RS AKT 3680: JJ OS UTR/NEU SIN DEF NOM -->> NN UTR SIN DEF NOM 3678: H - - - -->> VB RS AKT 3601: VB RT AKT -->> AB 3595: KN -->> NN UTR SIN IND NOM 3546: DT NEU SIN IND -->> NN NEU SIN IND NOM 3378: VB INF AKT -->> 3329: DT UTR SIN DEF -->> JJ OS UTR/NEU SIN DEF NOM 3213: JJ OS NEU SIN IND NOM -->> NN NEU SIN IND NOM 3009: NN NEU SIN DEF NOM -->> 2857: DL MAD -->> DL MID 2846: L -->> 2563: RG NOM -->> NN UTR LU IND NOM 2510: NN NEU LU IND NOM -->> 2086: VB RS SFO -->> 2083: SN -->> N UTR SIN DEF SUB 2067: S UTR SIN DEF -->> NN UTR SIN IND NOM 2007: NN UTR LU DEF NOM -->> 1988: DT UTR/NEU LU DEF -->> JJ OS UTR/NEU LU IND/DEF NOM 1717: DL AD -->> DL MAD 1683: DT NEU SIN DEF -->> JJ OS UTR/NEU SIN DEF NOM 1513: UO -->> UO 1490: AB OS -->> 1456: VB INF SFO -->> 1434: N UTR LU DEF SUB -->> VB RS AKT 1393: VB SU AKT -->> 1377: N UTR SIN IND SUB -->> VB RS AKT 1339: N UTR/NEU SIN/LU DEF OBJ -->> 1250: C RF UTR SIN IND NOM -->> NN UTR SIN IND NOM 1232: NN UTR SIN DEF GEN -->> NN UTR SIN IND NOM 1176: HA -->> N UTR SIN DEF SUB 1138: M GEN -->> NN UTR SIN IND NOM 970: N UTR/NEU LU DEF SUB -->> VB RS AKT 924: C RS UTR/NEU SIN/LU IND/DEF NOM -->> NN UTR SIN IND NOM 923: S UTR/NEU SIN/LU DEF -->> NN UTR SIN IND NOM 893: VB RT SFO -->> 871: S NEU SIN DEF -->> NN NEU SIN IND NOM 849: C RF UTR/NEU LU IND/DEF NOM -->> NN UTR LU IND NOM 821: VB SU SFO -->>

725: JJ KOM UTR/NEU SIN/LU IND/DEF NOM -->> NN UTR SIN IND NOM 721: NN NEU LU DEF NOM -->> 681: IN -->> DL MID 673: S UTR/NEU LU DEF -->> NN UTR LU IND NOM 619: N UTR SIN DEF OBJ -->> DL MAD 607: N UTR SIN DEF SUB/OBJ -->> VB RS AKT 574: JJ OS UTR/NEU SIN/LU IND/DEF NOM -->> NN UTR SIN IND NOM 557: N NEU SIN IND SUB/OBJ -->> 517: NN NEU SIN DEF GEN -->> NN UTR SIN IND NOM 502: H NEU SIN IND -->> VB RS AKT 481: JJ OS UTR/NEU LU IND NOM -->> NN UTR LU IND NOM 444: C RF NEU SIN IND NOM -->> NN NEU SIN IND NOM 444: DT UTR/NEU LU IND -->> NN UTR LU IND NOM 436: NN AN -->> RG NOM 433: C RF UTR/NEU SIN DEF NOM -->> NN UTR SIN DEF NOM 425: N UTR SIN IND SUB/OBJ -->> 417: AB KOM -->> KN 385: RO NOM -->> NN UTR SIN DEF NOM 359: NN UTR - - SMS -->> KN 343: NN UTR LU DEF GEN -->> NN UTR SIN IND NOM 335: DT UTR/NEU SIN IND -->> NN UTR SIN IND NOM 312: DT UTR/NEU LU IND/DEF -->> NN UTR LU IND NOM 305: DT UTR/NEU SIN/LU IND -->> NN UTR SIN IND NOM 301: N UTR/NEU LU DEF OBJ -->> DL MAD 290: JJ SUV UTR/NEU SIN/LU DEF NOM -->> NN UTR SIN DEF NOM 289: N UTR/NEU LU IND SUB/OBJ -->> 285: JJ OS MAS SIN DEF NOM -->> NN UTR SIN DEF NOM 281: JJ OS UTR SIN IND/DEF NOM -->> NN UTR SIN IND NOM 243: AB SUV -->> 200: NN UTR SIN IND GEN -->> NN UTR SIN IND NOM 196: AB AN -->> RG NOM 175: VB IM AKT -->> AB 156: RG UTR SIN IND NOM -->> NN UTR SIN IND NOM 146: N UTR LU DEF OBJ -->> 141: NN NEU - - SMS -->> KN 139: N UTR/NEU LU DEF SUB/OBJ -->> VB RS AKT 131: NN UTR LU IND GEN -->> NN UTR SIN IND NOM 130: NN NEU SIN IND GEN -->> NN UTR SIN IND NOM 126: HD UTR SIN IND -->> NN UTR SIN IND NOM 115: JJ OS NEU SIN IND/DEF NOM -->> NN NEU SIN IND NOM 108: NN NEU LU DEF GEN -->> NN UTR SIN IND NOM 104: DT UTR SIN IND/DEF -->> NN UTR SIN IND NOM 103: RG UTR/NEU SIN DEF NOM -->> NN UTR SIN DEF NOM 95: RG NEU SIN IND NOM -->> NN NEU SIN IND NOM 92: JJ SUV UTR/NEU SIN/LU IND NOM -->> 92: H UTR SIN IND -->> VB RS AKT 86: HD UTR/NEU LU IND -->> NN UTR LU IND NOM 75: NN NEU LU IND GEN -->> NN UTR SIN IND NOM 59: KN AN -->> M NOM 58: RG SMS -->> KN 56: JJ SUV UTR/NEU LU DEF NOM -->> 56: HD NEU SIN IND -->> NN NEU SIN IND NOM 52: HS DEF -->> NN UTR SIN IND NOM 33: NN - - - - -->> DL MAD 33: JJ OS MAS SIN DEF GEN -->> NN UTR SIN IND NOM 32: H UTR/NEU LU IND -->> VB RS AKT 31: M SMS -->> KN 30: NN UTR - - - -->> 30: DT NEU SIN IND/DEF -->> NN NEU SIN IND NOM 27: C RF MAS SIN DEF NOM -->> NN UTR SIN DEF NOM

27: NN - - - SMS -->> KN 25: JJ OS UTR - - SMS -->> KN 23: VB KON RS AKT -->> N UTR/NEU SIN/LU DEF OBJ 22: N MAS SIN DEF SUB/OBJ -->> VB RT AKT 19: VB AN -->> M NOM 14: VB KON RT AKT -->> JJ OS NEU SIN IND NOM 14: JJ SUV MAS SIN DEF NOM -->> NN UTR SIN IND NOM 14: JJ OS UTR/NEU LU IND/DEF GEN -->> NN UTR SIN IND NOM 14: AB SMS -->> KN 13: DT MAS SIN DEF -->> NN UTR SIN IND NOM 12: L SMS -->> KN 11: JJ OS UTR/NEU SIN DEF GEN -->> NN UTR SIN IND NOM 9: JJ AN -->> NN UTR SIN IND NOM 8: RO MAS SIN IND/DEF NOM -->> NN UTR SIN IND NOM 8: RG MAS SIN DEF NOM -->> H - - - 6: JJ SUV UTR/NEU LU IND NOM -->> NN UTR LU IND NOM 5: JJ OS UTR/NEU SIN/LU IND NOM -->> NN UTR LU IND NOM 4: RO GEN -->> NN UTR SIN IND NOM 4: C RF UTR/NEU LU IND/DEF GEN -->> NN UTR SIN IND NOM 4: DT UTR/NEU SIN DEF -->> NN UTR SIN DEF NOM 4: DT MAS SIN IND -->> NN UTR SIN IND NOM 3: VB SMS -->> KN 3: RG GEN -->> NN UTR SIN IND NOM 3: NN NEU - - - -->> DL MAD 3: JJ OS UTR/NEU - - SMS -->> KN 2: RO SMS -->> KN 2: AN -->> NN UTR SIN IND NOM 2: C RF UTR/NEU SIN DEF GEN -->> NN UTR SIN IND NOM 2: C RF MAS SIN DEF GEN -->> NN UTR SIN IND NOM 2: JJ KOM UTR/NEU SIN/LU IND/DEF SMS -->> KN 2: JJ KOM UTR/NEU SIN/LU IND/DEF GEN -->> NN UTR SIN IND NOM 1: VB KON RT SFO -->> DT NEU SIN IND 1: VB IM SFO -->> 1: C RS UTR/NEU SIN/LU IND/DEF GEN -->> NN UTR SIN DEF NOM 1: C RF UTR SIN IND GEN -->> NN UTR LU IND NOM 1: C AN -->> NN UTR SIN IND NOM 1: JJ SUV MAS SIN DEF GEN -->> JJ OS UTR/NEU SIN DEF NOM 1: JJ OS UTR SIN IND GEN -->> NN UTR SIN IND NOM 1: H NEU SIN IND SMS -->> KN 1: DT AN -->> RG NOM