POS Tagging for Historical Texts with Sparse Training Data
1 POS Tagging for Historical Texts with Sparse Training Data
Department of Linguistics, Ruhr-University Bochum, Germany
The 7th Linguistic Annotation Workshop & Interoperability with Discourse, August 8-9, 2013, Sofia, Bulgaria
3 Motivation
Goal: (semi-)automatic annotation of historical texts
The (main) problem with historical data:
- High variance in spelling
- No or very little training data to retrain existing tools
A possible solution:
- Spelling normalization as a preprocessing step
Example ("woman"): vrowe, vrouwe, fraw, frouw, frauwe → frau
4 Outline
1 Data: Anselm Corpus, GerManC-GS Corpus
2 Normalization: Method & Procedure, Results
3 POS Tagging: Capitalization & Punctuation, Results
4 Conclusion
5 Anselm Corpus
- Collection of Early New High German (ENHG) texts
- Interrogatio Sancti Anselmi de Passione Domini (Questions by Saint Anselm about the Lord's Passion)
- More than 50 manuscripts and prints (in German)
- 14th-16th centuries
- Various German dialects
[Figure: sample from an Anselm manuscript]
6 Anselm Corpus
ENHG 1: do meín chind híet geezzen...
ENHG 2: Do my kynt hatte geſzen...
ENHG 3: do mín kínt hatt geſſen...
Norm:   da mein kind hatte gegessen...
Gloss:  as my child had eaten
7 GerManC-GS Corpus
GerManC:
- Created at the University of Manchester
- Representative corpus of historical, written German from 1650 to 1800
- Different dialectal regions and text genres
GerManC-GS:
- Subcorpus of GerManC with gold standard annotations, lemmatization, POS
8 GerManC-GS Corpus
Serm 1: es ist ein koͤstlich Ding, Dir dancken
Norm:   es ist ein köstliches Ding, dir (zu) danken
Gloss:  it is an exquisite thing to thank you
Serm 2: gieb meinen Worten das Feuer, das die Herzen entzuͤndet
Norm:   gib meinen Worten das Feuer, das die Herzen entzündet
Gloss:  give my words the fire to ignite hearts
9 Texts used for the evaluation
Corpus       Date   Name           Tokens
Anselm       15c    Berlin         5,399
Anselm       15c    Melk           4,…
GerManC-GS   …      LeichSermon    2,…
GerManC-GS   …      JubelFeste     2,…
GerManC-GS   …      Gottesdienst   2,292
10 Normalization methods
Described previously in Bollmann (2012)
Combination of different normalization methods:
1 Wordlist mapping
2 Rule-based normalization (character rewrite rules)
3 Distance-based normalization (weighted Levenshtein distance)
11 Procedure
- Training & evaluation parts as subsets from the same text
- How much training data is needed? Different sizes of the training parts
- Random sub-sampling:
  n tokens for training
  1,000 tokens for evaluation
- Average of 10 random training & evaluation sets
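The sub-sampling procedure can be sketched roughly as follows. This is a minimal illustration, not the setup used for the slides: the function names, the fixed seed, and the choice to sample individual token positions (rather than contiguous passages) are assumptions.

```python
import random


def subsample_splits(tokens, n_train, n_eval=1000, repeats=10, seed=0):
    """Yield (train, evaluation) token lists drawn from one text without overlap.

    Assumption: token positions are sampled individually and uniformly.
    """
    if n_train + n_eval > len(tokens):
        raise ValueError("text too short for the requested split sizes")
    rng = random.Random(seed)
    for _ in range(repeats):
        # Sampling positions without replacement keeps the two parts disjoint.
        picked = rng.sample(range(len(tokens)), n_train + n_eval)
        train = [tokens[i] for i in picked[:n_train]]
        evaluation = [tokens[i] for i in picked[n_train:]]
        yield train, evaluation


def average_accuracy(tokens, n_train, evaluate, **kwargs):
    """Average the score returned by evaluate(train, evaluation) over all splits."""
    scores = [evaluate(train, evaluation)
              for train, evaluation in subsample_splits(tokens, n_train, **kwargs)]
    return sum(scores) / len(scores)
```

Here `evaluate` stands for any hypothetical callback that trains a normalizer on the training part and returns its accuracy on the evaluation part.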
12 Normalization accuracy
Accuracy by number of training tokens:
Text          Baseline  100      …        …        …,000
Berlin        23.05%    68.99%   75.02%   79.14%   81.83%
Melk          39.32%    69.10%   74.39%   75.74%   77.98%
LeichSermon   72.71%    77.96%   80.51%   82.85%   87.23%
JubelFeste    79.47%    88.50%   89.98%   91.87%   93.13%
Gottesdienst  83.41%    93.77%   95.24%   95.27%   95.56%
15 How good is POS tagging on spelling-normalized data?
16 Capitalization & Punctuation
Spelling variation is not the only problem:
- Inconsistent or missing capitalization
- Inconsistent or missing punctuation marks
- Extinct wordforms
- Syntactic peculiarities
- ...?
18 Tagging with handicaps on modern data
Combination of two modern German corpora:
- TIGER corpus (Brants et al., 2002)
- Tüba-D/Z version 6 (Telljohann et al., 2004)
Tagging accuracy with 10-fold CV, using RFTagger (Schmid and Laws, 2008):
Condition                                                    Accuracy
Original                                                     96.85%
Lowercased                                                   96.50%
No punctuation and sentence boundaries (SB)                  96.22%
Lowercased + no punctuation and sentence boundaries (SB)     95.74%
Tagging without capitalization/punctuation is viable
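The handicapped conditions could be simulated on tokenized modern text along these lines. This is only a sketch under assumptions: the function and parameter names are illustrative, and a token is treated as punctuation when it consists solely of ASCII punctuation characters, which may differ from the criterion actually used.

```python
import string


def is_punctuation(token):
    """Assumption: a token counts as punctuation if all its characters are ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def apply_handicaps(sentences, lowercase=False, no_punct_sb=False):
    """Return a copy of the corpus with capitalization and/or punctuation + SB removed.

    sentences: list of sentences, each a list of token strings.
    """
    result = []
    for sentence in sentences:
        tokens = [t for t in sentence if not (no_punct_sb and is_punctuation(t))]
        if lowercase:
            tokens = [t.lower() for t in tokens]
        result.append(tokens)
    if no_punct_sb:
        # Dropping sentence boundaries: merge everything into one token stream.
        result = [[t for tokens in result for t in tokens]]
    return result
```

Calling it with both flags set corresponds to the "Lowercased + no punctuation and SB" condition.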
20 Tagging on historical data
Text          Orig.    Automatically normalized              Gold
                       …        …        …        …,000
Berlin        28.65%   58.68%   74.89%   75.95%   78.03%   87.07%
Melk          44.70%   69.63%   74.02%   76.24%   78.66%   87.74%
LeichSermon   67.95%   72.87%   74.63%   75.85%   78.01%   81.04%
JubelFeste    82.26%   82.64%   83.62%   86.52%   87.74%   90.03%
Gottesdienst  88.07%   88.84%   90.27%   91.30%   91.65%   92.27%
21 Tagging on historical data
[Figure: average accuracy (%) of POS tagging against size of training part (tokens), for Melk and JubelFeste]
23 Problems that remain...
Extinct wordforms:
  vn machot in zehant geſvnt.
  und macht ihn sofort gesund
  and cures him immediately
Syntactic/semantic variation:
  die faelle so aus schwacheit geschehen
  die fälle so/die? aus schwachheit geschehen
  the cases that occur out of weakness
24 Problems that remain...
Domain adaptation:
  sieh anselm
  Look, Anselm!
- Imperative verb forms are rare in modern corpora (TIGER/Tüba: 0.02%, Berlin text: 0.91%)
- Religious vocabulary
25 Conclusion
Automatic annotation of historical data
Dealing with spelling variation via normalization:
- Small amounts of training data are already very beneficial, e.g. from 23% to 69% accuracy with 100 tokens for training
POS tagging on data without capitalization, punctuation, and sentence boundaries:
- Only minor impact on accuracy (1.1 percentage points on modern data)
Syntactic/semantic variation and domain adaptation remain obstacles for improving the results
26 Thank you for listening!
27 References
Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Proceedings of ACRH-2, Lisbon, Portugal.
Brants, S., Dipper, S., Hansen, S., Lezius, W., & Smith, G. (2002). The TIGER treebank. In E. Hinrichs & K. Simov (Eds.), Proceedings of TLT 2002, Sozopol, Bulgaria.
Schmid, H., & Laws, F. (2008). Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of COLING '08, Manchester, UK.
Telljohann, H., Hinrichs, E., & Kübler, S. (2004). The Tüba-D/Z treebank: annotating German with a context-free backbone. In Proceedings of LREC 2004, Lisbon, Portugal.
28 Methods: Wordlist Mapping
- Word-to-word mappings
- Learned from an aligned corpus
- Chooses most frequent candidate wordform
- No knowledge about spelling variation
Example (historical form → normalization, frequency):
  do → da 50
  meín → mein 30
  myn → mein 30
  mín → mein 30
  ...
  hatt → hatte 50
  hatt → hat 20
  hatt → hut 1
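A wordlist mapper of this kind can be sketched in a few lines. The function names are illustrative, and returning None for unseen words is an assumption made so that a later component could take over.

```python
from collections import Counter, defaultdict


def train_wordlist(aligned_pairs):
    """Learn a historical → modern mapping from (historical, modern) token pairs."""
    counts = defaultdict(Counter)
    for historical, modern in aligned_pairs:
        counts[historical][modern] += 1
    # Keep only the most frequent normalization candidate per historical wordform.
    return {hist: cands.most_common(1)[0][0] for hist, cands in counts.items()}


def normalize_wordlist(word, mapping):
    """Return the learned normalization, or None if the word was never seen."""
    return mapping.get(word)
```

With the frequencies from the example, `hatt` would map to `hatte` (50 occurrences) rather than `hat` (20) or `hut` (1).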
29 Methods: Rule-Based
Context-aware character rewrite rules:
  v → u / # _ n
  v n d → u n d
- Learned from aligned training corpus
- Levenshtein distance: minimum number of edit operations to transform string a into string b
- Modified algorithm: outputs the actual edit operations
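One way to obtain the edit operations, rather than only their number, is to backtrace through the standard Levenshtein table. The sketch below shows that general technique; it is not necessarily the modified algorithm referred to on the slide, and the function name and output format are assumptions.

```python
def edit_operations(source, target):
    """Return (distance, ops); ops is a list of (source_char, target_char) pairs.

    An empty string marks the missing side of an insertion or deletion;
    identical pairs are identity steps.
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or identity
    # Walk back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and source[i - 1] == target[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            ops.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((source[i - 1], ""))
            i -= 1
        else:
            ops.append(("", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops
```

For the pair vnd/und this yields one substitution (v, u) followed by two identity steps, from which a rule with its left and right context could be read off.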
31 Methods: Rule-Based
Substitution rules: v → u / # _ n
Identity rules:     n → n / e _ #
Insertion rules:    ε → l / o _ l
Deletion rules:     f → ε / u _ f
- Identity and non-identity rules intended to compete
- Additional lexicon lookup to prevent nonsense words
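Applying such rules might look roughly like the sketch below. Several things here are assumptions rather than the method from the slides: rules compete by a training frequency stored in a hypothetical `count` field, contexts are matched on the historical string, and insertion rules (ε on the left-hand side) are left out for brevity.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    source: str  # character to rewrite
    target: str  # replacement ("" for deletion, same as source for identity)
    left: str    # left context ("#" = word boundary)
    right: str   # right context ("#" = word boundary)
    count: int   # hypothetical training frequency used to rank competing rules


def apply_rules(word, rules, lexicon) -> Optional[str]:
    """Rewrite each character with the best matching rule; reject non-lexicon output."""
    padded = "#" + word + "#"
    out = []
    for i in range(1, len(padded) - 1):
        left, char, right = padded[i - 1], padded[i], padded[i + 1]
        matching = [r for r in rules
                    if r.source == char and r.left == left and r.right == right]
        # Identity and non-identity rules compete; the most frequent one wins.
        out.append(max(matching, key=lambda r: r.count).target if matching else char)
    candidate = "".join(out)
    # Lexicon lookup guards against nonsense words.
    return candidate if candidate in lexicon else None
```

A result of None signals that no acceptable candidate was produced.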
33 Methods: Distance-Based
Levenshtein distance: count number of edit operations
  myn → mein: d = 2
Weighted Levenshtein distance: assigns weights to edit operations
  e.g., d(y, ei) = 0.8
- Edit operations are directed/asymmetric
- Edit operations may span multiple characters
  myn → mein: d = 0.8
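A weighted distance with directed, multi-character operations can be sketched as a forward dynamic program. The weight table format (a dict keyed by source and target substrings) and the default cost of 1.0 for unlisted single-character operations are assumptions for illustration.

```python
def weighted_distance(source, target, weights, default=1.0):
    """Weighted edit distance.

    weights maps (source_substring, target_substring) to a cost. Entries are
    directed: ("y", "ei") does not imply ("ei", "y"). Unlisted single-character
    substitutions, insertions, and deletions cost `default`.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = dist[i][j]
            if here == inf:
                continue
            if i < n and j < m:  # identity or default substitution
                step = 0.0 if source[i] == target[j] else default
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], here + step)
            if i < n:            # default deletion
                dist[i + 1][j] = min(dist[i + 1][j], here + default)
            if j < m:            # default insertion
                dist[i][j + 1] = min(dist[i][j + 1], here + default)
            for (src, tgt), cost in weights.items():
                if not (src or tgt):
                    continue     # skip a degenerate empty → empty entry
                # Weighted operations may consume several characters at once.
                if source.startswith(src, i) and target.startswith(tgt, j):
                    ni, nj = i + len(src), j + len(tgt)
                    dist[ni][nj] = min(dist[ni][nj], here + cost)
    return dist[n][m]
```

With the single weight d(y, ei) = 0.8, the pair myn/mein is scored via one weighted operation instead of two unit-cost edits.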
34 Methods: Distance-Based
Find lexicon entry with lowest distance to input string
  Input: myn
  Lexicon: ..., main, mein, meine, meins, mine, mini, mimik, ...
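The lexicon search itself can be sketched as a scan that keeps the entries with the lowest distance. The example below uses plain, unweighted Levenshtein distance as a stand-in for the distance function; the names and the tie handling (returning all best entries) are assumptions.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or identity
        prev = cur
    return prev[-1]


def closest_entries(word, lexicon, distance=levenshtein):
    """Return (best_distance, sorted list of lexicon entries at that distance)."""
    scored = [(distance(word, entry), entry) for entry in lexicon]
    best = min(d for d, _ in scored)
    return best, sorted(entry for d, entry in scored if d == best)
```

Passing a weighted distance as the `distance` argument is one way such a lookup could break ties between candidates that plain edit counts cannot separate.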
37 Combining Methods
- Combining methods shown to be beneficial
- Chain combination of normalizers:
  1 Wordlist mapping
  2 Rule-based normalization
  3 Weighted Levenshtein distance
- Better than other orderings
- Better than majority-vote approach
38 Combining Methods
Wordlist Mapping → Success? yes: Done! / no: Rule-Based → Success? yes: Done! / no: Weighted Levenshtein Distance
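The chain can be sketched as a loop over normalizers that stops at the first success. The interface is an assumption: each normalizer is a callable returning a string or None, and leaving the word unchanged when every component fails is a fallback added for the sketch.

```python
def chain_normalize(word, normalizers):
    """Try each (name, normalizer) in order; return (result, name of the component used)."""
    for name, normalize in normalizers:
        result = normalize(word)
        if result is not None:
            return result, name  # Success: later components are not consulted.
    return word, "unchanged"     # Assumed fallback if no component succeeds.


# Stand-in components for illustration only; real ones would be trained.
example_chain = [
    ("wordlist", {"do": "da"}.get),
    ("rules", lambda w: "und" if w == "vnd" else None),
    ("distance", lambda w: "mein" if w == "myn" else None),
]
```

Ordering the list as wordlist, rules, distance mirrors the chain on the slide.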
CorA: A web-based annotation tool for historical and other non-standard language data
CorA: A web-based annotation tool for historical and other non-standard language data Marcel Bollmann, Florian Petran, Stefanie Dipper, Julia Krasselt Department of Linguistics Ruhr-University Bochum,
More informationAutomatic Detection and Correction of Errors in Dependency Treebanks
Automatic Detection and Correction of Errors in Dependency Treebanks Alexander Volokh DFKI Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany alexander.volokh@dfki.de Günter Neumann DFKI Stuhlsatzenhausweg
More information31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
More informationSearch and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
More informationComplex Predications in Argument Structure Alternations
Complex Predications in Argument Structure Alternations Stefan Engelberg (Institut für Deutsche Sprache & University of Mannheim) Stefan Engelberg (IDS Mannheim), Universitatea din Bucureşti, November
More informationTesting Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
More informationCollecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
More informationCINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
More informationTrameur: A Framework for Annotated Text Corpora Exploration
Trameur: A Framework for Annotated Text Corpora Exploration Serge Fleury (Sorbonne Nouvelle Paris 3) serge.fleury@univ-paris3.fr Maria Zimina(Paris Diderot Sorbonne Paris Cité) maria.zimina@eila.univ-paris-diderot.fr
More informationMicro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
More informationTerminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
More informationLearning Translation Rules from Bilingual English Filipino Corpus
Proceedings of PACLIC 19, the 19 th Asia-Pacific Conference on Language, Information and Computation. Learning Translation s from Bilingual English Filipino Corpus Michelle Wendy Tan, Raymond Joseph Ang,
More informationA Mixed Trigrams Approach for Context Sensitive Spell Checking
A Mixed Trigrams Approach for Context Sensitive Spell Checking Davide Fossati and Barbara Di Eugenio Department of Computer Science University of Illinois at Chicago Chicago, IL, USA dfossa1@uic.edu, bdieugen@cs.uic.edu
More informationAuthor Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
More informationBITS: A Method for Bilingual Text Search over the Web
BITS: A Method for Bilingual Text Search over the Web Xiaoyi Ma, Mark Y. Liberman Linguistic Data Consortium 3615 Market St. Suite 200 Philadelphia, PA 19104, USA {xma,myl}@ldc.upenn.edu Abstract Parallel
More informationSyntactic Transfer Using a Bilingual Lexicon
Syntactic Transfer Using a Bilingual Lexicon Greg Durrett, Adam Pauls, and Dan Klein UC Berkeley Parsing a New Language Parsing a New Language Mozambique hope on trade with other members Parsing a New
More informationAutomatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990
More informationDatabase Design For Corpus Storage: The ET10-63 Data Model
January 1993 Database Design For Corpus Storage: The ET10-63 Data Model Tony McEnery & Béatrice Daille I. General Presentation Within the ET10-63 project, a French-English bilingual corpus of about 2 million
More informationEfficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
More informationBuilding gold-standard treebanks for Norwegian
Building gold-standard treebanks for Norwegian Per Erik Solberg National Library of Norway, P.O.Box 2674 Solli, NO-0203 Oslo, Norway per.solberg@nb.no ABSTRACT Språkbanken at the National Library of Norway
More informationPOSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition
POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics
More informationVerb-cluster variations: A Harmonic Grammar analysis
Verb-cluster variations: A Harmonic Grammar analysis Markus Bader Goethe-Universität Frankfurt New Ways of Analyzing Syntactic Variation Radboud University Nijmegen November 15.-17, 2012 Introduction In
More informationA Joint Sequence Translation Model with Integrated Reordering
A Joint Sequence Translation Model with Integrated Reordering Nadir Durrani, Helmut Schmid and Alexander Fraser Institute for Natural Language Processing University of Stuttgart Introduction Generation
More informationHow To Write A Book On A Historical And Historical Corpus
Multiple Tokenizations in a Diachronic Corpus - Corpus Demo Session Ridges Herbology Thomas Krause, Anke Lüdeling, Carolin Odebrecht& Amir Zeldes Corpus linguistic working group Korpuslinguistik& Morphologie,
More informationTowards exploring the specific influences of wordform frequency, lemma frequency and OLD20 on visual word recognition and reading aloud
Towards exploring the specific influences of wordform frequency, lemma frequency and OLD20 on visual word recognition and reading aloud Lara Kresse, Stefan Kirschner, Stefanie Dipper, Eva Belke It is well
More informationImproving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction
Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Uwe D. Reichel Department of Phonetics and Speech Communication University of Munich reichelu@phonetik.uni-muenchen.de Abstract
More informationApplications of speech-to-text in customer service. Dr. Joachim Stegmann Deutsche Telekom AG, Laboratories
Applications of speech-to-text in customer service. Dr. Joachim Stegmann Deutsche Telekom AG, Laboratories Contents. 1. Motivation 2. Scenarios 2.1 Voice box / call-back 2.2 Quality management 3. Technology
More informationThe PALAVRAS parser and its Linguateca applications - a mutually productive relationship
The PALAVRAS parser and its Linguateca applications - a mutually productive relationship Eckhard Bick University of Southern Denmark eckhard.bick@mail.dk Outline Flow chart Linguateca Palavras History
More informationLecture 13: Validation
Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model
More informationBerlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services
Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services speakers: Kai Zimmer and Jörg Didakowski Clarin Workshop WP2 February 2009 BBAW/DWDS The BBAW and its 40 longterm projects
More informationModule 6 Other OCR engines: ABBYY, Tesseract
Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 1 / 20 Module 6 Other OCR engines: ABBYY, Tesseract Uwe Springmann Centrum für Informations- und Sprachverarbeitung (CIS) Ludwig-Maximilians-Universität
More informationA prototype infrastructure for D Spin Services based on a flexible multilayer architecture
A prototype infrastructure for D Spin Services based on a flexible multilayer architecture Volker Boehlke 1,, 1 NLP Group, Department of Computer Science, University of Leipzig, Johanisgasse 26, 04103
More informationStatistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
More informationConvergence of Translation Memory and Statistical Machine Translation
Convergence of Translation Memory and Statistical Machine Translation Philipp Koehn and Jean Senellart 4 November 2010 Progress in Translation Automation 1 Translation Memory (TM) translators store past
More informationMotivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1
Korpus-Abfrage: Werkzeuge und Sprachen Gastreferat zur Vorlesung Korpuslinguistik mit und für Computerlinguistik Charlotte Merz 3. Dezember 2002 Motivation Lizentiatsarbeit: A Corpus Query Tool for Automatically
More informationHow to make Ontologies self-building from Wiki-Texts
How to make Ontologies self-building from Wiki-Texts Bastian HAARMANN, Frederike GOTTSMANN, and Ulrich SCHADE Fraunhofer Institute for Communication, Information Processing & Ergonomics Neuenahrer Str.
More informationWebLicht: Web-based LRT services for German
WebLicht: Web-based LRT services for German Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft, University of Tübingen firstname.lastname@uni-tuebingen.de Abstract This software
More informationContext Grammar and POS Tagging
Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 dick.chen@lexisnexis.com don.loritz@lexisnexis.com
More informationTibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
More informationProsodic Phrasing: Machine and Human Evaluation
Prosodic Phrasing: Machine and Human Evaluation M. Céu Viana*, Luís C. Oliveira**, Ana I. Mata***, *CLUL, **INESC-ID/IST, ***FLUL/CLUL Rua Alves Redol 9, 1000 Lisboa, Portugal mcv@clul.ul.pt, lco@inesc-id.pt,
More informationTopological Field Chunking in German
Topological Field Chunking in German Jorn Veenstra, Frank H. Müller, Tylman Ule [veenstra,fhm,ule]@sfs.uni-tuebingen.de ESSLLI Summerschool Workshop on Machine Learning Approaches in Computational Linguistics
More informationEnriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -
Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de
More informationChildFreq: An Online Tool to Explore Word Frequencies in Child Language
LUCS Minor 16, 2010. ISSN 1104-1609. ChildFreq: An Online Tool to Explore Word Frequencies in Child Language Rasmus Bååth Lund University Cognitive Science Kungshuset, Lundagård, 222 22 Lund rasmus.baath@lucs.lu.se
More informationTerminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche To cite this version: Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet,
More informationGrading Benchmarks FIRST GRADE. Trimester 4 3 2 1 1 st Student has achieved reading success at. Trimester 4 3 2 1 1st In above grade-level books, the
READING 1.) Reads at grade level. 1 st Student has achieved reading success at Level 14-H or above. Student has achieved reading success at Level 10-F or 12-G. Student has achieved reading success at Level
More informationChapter 2 The Information Retrieval Process
Chapter 2 The Information Retrieval Process Abstract What does an information retrieval system look like from a bird s eye perspective? How can a set of documents be processed by a system to make sense
More informationCoupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort
Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort Pascal Denis and Benoît Sagot Equipe-project ALPAGE INRIA and Université Paris 7 30, rue
More informationInterpreting areading Scaled Scores for Instruction
Interpreting areading Scaled Scores for Instruction Individual scaled scores do not have natural meaning associated to them. The descriptions below provide information for how each scaled score range should
More informationELEVATING FORENSIC INVESTIGATION SYSTEM FOR FILE CLUSTERING
ELEVATING FORENSIC INVESTIGATION SYSTEM FOR FILE CLUSTERING Prashant D. Abhonkar 1, Preeti Sharma 2 1 Department of Computer Engineering, University of Pune SKN Sinhgad Institute of Technology & Sciences,
More informationExtraction and Visualization of Protein-Protein Interactions from PubMed
Extraction and Visualization of Protein-Protein Interactions from PubMed Ulf Leser Knowledge Management in Bioinformatics Humboldt-Universität Berlin Finding Relevant Knowledge Find information about Much
More informationCorpus Design for a Unit Selection Database
Corpus Design for a Unit Selection Database Norbert Braunschweiler Institute for Natural Language Processing (IMS) Stuttgart 8 th 9 th October 2002 BITS Workshop, München Norbert Braunschweiler Corpus
More informationTraining and evaluation of POS taggers on the French MULTITAG corpus
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
More informationComputer-aided Document Indexing System
Journal of Computing and Information Technology - CIT 13, 2005, 4, 299-305 299 Computer-aided Document Indexing System Mladen Kolar, Igor Vukmirović, Bojana Dalbelo Bašić and Jan Šnajder,, An enormous
More informationTowards a Data Model for the Universal Corpus
Towards a Data Model for the Universal Corpus Steven Abney University of Michigan abney@umichedu Steven Bird University of Melbourne and University of Pennsylvania sbird@unimelbeduau Abstract We describe
More informationEvalita 09 Parsing Task: constituency parsers and the Penn format for Italian
Evalita 09 Parsing Task: constituency parsers and the Penn format for Italian Cristina Bosco, Alessandro Mazzei, and Vincenzo Lombardo Dipartimento di Informatica, Università di Torino, Corso Svizzera
More informationChapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
More informationComputer Aided Document Indexing System
Computer Aided Document Indexing System Mladen Kolar, Igor Vukmirović, Bojana Dalbelo Bašić, Jan Šnajder Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 0000 Zagreb, Croatia
More informationPoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy fabio.tamburini@unibo.it Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
More informationModern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability
Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability Ana-Maria Popescu Alex Armanasu Oren Etzioni University of Washington David Ko {amp, alexarm, etzioni,
More informationCoffee Break German Lesson 06
LESSON NOTES WIE VIEL KOSTET DAS? In this episode of Coffee Break German we ll start by learning the numbers from zero to ten and then learn to deal with transactional situations involving paying for things
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationTurkish Radiology Dictation System
Turkish Radiology Dictation System Ebru Arısoy, Levent M. Arslan Boaziçi University, Electrical and Electronic Engineering Department, 34342, Bebek, stanbul, Turkey arisoyeb@boun.edu.tr, arslanle@boun.edu.tr
More informationNoSta-D: A Corpus of German Non-standard Varieties
NoSta-D: A Corpus of German Non-standard Varieties Stefanie Dipper 1, Anke Lüdeling 2, Marc Reznicek 2 Ruhr-Universität Bochum 1 Humboldt-Universität zu Berlin 2 Abstract Until recently, most research
More informationAnnotation and Evaluation of Swedish Multiword Named Entities
Annotation and Evaluation of Swedish Multiword Named Entities DIMITRIOS KOKKINAKIS Department of Swedish, the Swedish Language Bank University of Gothenburg Sweden dimitrios.kokkinakis@svenska.gu.se Introduction
More informationSentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5
More informationChallenges of Cloud Scale Natural Language Processing
Challenges of Cloud Scale Natural Language Processing Mark Dredze Johns Hopkins University My Interests? Information Expressed in Human Language Machine Learning Natural Language Processing Intelligent
More informationThe SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge White Paper October 2002 I. Translation and Localization New Challenges Businesses are beginning to encounter
More informationPartially Supervised Word Alignment Model for Ranking Opinion Reviews
International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-4 E-ISSN: 2347-2693 Partially Supervised Word Alignment Model for Ranking Opinion Reviews Rajeshwari
More informationExemplar for Internal Achievement Standard. German Level 1
Exemplar for Internal Achievement Standard German Level 1 This exemplar supports assessment against: Achievement Standard 90885 Interact using spoken German to communicate personal information, ideas and
More informationTopics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment
Topics in Computational Linguistics Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment Regina Barzilay and Lillian Lee Presented By: Mohammad Saif Department of Computer