NoSta-D: A Corpus of German Non-standard Varieties
|
|
|
- Priscilla Terry
- 10 years ago
- Views:
Transcription
1 NoSta-D: A Corpus of German Non-standard Varieties Stefanie Dipper 1, Anke Lüdeling 2, Marc Reznicek 2 Ruhr-Universität Bochum 1 Humboldt-Universität zu Berlin 2 Abstract Until recently, most research in computational linguistics has been done on newspaper texts. Nowadays, the focus has been extended to other types of language data. This means that many linguistic descriptions and automatic tools need to be adapted or extended to non-newspaper language. The non-standard varieties corpus of German (NoSta-D) will provide a first gold standard for evaluation and training data of dependency analysis, named entity recognition and coreference resolution for out-of-domain text types. 1 Introduction Newspaper texts have been among the first texts that were available electronically. Even nowadays, they represent texts that can be accessed without any problem via the internet, and often adhere to a format that can be processed rather easily. As a consequence, investigations in corpus and computational linguistics have focussed mainly on newspaper texts. Only recently have research interests been extended to other text types. Due to that bias, current-state annotation tools and guidelines are tuned towards newspaper texts, which in turn has become the de facto standard variety. A growing pool of studies on different text types and varieties (e.g.
2 chat and blog data from the internet, learner data, historical texts) demonstrate the limits of the current systems. Many linguistic structures occurring in these non-standard varieties 1 are not covered by the tag sets and annotation schemes currently in use. In this short paper, we present a pilot corpus of non-standard varieties for German, NoSta-D. It has been compiled as part of the Clarin- D curation project Linguistic Annotation of Non-standard Varieties Guidelines and Best Practices at Humboldt-University Berlin and Ruhr- University Bochum. This data will be used to identify shortcomings of current guidelines and tools for dependency analysis, named entities and coreference annotations. One of the project results will be the corpus including gold standard annotations at all three annotation levels. It will be made freely available. The paper is structured as follows. In Section 2, we present the subcorpora that our corpus consists of. In Section 3, we address preprocessing steps and the different kinds of annotations that will be added to the corpus. Section 4 conludes the paper. 2 The Corpus We carefully selected five non-standard varieties with the aim as to cover a broad range of linguistic variation and non-standard phenomena: historical data, chat data, spoken data, learner data, and literary prose, see the overview in Table 1. All subcorpora stem from already existing research projects. 2 In addition, a subset of the newspaper corpus TüBa-D/Z has been included to provide a baseline for further annotation evaluations. We only included subcorpora that were free of copyright restrictions. 3. As the corpus has to be manually annotated, within a limited amount of time, the amount of text for each variety that will be annotated in the 1 The term standard is not intended as a normative concept, but refers to the de facto standard language found in newspaper texts. 2 The literary-prose corpus is new but covers the growing demand in ehumanities. 3 For licencing TüBa-D/Z, see tuebadz-license.html
3 first round is quite small (~ 300 sentences or utterances, ~ 7,000 tokens). Careful selection assures that the included passages and texts show a high rate of interesting linguistic structures. xxx hier sollten wir zu allen Subkorpora Refs haben! xxx Subcorpus Variety # Tokens Provider 1 DDB historical 2,348 Berlin Anselm Corpus historical 4,705 Bochum 2 Dortmunder Chat Corpus chat 6,664 Dortmund 3 Bematac spoken ~7,000 Berlin[3] 4 Falko learner 6,762 Berlin 5 Kafka: Der Prozeß literary prose 7,294 Tübingen 6 Tüba-D/Z newspaper ~7,000 Tübingen Table 1: NoSta-D corpus design: the subcorpora 2.1 Formats In conformity with the proposals of the ISO Technical group TC37/SC4 4, all data are stored in stand-off formats to allow for unrestricted later addition of annotations. As part of the curation project, converters are being developed to assure interchangeability between the TCF format used in the Clarin-D webservice environment Weblicht (Hinrichs et al. [4]) and the manual annotation tool WebAnno (xx hier eine Ref? xxx) on the one hand, and more generic corpus formats such as PAULA (Chiarcos et al. [2] for storage and relannis for the corpus search tool ANNIS (Zeldes et al. [14]) on the other hand. 4
4 3 Preprocessing and Annotations 3.1 Tokenization The very first processing step consists of marking word and sentence boundaries. Current tokenizers usually cannot deal with many spelling phenomena of non-standard data. For instance, chat data contain emoticons and other types of special symbols in various forms, see Ex. (1). Sentence boundary detection is especially difficult with historical data. They either do not use any punctuation mark, or else they use punctuations to indicate some kind of phrase boundary, see Ex. (2). Furthermore, word boundaries in historical data also diverge from modern boundaries. For instance, wiltu corresponds to the modern sequence willst du want you. (1) a. LANTOOO :))) b. tag quaki : ) c. *mal guck wo quaki sich nu hinstelt*g* (2) Wiltu nu gvter menſche eynen guten bowm ſeen vnd wiltu gute frucht an dyner zele brengen ſo ſalt u dich vben an guten werken If you good human want to see a good tree and if you want to bring good fruit to your soul then you should exercise in good deeds 3.2 Normalization Data used as training or evaluation data in computational linguistics must be consistently annotated. Hence, finding ways to assure consistent decisions for inconsistent data is an important issue (see Balsa et al. [1]). Ex (3) shows an example from the chat data. In standardized German, the preposition mit with selects dative case. In the chat data, mit occurs with accusative case (of course, such typos can also occur in newspaper texts). The example shows the original chat data (Chat), along with a normalized version (Norm) and an English gloss (Gloss).
5 (3) Chat ich versteh mich mit jeden[acc] man Norm Ich verstehe mich mit jedem[dat] Mann Gloss I understand myself with any man An obvious way to deal with such errors would be to relax the condition on case. For instance, annotators would be told to annotate the NP following the preposition mit always as the object of the preposition, regardless of the NP s case. In many cases, though, relaxation of grammatical restrictions is not a solution because this can lead to multiple competing interpretations. This issue is known from learner corpus research (see Lüdeling et al. [6]). Consider Ex. (4). This sentence, which is ungrammatical, can be fixed in two ways: In option Norm1, dass is considered an orthographic variant of das, thus providing the obligatory article of the count noun examen exam (which is missing). In this case, however, the word order is marked: in the unmarked order, the adverb morgen tomorrow would follow the verb kommen come. In the alternative option Norm2, dass is considered a subordinate conjunction. Then the last three tokens would have to be switched and the obligatory article would be missing. Elements normalized as described have been put in italics in the example. (4) Learner Ich denke, dass examen soll kommen morgen Gloss I think that exam should tomorrow come Norm1 Ich denke, das Examen soll morgen kommen Norm2 Ich denke, dass das Examen morgen kommen soll Obviously, the choice of normalization makes an important difference for dependency parsing. Providing just one robust interpretation of such data counteracts all efforts in variationist linguistics to contrast those different expressions. In such cases we will therefore include different normalizations to make ambiguity visible (for competing target hypotheses in learner language see Reznicek et al. [11]).
6 3.3 Annotations Where necessary, data will be normalized before further annotation takes place. Next, the data will be automatically POS-tagged and lemmatized (applying TreeTagger [12] and RFTagger [13]) and manually corrected. For further annotation levels, we selected levels that (i) represent core tasks of computational linguistics, (ii) would provide us with interesting non-standard phenomena, and (iii) illustrate different data structures of annotation: sentence-internal pointer relations for dependency annotations, span-based annotations for named entities, and cross-sentential pointers for coreference annotations. Dependency relations Comparative studies on syntactic annotations have shown that languages with relatively free word order can be described more accurately with dependency relations than with constituent structures (e.g. Nivre et al. [8]). This holds for German as well (e.g. Kübler & Prokic [5]). Dependency relations can deal more flexibly with word-order variation than constituent structures since they do not rely on adjacency. This should make them suitable for the annotation of data from non-standard varieties, such as Ex. (4). One of the goals of our project is to investigate to what extent dependency theory is able to deal with the broad range of variation that we observe in NoSta-D. Coreference In certain varieties of non-standard language, coreference annotation faces special problems. For instance, certain topic constituents in spoken language can be dropped, cf. Ex. (5). xxx hier ein echtes Bsp? xxx (5) A: Warst du auf dem Konzert von Lu Hang? B: Kenn ich nicht. A: Have you been to the concert by Lu Hang? B: (That guy) I don t know. Named Entities Finally, we annotate named entities as an instance of span-based annotation. We chose this xxxx hier fiel mir nichts ein. Sind
7 chat-daten speziell hier? z.b. weil die Teilnehmer auch immer in der 3.Person genannt werden vom Chat-System? ( Emon betritt den Raum ) 3.4 Querying in ANNIS3 As has been mentioned in Section 2.1, the corpus will be searchable via the corpus search tool ANNIS. In its third version 5, a range of features that are characteristic of non-standard varieties, will be covered, e.g. overlapping token layers of multiple speakers in spoken data, or competing tokenizations (e.g. historical vs. modern word boundaries in historical data). The screenshot in?? shall give an outlook of the corpus in its final version. Hier ein Falko-Beispiel mit allen Annotationen bauen 4 Conclusion In this paper, we have presented NoSta-D, a corpus of German non-standard varieties. The goal of our project is (i) to come up with a (pilot) reference 5 ANNIS3 will be released in short: annis/
8 corpus of non-standard data, (ii) to test the coverage of current annotation guidelines or linguistic descriptions, and (iii) to evaluate the performance of state-of-the-art tools. Based on the experiences that we will make during this enterprise, we will come up with extended guidelines and best practices as to how to deal with such data. References [1] Balsa, J.; Lopes, G. (2000) A Distributed Approach for a Robust and Evolving NLP System. Christodoulakis, D.N. (Ed.): Proceedings of the 2nd International Conference on Natural Language Processing (NLP 2000). Patras, Greece. Berlin [u.a.]: Springer, [2] Chiarcos, C.; Dipper, S.; Götze, M.; Ritz, J.; Stede, M. (2008) A Flexible Framework for Integrating Annotations from Different Tools and Tagsets. Proceeding of the Conference on Global Interoperability for Language Resources, Hong Kong. [3] Giesel,L.; Klapi,M; Krüger,D.;Nunberger,I.;Rasskazova,O.; Sauer,S. (to appear) Berlin Map Task Corpus (BeMaTaC):Aufbau eines L1- Vergleichskorpus zur Untersuchung gesprochener Lernersprache. Poster to be published at CL-Postersession, DFGS 2013, Potsdam. [4] Hinrichs,M.;Zastrow,T.; Hinrichs,E. (2011) WebLicht: Web-based LRT Services in a Distributed escience Infrastructure. Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 10), Valetta, Malta:ELRA. [5] Kübler, S.; Prokic, J. (2006) Why is German Dependency Parsing more Reliable than Constituent Parsing? Proceedings of the 5th International Workshop on Treebanks and Linguistic Theories (TLT 2006). Prague, Czech Republic. [6] Lüdeling, A.; Doolittle, S.; Hirschmann, H.; Schmidt, K.; Walter, M. (2008) Das Lernerkorpus Falko. DAF 45 (2),
9 [7] Mitkov, Ruslan (2008) Corpora for Anaphora and Coreference Resolution. Lüdeling, A.;Kytö, M. (Eds.): Corpus Linguistics: An international Handbook. Berlin; New York: Mouton de Gruyter (Handbooks of Linguistics and Communication Science, 29,1), [8] Nivre, J.; Nilsson, J.; Hall, J.; Chanev, A.; Eryigit, G.; Kübler, S. (2007) MaltParser: A Language-Independent System for Data- Driven Dependency Parsing Natural Language Engineering 13 (1), [9] Rayson, P.; Stevenson, M. (2008) Sense and Semantic Tagging. Lüdeling, A.;Kytö, M. (Eds.): Corpus Linguistics: An international Handbook. Berlin; New York:Mouton de Gruyter (Handbooks of Linguistics and Communication Science, 29,1), [10] Reznicek, M.; Lüdeling, A.; Schwantuschke, F. (2012) Das Falko-Handbuch: Korpusaufbau und Annotationen. Version 2.0. Institut für deutsche Sprache und Linguistik, Humboldt-Universität zu Berlin. ng/falko/falko-handbuch_korpusaufbau%20und%20annotationen_v2.01.pdf, [11] Reznicek, M.; Lüdeling, A.; Hirschmann, H. (to appear) Competing Target Hypotheses in the Falko Corpus: A Flexible Multi-Layer Corpus Architecture. DÃÂaz-Negrillo, A. (Ed.): Automatic Treatment and Analysis of Learner Corpus Data: John Benjamins. [12] Schmid, H. (1994) Probabilistic Part-of-Speech Tagging: Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, [13] Schmid, H.; Laws, F. (2008) Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-grained POS Tagging. Donia Scott (Ed.): 22nd International Conference on Computational Linguistics. Coling 2008, Stroudsburg, Pa: Association for Computational Linguistics,
10 [14] Zeldes, A.; Ritz, J.; Lüdeling, A.; Chiarcos, C. (2009) ANNIS: A Search Tool for Multi-layer Annotated Corpora. Mahlberg,M; Gonzxxxlez-Dxxxaz, V.; Smith,C. (Eds.): Proceedings of Corpus Linguistics. Liverpool:University of Liverpool. [15] Zipser F. (2009) Entwicklung eines Konverterframeworks für linguistisch annotierte Daten auf Basis eines gemeinsamen (Meta-)modells. Diploma dissertation, Humboldt-Universität zu Berlin.
CorA: A web-based annotation tool for historical and other non-standard language data
CorA: A web-based annotation tool for historical and other non-standard language data Marcel Bollmann, Florian Petran, Stefanie Dipper, Julia Krasselt Department of Linguistics Ruhr-University Bochum,
Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF
Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF Susanne Haaf & Bryan Jurish Deutsches Textarchiv 1. The Metadata Format CMDI Metadata? Metadata Format? and more Metadata? Metadata Format?
WebLicht: Web-based LRT services for German
WebLicht: Web-based LRT services for German Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft, University of Tübingen [email protected] Abstract This software
CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
Search Engines Chapter 2 Architecture. 14.4.2011 Felix Naumann
Search Engines Chapter 2 Architecture 14.4.2011 Felix Naumann Overview 2 Basic Building Blocks Indexing Text Acquisition Text Transformation Index Creation Querying User Interaction Ranking Evaluation
Interactive Dynamic Information Extraction
Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken
Exemplar for Internal Achievement Standard. German Level 1
Exemplar for Internal Achievement Standard German Level 1 This exemplar supports assessment against: Achievement Standard 90885 Interact using spoken German to communicate personal information, ideas and
EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language
EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language Thomas Schmidt Institut für Deutsche Sprache, Mannheim R 5, 6-13 D-68161 Mannheim [email protected]
Exemplar for Internal Assessment Resource German Level 1. Resource title: Planning a School Exchange
Exemplar for internal assessment resource German 1.5A for Achievement Standard 90887! Exemplar for Internal Assessment Resource German Level 1 Resource title: Planning a School Exchange This exemplar supports
German Language Resource Packet
German has three features of word order than do not exist in English: 1. The main verb must be the second element in the independent clause. This often requires an inversion of subject and verb. For example:
Machine Learning for natural language processing
Machine Learning for natural language processing Introduction Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 13 Introduction Goal of machine learning: Automatically learn how to
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
Corpuslinguistics Annis 2 -Corpus query tool searching in Falko
Corpuslinguistics Annis 2 -Corpus query tool searching in Falko Marc Reznicek, Anke Lüdeling, Amir Zeldes, Hagen Hirschmann [email protected]... ánd many more members of the HU corpus team
Shallow Parsing with Apache UIMA
Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland [email protected] Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic
Topological Field Chunking in German
Topological Field Chunking in German Jorn Veenstra, Frank H. Müller, Tylman Ule [veenstra,fhm,ule]@sfs.uni-tuebingen.de ESSLLI Summerschool Workshop on Machine Learning Approaches in Computational Linguistics
Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1
Korpus-Abfrage: Werkzeuge und Sprachen Gastreferat zur Vorlesung Korpuslinguistik mit und für Computerlinguistik Charlotte Merz 3. Dezember 2002 Motivation Lizentiatsarbeit: A Corpus Query Tool for Automatically
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
All the English here is taken from students work, both written and spoken. Can you spot the errors and correct them?
Grammar Tasks This is a set of tasks for use in the Basic Grammar Class. Although they only really make sense when used with the other materials from the course, you can use them as a quick test as the
Automatic Detection and Correction of Errors in Dependency Treebanks
Automatic Detection and Correction of Errors in Dependency Treebanks Alexander Volokh DFKI Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany [email protected] Günter Neumann DFKI Stuhlsatzenhausweg
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus Michael Beißwenger, Eric Ehrhardt, Andrea Horbach, Harald Lüngen, Diana Steffen, Angelika Storrer
Name: Klasse: Datum: A. Was wissen Sie schon? What do you know already from studying Kapitel 1 in Vorsprung? True or false?
KAPITEL 1 Jetzt geht s los! Vor dem Anschauen A. Was wissen Sie schon? What do you know already from studying Kapitel 1 in Vorsprung? True or false? 1. Man sagt Hallo in Deutschland. 2. Junge Personen
CLARIN project DiscAn :
CLARIN project DiscAn : Towards a Discourse Annotation system for Dutch language corpora Ted Sanders Kirsten Vis Utrecht Institute of Linguistics Utrecht University Daan Broeder TLA Max-Planck Institute
bound Pronouns
Bound and referential pronouns *with thanks to Birgit Bärnreuther, Christina Bergmann, Dominique Goltz, Stefan Hinterwimmer, MaikeKleemeyer, Peter König, Florian Krause, Marlene Meyer Peter Bosch Institute
FOR TEACHERS ONLY The University of the State of New York
FOR TEACHERS ONLY The University of the State of New York REGENTS HIGH SCHOOL EXAMINATION G COMPREHENSIVE EXAMINATION IN GERMAN Friday, June 15, 2007 1:15 to 4:15 p.m., only SCORING KEY Updated information
AP WORLD LANGUAGE AND CULTURE EXAMS 2012 SCORING GUIDELINES
AP WORLD LANGUAGE AND CULTURE EXAMS 2012 SCORING GUIDELINES Interpersonal Writing: E-mail Reply 5: STRONG performance in Interpersonal Writing Maintains the exchange with a response that is clearly appropriate
An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models
Dissertation (Ph.D. Thesis) An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models Christian Siefkes Disputationen: 16th February
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
Processing Dialogue-Based Data in the UIMA Framework. Milan Gnjatović, Manuela Kunze, Dietmar Rösner University of Magdeburg
Processing Dialogue-Based Data in the UIMA Framework Milan Gnjatović, Manuela Kunze, Dietmar Rösner University of Magdeburg Overview Background Processing dialogue-based Data Conclusion Gnjatović, Kunze,
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
Transcriptions in the CHAT format
CHILDES workshop winter 2010 Transcriptions in the CHAT format Mirko Hanke University of Oldenburg, Dept of English [email protected] 10-12-2010 1 The CHILDES project CHILDES is the CHIld Language
Word Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
Coffee Break German. Lesson 09. Study Notes. Coffee Break German: Lesson 09 - Notes page 1 of 17
Coffee Break German Lesson 09 Study Notes Coffee Break German: Lesson 09 - Notes page 1 of 17 LESSON NOTES ICH SPRECHE EIN BISSCHEN DEUTSCH In this lesson you will learn how to deal with language problems
Satzstellung. Satzstellung Theorie. learning target. rules
Satzstellung learning target Aim of this topic is to explain how to arrange the different parts of a sentence in the correct order. I must admit it took quite a long time to handle this topic and find
Building gold-standard treebanks for Norwegian
Building gold-standard treebanks for Norwegian Per Erik Solberg National Library of Norway, P.O.Box 2674 Solli, NO-0203 Oslo, Norway [email protected] ABSTRACT Språkbanken at the National Library of Norway
Context Grammar and POS Tagging
Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 [email protected] [email protected]
HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN
HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN Yu Chen, Andreas Eisele DFKI GmbH, Saarbrücken, Germany May 28, 2010 OUTLINE INTRODUCTION ARCHITECTURE EXPERIMENTS CONCLUSION SMT VS. RBMT [K.
Modalverben Theorie. learning target. rules. Aim of this section is to learn how to use modal verbs.
learning target Aim of this section is to learn how to use modal verbs. German Ich muss nach Hause gehen. Er sollte das Buch lesen. Wir können das Visum bekommen. English I must go home. He should read
Annotation and Evaluation of Swedish Multiword Named Entities
Annotation and Evaluation of Swedish Multiword Named Entities DIMITRIOS KOKKINAKIS Department of Swedish, the Swedish Language Bank University of Gothenburg Sweden [email protected] Introduction
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations
WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations Seid Muhie Yimam 1,3 Iryna Gurevych 2,3 Richard Eckart de Castilho 2 Chris Biemann 1 (1) FG Language Technology,
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR Arati K. Deshpande 1 and Prakash. R. Devale 2 1 Student and 2 Professor & Head, Department of Information Technology, Bharati
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus Michael Beißwenger 1, Eric Ehrhardt 2, Andrea Horbach 3, Harald Lüngen 4, Diana Steffen 3, Angelika
Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services
Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services speakers: Kai Zimmer and Jörg Didakowski Clarin Workshop WP2 February 2009 BBAW/DWDS The BBAW and its 40 longterm projects
Curriculum Vitae Dr. Amir Zeldes
Curriculum Vitae Dr. Amir Zeldes Postal address: Humboldt-Universität zu Berlin Institut für deutsche Sprache und Linguistik Unter den Linden 6 10099 Berlin Germany Telephone (office): +49 30 2093 9720
LEJ Langenscheidt Berlin München Wien Zürich New York
Langenscheidt Deutsch in 30 Tagen German in 30 days Von Angelika G. Beck LEJ Langenscheidt Berlin München Wien Zürich New York I Contents Introduction Spelling and pronunciation Lesson 1 Im Flugzeug On
Abstract Anaphors in German and English
Abstract Anaphors in German and English Stefanie Dipper 1, Christine Rieger 2, Melanie Seiss 2, and Heike Zinsmeister 2 1 Ruhr-University Bochum, 44780 Bochum, Germany 2 University of Konstanz, 78457 Konstanz,
The finite verb and the clause: IP
Introduction to General Linguistics WS12/13 page 1 Syntax 6 The finite verb and the clause: Course teacher: Sam Featherston Important things you will learn in this section: The head of the clause The positions
Coffee Break German. Lesson 03. Study Notes. Coffee Break German: Lesson 03 - Notes page 1 of 15
Coffee Break German Lesson 03 Study Notes Coffee Break German: Lesson 03 - Notes page 1 of 15 LESSON NOTES ICH KOMME AUS DEUTSCHLAND. UND SIE? In this lesson of Coffee Break German we will learn to talk
Scheme of Work: German - Form 3
Scheme of Work: German - Form 3 Textbook: Schritte International 2 and 3 1 st Term Dates Aims and Objectives No. of 26 th Sep 5 th Nov Unit 13: Neue Kleider Vocabulary: Learning vocabulary regarding clothes
A Mixed Trigrams Approach for Context Sensitive Spell Checking
A Mixed Trigrams Approach for Context Sensitive Spell Checking Davide Fossati and Barbara Di Eugenio Department of Computer Science University of Illinois at Chicago Chicago, IL, USA [email protected], [email protected]
Varieties of specification and underspecification: A view from semantics
Varieties of specification and underspecification: A view from semantics Torgrim Solstad D1/B4 SFB meeting on long-term goals June 29th, 2009 The technique of underspecification I Presupposed: in semantics,
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
SYNTAX AND SEMANTICS OF CAUSAL DENN IN GERMAN TATJANA SCHEFFLER
SYNTAX AND SEMANTICS OF CAUSAL DENN IN GERMAN TATJANA SCHEFFLER Department of Linguistics University of Pennsylvania [email protected] This paper presents a new analysis of denn (because) in German.
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking Anne-Laure Ligozat LIMSI-CNRS/ENSIIE rue John von Neumann 91400 Orsay, France [email protected] Cyril Grouin LIMSI-CNRS rue John von Neumann 91400
Projektgruppe. Information Extraction An Incomplete Overview
Projektgruppe Henning Wachsmuth Information Extraction An Incomplete Overview 12. Mai 2010 1 Einführungsvorträge Verfassen von Seminarvortrag und paper Prof. Dr. Gregor Engels, Donnerstag 15.4., 16h-18h
Breatling the Meaning of Tag Sets in CMC corpora
An extended tag set for annotating parts of speech in CMC corpora Thomas Bartz 1, Michael Beißwenger 1, Eric Ehrhardt 2, Angelika Storrer 2 1) 2) International Research Days: Social Media and CMC Corpora
LINGUISTIC SUPPORT IN "THESIS WRITER": CORPUS-BASED ACADEMIC PHRASEOLOGY IN ENGLISH AND GERMAN
ELN INAUGURAL CONFERENCE, PRAGUE, 7-8 NOVEMBER 2015 EUROPEAN LITERACY NETWORK: RESEARCH AND APPLICATIONS Panel session Recent trends in Bachelor s dissertation/thesis research: foci, methods, approaches
Natural Language Database Interface for the Community Based Monitoring System *
Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University
Checklist Use this checklist to find out how much English you already know. Grundstufe 1 (Common European Framework: A1 Level)
Der XL Test: Was können Sie schon? Schätzen Sie Ihre Sprachkenntnisse selbst ein! Sprache: Englisch Mit der folgenden e haben Sie die Möglichkeit, Ihre Fremdsprachenkenntnisse selbst einzuschätzen. Die
Reference Determination for Demonstrative Pronouns
Peter Bosch & Carla Umbach University of Osnabrück, Germany This paper discusses results from a corpus study of German demonstrative and personal pronouns and from a reading time experiment in which we
Statistical Machine Translation Lecture 4. Beyond IBM Model 1 to Phrase-Based Models
p. Statistical Machine Translation Lecture 4 Beyond IBM Model 1 to Phrase-Based Models Stephen Clark based on slides by Philipp Koehn p. Model 2 p Introduces more realistic assumption for the alignment
POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition
POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics
QCF Qualifications in Languages. German. Level 1
QCF Qualifications in Languages German For teaching from September 2012 SPOKEN GERMAN: Communicating Personal Information Assessment Criteria 1.1 and 1.2 Task: Meeting New People CC- BY-SA 2007 Miraceti
German for beginners in 7 lessons
Olena Shypilova German for beginners in 7 lessons Study course 2012 German for beginners in 7 Lessons Thank you for choosing and joining our on-line German course. The course consists of 7 lessons. Due
Scheme of Work: German - Form 3 2015/16
Scheme of Work: German - Form 3 2015/16 Textbook: Schritte International 2 and 3 Four per week 1 st Term Dates Aims and Objectives No. of Lessons 28 th Sept - 12 th Oct Unit 12: Der Kunde ist König (cont.)
Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing
Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing Peter Bouda, Vera Ferreira, António Lopes Centro Interdisciplinar de Documentação Linguística e Social
Curriculum Map. Discipline: Foreign Language Course: German 7-8 Weighted Teacher: Julie Bequette/Tom Neal
Curriculum Map Discipline: Foreign Language Course: German 7-8 Weighted Teacher: Julie Bequette/Tom Neal August/September: 28 A 3a comprehend oral-audio presentations 28 A 3b follow multistep instructions
Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde
Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
Linguistics to Structure Unstructured Information
Linguistics to Structure Unstructured Information Authors: Günter Neumann (DFKI), Gerhard Paaß (Fraunhofer IAIS), David van den Akker (Attensity Europe GmbH) Abstract The extraction of semantics of unstructured
Special Topics in Computer Science
Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS
1003 Inhaltsverzeichnis
1003 Einführung - Freizeit und Hobbys Springen wir ins Thema 2 How do people spend their freetime? What sorts of hobbies and activities do you enjoy? 1 Vokabelkasten 4 Introduction to the Vokabelkasten
Factored Translation Models
Factored Translation s Philipp Koehn and Hieu Hoang [email protected], [email protected] School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom
Integrating Annotation Tools into UIMA for Interoperability
Integrating Annotation Tools into UIMA for Interoperability Scott Piao, Sophia Ananiadou and John McNaught School of Computer Science & National Centre for Text Mining The University of Manchester UK {scott.piao;sophia.ananiadou;john.mcnaught}@manchester.ac.uk
Automatic Text Analysis Using Drupal
Automatic Text Analysis Using Drupal By Herman Chai Computer Engineering California Polytechnic State University, San Luis Obispo Advised by Dr. Foaad Khosmood June 14, 2013 Abstract Natural language processing
Elena Chiocchetti & Natascia Ralli (EURAC) Tanja Wissik & Vesna Lušicky (University of Vienna)
Elena Chiocchetti & Natascia Ralli (EURAC) Tanja Wissik & Vesna Lušicky (University of Vienna) VII Conference on Legal Translation, Court Interpreting and Comparative Legilinguistics Poznań, 28-30.06.2013
Why language is hard. And what Linguistics has to say about it. Natalia Silveira Participation code: eagles
Why language is hard And what Linguistics has to say about it Natalia Silveira Participation code: eagles Christopher Natalia Silveira Manning Language processing is so easy for humans that it is like
Julia Englert, PhD Student. Curriculum Vitae
Julia Englert, PhD Student Curriculum Vitae Name: Nationality: Julia Valerie Englert German Date of Birth: April 14 th 1987 E-Mail: [email protected] Phone 0049-681-302-68563 Office Address: Saarland
CS4025: Pragmatics. Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature
CS4025: Pragmatics Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature For more info: J&M, chap 18,19 in 1 st ed; 21,24 in 2 nd Computing Science, University of
The Use of Text Corpora in Lexical Research
The Use of Text Corpora in Lexical Research Stefan Engelberg Workshop, Universitatea din Bucureşti, November 2008 http://www.ids-mannheim.de/ll/lehre/engelberg/ Webseite_CorpLex/CorpLex.html [email protected]
Annotation Guidelines for Dutch-English Word Alignment
Annotation Guidelines for Dutch-English Word Alignment version 1.0 LT3 Technical Report LT3 10-01 Lieve Macken LT3 Language and Translation Technology Team Faculty of Translation Studies University College
The ALeSKo learner corpus: Design annotation quantitative. [short title for running heads: The ALeSKo learner corpus]
1 The ALeSKo learner corpus: Design annotation quantitative analyses [short title for running heads: The ALeSKo learner corpus] Heike Zinsmeister University of Konstanz Linguistics Department Box 185 78457
