Breatling the Meaning of Tag Sets in CMC corpora
|
|
|
- Madlyn Anderson
- 5 years ago
- Views:
Transcription
1 An extended tag set for annotating parts of speech in CMC corpora Thomas Bartz 1, Michael Beißwenger 1, Eric Ehrhardt 2, Angelika Storrer 2 1) 2) International Research Days: Social Media and CMC Corpora for the ehumanities Journées Internationales de recherche «Médias sociaux et corpus de communication médiée par les réseaux. Annotation, analyse, données libres» octobre 2015
2 Part-of-spech tagging for CMC corpora Without a part-of-speech (PoS) annotation: only very limited querying options; no basis for advanced processing steps which require a useful linguistic preprocessing (e.g., parse trees). The Problem: Part-of-speech taggers (NLP tools in general) do not perform very well on written CMC discourse: new elements which don t fit into any established PoS category (emoticons, addressings, action words, hashtags); speedwriing phenomena (typos, omission of characters, norm-deviating use of whitespace); colloquial (Wazzup?) and creative spellings (nyce2meetu)
3 The problem A: Rufst an wenn du köpenick bist! B: Ja B: Wir sehn uns ja gleich A: Jo B: Ersatzverkejr B: Ich hab keine ahnubg wo der hinfährr-.- I have no idea, where it [that one] is going to-.-
4 The problem A: Rufst an wenn du köpenick bist! B: Ja B: Wir sehn uns ja gleich A: Jo B: Ersatzverkejr B: Ich hab keine ahnubg wo der hinfährr-.- I have no idea, where it [that one] is going to-.- WebLicht- Toolchain:
5 The problem A: Rufst an wenn du köpenick bist! B: Ja B: Wir sehn uns ja gleich A: Jo B: Ersatzverkejr B: Ich hab keine ahnubg wo der hinfährr-.- I have no idea, where it [that one] is going to-.- WebLicht- Toolchain:
6 The problem A: Rufst an wenn du köpenick bist! B: Ja B: Wir sehn uns ja gleich A: Jo B: Ersatzverkejr B: Ich hab keine ahnubg wo der hinfährr-.- I have no idea, where it [that one] is going to-.-
7 The problem Problems on several levels of the processing process: Tokenization problems: The tokens created in the tokenization step do not represent relevant units of the linguistic structure (e.g., due to speedwriting phenomena) Categorization problems: There s an adequate tag in the tag set but the tagger can t assign it (e.g., in the case of norm-deviating colloquial & dialect spellings) Category problems: The tagger can t assign an adequate tag because there s no adequate tag in the tag set (e.g., for emoticons, action words, addressings, hashtags, clitics which are typical of dialogical language in informal registers ) Cf. Bartz et al. (2014)
8 Ways to solve the problem Variant A: Normalization PoS tagging with standard tools Open issues: 1) categories for elements that are missing in PoS tagsets for edited text 2) adapt tools for automatic normalization Variant B: No normalization; PoS tagging of the original data Open issues: 1) categories for elements that are missing in PoS tagsets for edited text 2) improve tokenizers & taggers
9 Designing a basic PoS tag set for German CMC Initiative in CLARIN-D ( ) for updating the canonical STTS through adapting it for genres which its original creators didn t have in focus (Zinsmeister et al. 2014) e.g.: - historical corpora - spoken language corpora - learner corpora -CMC Discussions in the DFG network Empirikom ( , on how to make NLP tools fit for automatically processing & annotating CMC corpora Idea: Let s set up a community shared task on NLP for CMC in order to encourage the developers of NLP tools to adapt their tools & tagging models for CMC (supported by GSCL)
10 What requirements should a basic PoS tag set for CMC meet? It should be compatible with established PoS tag sets interoperability with other (types of) corpora For categories which occur in CMC but which are not CMC-specific: try to be compatible with PoS categories in other (non-cmc) genres ( interoperability of corpora; interesting research questions) For categories which are specific to CMC: Keep it simple so that the use of the categories can easily be learned As long that there s no consensus in the linguistic community about how to integrate CMC elements into part-of-speech typologies: Don t try to install one (and force people to use it because they won t) instead, design your categories as theory-free as possible.
11 STTS 2.0 : A basic PoS tag set for German CMC Basis: The Stuttgart Tübingen Tagset (STTS): de-facto standard for German (focused on PoS tags for the language occuring in edited text / newspaper texts) (Schiller et al. 1999)
12 STTS 2.0 : A basic PoS tag set for German CMC Basis: The Stuttgart Tübingen Tagset (STTS): de-facto standard for German Structure of STTS tags: main category > subcategory
13 STTS 2.0 : A basic PoS tag set for German CMC Basis: The Stuttgart Tübingen Tagset (STTS): de-facto standard for German (focused on PoS tags for the language occuring in edited text / newspaper texts) (Schiller et al. 1999) STTS 2.0 : canonical STTS extended with new categories, but still downward-compatible with STTS (1999) Compatible with the extended STTS for spoken language which is used for PoS tagging the FOLK corpus of spoken German at IDS Mannheim (for phenomena which are not in the canonical STTS and which also occur in spoken language)
14 STTS 2.0 : A basic PoS tag set for German CMC Tag table:
15 STTS 2.0 : A basic PoS tag set for German CMC PoS tag Category Examples I. Tags for phenomena which are specific for CMC / social media discourse: EMO ASC ASCII emoticon :-) :-( ^^ O.O EMO IMG Graphic emoticon AKW Interaction word *lach*, freu, grübel, *lol* HST Hash tag Kreta war super! #urlaub ADR Addressing Wie isset so? URL Uniform resource locator EML address [email protected] II. Tags for phenomena which are typical for spontaneous spoken language in colloquial registers: VV PPER APPR ART VM PPER VA PPER KOUS PPER PPER PPER ADV ART Tags for types of colloquial contractions which are frequent in CMC (APPRART is already existing in STTS 1999) schreibste, machste vorm, überm, fürn willste, darfste, musste haste, biste, isses wenns, weils, obse ichs, dus, ers son, sone PTK IFG Intensitätspartikeln, Fokuspartikeln, Gradpartikeln sehr schön, höchst eigenartig, nur sie, voll geil PTK MA Modal particles Das ist ja / vielleicht doof. Ist das denn richtig so? Das war halt echt nicht einfach. PTK MWL Particle as part of a multi-word lexeme keine mehr, noch mal, schon wieder DM Discourse markers weil, obwohl, nur, also,... with V2 clauses ONO Onomatopoeia boing, miau, zisch
16 KOUS PPER wenns, weils, obse STTS 2.0 : A basic PoS tag set for German CMC PoS tag Category Examples I. Tags for phenomena which are specific for CMC / social media discourse: EMO ASC ASCII emoticon :-) :-( ^^ O.O EMO IMG Graphic emoticon AKW Interaction word *lach*, freu, grübel, *lol* HST Hash tag Kreta war super! #urlaub ADR Addressing Wie isset so? URL Uniform resource locator EML address [email protected] II. Tags for phenomena which are typical for spontaneous spoken language in colloquial registers: VV PPER APPR ART VM PPER VA PPER Tags for types of colloquial contractions which are frequent in CMC (APPRART is already existing in STTS 1999) schreibste, machste vorm, überm, fürn willste, darfste, musste haste, biste, isses
17 AKW Interaction word *lach*, freu, grübel, *lol* HST Hash tag Kreta war super! #urlaub ADR Addressing Wie isset so? URL Uniform resource locator EML address II. Tags for phenomena which are typical for spontaneous spoken language in colloquial registers: VV PPER APPR ART VM PPER VA PPER KOUS PPER PPER PPER ADV ART Tags for types of colloquial contractions which are frequent in CMC (APPRART is already existing in STTS 1999) schreibste, machste vorm, überm, fürn willste, darfste, musste haste, biste, isses wenns, weils, obse ichs, dus, ers son, sone PTK IFG Intensitätspartikeln, Fokuspartikeln, Gradpartikeln sehr schön, höchst eigenartig, nur sie, voll geil PTK MA Modal particles Das ist ja / vielleicht doof. Ist das denn richtig so? Das war halt echt nicht einfach. PTK MWL Particle as part of a multi-word lexeme keine mehr, noch mal, schon wieder DM Discourse markers weil, obwohl, nur, also,... with V2 clauses ONO Onomatopoeia boing, miau, zisch
18 Contractions in chats social chat subcorpus of the Dortmund chat corpus: 21 logfiles / tokens, including 584 occurrences of colloquial contractions
19 Tag set and annotation PoS tagset + annotation guidelines available on the website of the GSCL/ Empirikom shared task on automatic linguistic annotation of CMC (EmpiriST2015).
20 ChatCorpus2CLARIN: Project background Curation project of the CLARIN-D F-AG 1 German Philology Duration: May 2015 February 2016 Project team: Michael Beißwenger (U Dortmund), Angelika Storrer, Eric Ehrhardt (U Mannheim), Harald Lüngen (IDS), Axel Herold (BBAW) + other colleagues at IDS and BBAW The task: Re-modeling of the Dortmund Chat Corpus and samples of other CMC resources compliant with existing standards for the representation of corpora in the Digital Humanities. Integration into the CLARIN-D infrastructures at BBAW and IDS. Main goal: Pave the way for the inclusion of linguistically annotated CMC resources into the CLARIN-D corpus infrastructures and create the prerequisites for investigating linguistic peculiarities of CMC with state-of-the art corpus technology.
21 ChatCorpus2CLARIN: Project background Curation project of the CLARIN-D F-AG 1 German Philology de/kurationsprojekt-1-3-germanistik
22 The corpus Dortmund Chat Corpus logfile documents with 140,240 user postings or 1M words of German chat discourse. Resource for the analysis of linguistic variation in chats including chats from different social/institutional contexts (social chats, advisory chats, learning and teaching, moderated chats in the media context). Annotated in a home-grown XML format ( ChatXML ): (1) basic structure of chat logfiles and postings, (2) selected CMC phenomena, (3) selected metadata.
23 Other corpora / data sets in the project focus German WhatsApp Corpus ( What's up, Deutschland? ) German Wikipedia corpus in DeReKo German News Corpus in DeReKo DWDS Blog Corpus DWDS German Reference Corpus of CMC (DeRiK)
24 Work packages in the project - TEI representation ( CLARIN-D schema ) - CLARINification, legal issues + licensing - enrich the data with additional linguistic annotations (PoS, normalised spellings,...)
25 The vision After its integration into the CLARIN-D infrastructure the resource will be characterized by the following added values: Advanced accessibility and retrieval options; interoperability with other corpus resources that are represented in TEI and with annotation and analysis tools that support the TEI format; advanced querying options (PoS tags, normalized spellings); interoperability with other corpus resources that have been tagged with STTS; advanced options for corpus-based analyses on the peculiarities of CMC discourse as compared to the language of edited text and of spoken language, using the text and speech corpora which are already available in the corpus infrastructures of BBAW and IDS.
26 PoS annotation of the corpus: workflow 1. Automatic tokenisation, PoS annotation & lemmatisation of the chat corpus with tools + tagging models from the BMBF project Schreibgebrauch at U Saarbrücken (Horbach et al. 2014, Horbach et al. 2015) PoS tag set: previous version of STTS 2.0 (Bartz et al. 2014) Representation of the tagging results as additions to the ChatXML format. Standard PoS PoS taggers: Accuracy on on Chat Chat Corpus: ~71% ~71% (vs. (vs. 97% 97% accurracy on on Newspaper) Tagging models models from from the the Schreibgebrauch project: project: Average accuracy on on Chat Chat Corpus: 83.5% 83.5% 2. Manual post-processing of the tagging results using OrthoNormal in FOLKER (preview version 1.2) with an import/export filter for PoS tagged ChatXML (defined by Thomas Schmidt/IDS)
27 Manual post-processing of PoS tagging results with OrthoNormal (Overview of the FOLK tools: Schmidt 2012)
28 Using <w> for the representation of PoS information in our TEI schema <post type="standard" who="#a04" auto="false" rend="color:green"> <p> <w type="vvfin">dachte</w> <w type="pper">ich</w> < type="adv">auch</w> <w type="adv">immer</w> <w type="$(">,</w> <name type="nickname" corresp="#a09"> <w type="ne">monk</w> </name> CLARIN-D TEI schema (documentation): <w type="$.">..</w> <w type= $(">*</w> <w type="akw">heul</w> <w type= $(">*</w> </p> </post> CLARIN-D_schema_draft_for_ representing_cmc_in_tei_(2015) ineli26: dachte ich ich auch immer, monk.... *heul* I I was always thinking the the same, monk.... *crying*
29 References Bartz, Thomas; Beißwenger, Michael; Storrer, Angelika (2014): Optimierung des Stuttgart-Tübingen-Tagset für die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Phänomene, Herausforderungen, Erweiterungsvorschläge. In: Journal for Language Technology and Computational Linguistics 28 (1), Beißwenger, Michael (2013): Das Dortmunder Chat-Korpus. In: Zeitschrift für germanistische Linguistik 41 (1), Extended version: Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2012): A TEI Schema for the Representation of Computer-mediated Communication. In: Journal of the Text Encoding Initiative (jtei) 3. (DOI: /jtei.476). Beißwenger, Michael; Bartz, Thomas; Storrer, Angelika; Westpfahl, Swantje (2015): Tagset und Richtlinie für das PoS-Tagging von Sprachdaten aus Genres internetbasierter Kommunikation. Guideline Document, Dortmund Horbach, Andrea; Steffen, Diana; Thater, Stefan; Pinkal, Manfred (2014): Improving the Performance of Standard Part-of- Speech Taggers for Computer-Mediated Communication. Proceedings of KONVENS 2014, Horbach, Andrea; Thater, Stefan; Steffen, Diana; Fischer, Peter M.; Witt, Andreas; Pinkal, Manfred (2015): Internet Corpora: A Challenge for Linguistic Processing. In: Datenbank-Spektrum 15 (1), Schiller, Anne; Teufel, Simone; Stöckert, Christine (1999): Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). University of Stuttgart: Institut für maschinelle Sprachverarbeitung. Schmidt, Thomas (2012): EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language. In: Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC 12), Istanbul, Turkey: European Language Resources Association (ELRA). Zinsmeister, Heike; Heid, Ulrich; Beck, Kathrin Beck (Eds., 2014): Das STTS-Tagset für Wortartentagging - Stand und Perspektiven. Special issue of the Journal for Language Technology and Computational Linguistics.
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus Michael Beißwenger, Eric Ehrhardt, Andrea Horbach, Harald Lüngen, Diana Steffen, Angelika Storrer
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus Michael Beißwenger 1, Eric Ehrhardt 2, Andrea Horbach 3, Harald Lüngen 4, Diana Steffen 3, Angelika
How To Write A German Reference Corpus Of Computer Mediated Communication
DeRiK: A German Reference Corpus of Computer-Mediated Communication Michael Beißwenger 1, Maria Ermakova 2, Alexander Geyken 2, Lothar Lemnitzer 2, Angelika Storrer 1 1 Department of German Language and
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
WebLicht: Web-based LRT services for German
WebLicht: Web-based LRT services for German Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft, University of Tübingen [email protected] Abstract This software
EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language
EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language Thomas Schmidt Institut für Deutsche Sprache, Mannheim R 5, 6-13 D-68161 Mannheim [email protected]
NoSta-D: A Corpus of German Non-standard Varieties
NoSta-D: A Corpus of German Non-standard Varieties Stefanie Dipper 1, Anke Lüdeling 2, Marc Reznicek 2 Ruhr-Universität Bochum 1 Humboldt-Universität zu Berlin 2 Abstract Until recently, most research
Search Engines Chapter 2 Architecture. 14.4.2011 Felix Naumann
Search Engines Chapter 2 Architecture 14.4.2011 Felix Naumann Overview 2 Basic Building Blocks Indexing Text Acquisition Text Transformation Index Creation Querying User Interaction Ranking Evaluation
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
Exemplar for Internal Achievement Standard. German Level 1
Exemplar for Internal Achievement Standard German Level 1 This exemplar supports assessment against: Achievement Standard 90885 Interact using spoken German to communicate personal information, ideas and
An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models
Dissertation (Ph.D. Thesis) An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models Christian Siefkes Disputationen: 16th February
Building and Annotating a Corpus of German-Language Newsgroups
Building and Annotating a Corpus of German-Language Newsgroups Jasmin Schröck Institut für Deutsche Sprache Mannheim [email protected] Harald Lüngen Institut für Deutsche Sprache Mannheim
Using German corpora for linguistic purposes. Dr. Kathrin Steyer Institut für Deutsche Sprache, Mannheim
Using German corpora for linguistic purposes Dr. Kathrin Steyer Institut für Deutsche Sprache, Mannheim Introduction This talk will give a first impression of the complex field of German corpora and methods
Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1
Korpus-Abfrage: Werkzeuge und Sprachen Gastreferat zur Vorlesung Korpuslinguistik mit und für Computerlinguistik Charlotte Merz 3. Dezember 2002 Motivation Lizentiatsarbeit: A Corpus Query Tool for Automatically
Processing Dialogue-Based Data in the UIMA Framework. Milan Gnjatović, Manuela Kunze, Dietmar Rösner University of Magdeburg
Processing Dialogue-Based Data in the UIMA Framework Milan Gnjatović, Manuela Kunze, Dietmar Rösner University of Magdeburg Overview Background Processing dialogue-based Data Conclusion Gnjatović, Kunze,
FOR TEACHERS ONLY The University of the State of New York
FOR TEACHERS ONLY The University of the State of New York REGENTS HIGH SCHOOL EXAMINATION G COMPREHENSIVE EXAMINATION IN GERMAN Friday, June 15, 2007 1:15 to 4:15 p.m., only SCORING KEY Updated information
Name: Klasse: Datum: A. Was wissen Sie schon? What do you know already from studying Kapitel 1 in Vorsprung? True or false?
KAPITEL 1 Jetzt geht s los! Vor dem Anschauen A. Was wissen Sie schon? What do you know already from studying Kapitel 1 in Vorsprung? True or false? 1. Man sagt Hallo in Deutschland. 2. Junge Personen
GETTING FEEDBACK REALLY FAST WITH DESIGN THINKING AND AGILE SOFTWARE ENGINEERING
GETTING FEEDBACK REALLY FAST WITH DESIGN THINKING AND AGILE SOFTWARE ENGINEERING Dr. Tobias Hildenbrand & Christian Suessenbach, SAP AG Entwicklertag Karlsruhe, 22 May 2014 Ich wollte Mitarbeiter so motivieren,
Exemplar for Internal Assessment Resource German Level 1. Resource title: Planning a School Exchange
Exemplar for internal assessment resource German 1.5A for Achievement Standard 90887! Exemplar for Internal Assessment Resource German Level 1 Resource title: Planning a School Exchange This exemplar supports
Elena Chiocchetti & Natascia Ralli (EURAC) Tanja Wissik & Vesna Lušicky (University of Vienna)
Elena Chiocchetti & Natascia Ralli (EURAC) Tanja Wissik & Vesna Lušicky (University of Vienna) VII Conference on Legal Translation, Court Interpreting and Comparative Legilinguistics Poznań, 28-30.06.2013
CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF
Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF Susanne Haaf & Bryan Jurish Deutsches Textarchiv 1. The Metadata Format CMDI Metadata? Metadata Format? and more Metadata? Metadata Format?
Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services
Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services speakers: Kai Zimmer and Jörg Didakowski Clarin Workshop WP2 February 2009 BBAW/DWDS The BBAW and its 40 longterm projects
A TEI Schema for the Representation of Computer-mediated Communication
A TEI Schema for the Representation of Computer-mediated Communication Michael Beißwenger 1, Maria Ermakova 2, Alexander Geyken 2, Lothar Lemnitzer 2, Angelika Storrer 1 1 Department of German Language
PyCantonese: Cantonese linguistic research in the age of big data
PyCantonese: Cantonese linguistic research in the age of big data Jackson L. Lee University of Chicago http://jacksonllee.com Childhood Bilingualism Research Center, CUHK September 15, 2015 Grammar versus
CorA: A web-based annotation tool for historical and other non-standard language data
CorA: A web-based annotation tool for historical and other non-standard language data Marcel Bollmann, Florian Petran, Stefanie Dipper, Julia Krasselt Department of Linguistics Ruhr-University Bochum,
The Use of Text Corpora in Lexical Research
The Use of Text Corpora in Lexical Research Stefan Engelberg Workshop, Universitatea din Bucureşti, November 2008 http://www.ids-mannheim.de/ll/lehre/engelberg/ Webseite_CorpLex/CorpLex.html [email protected]
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
Modalverben Theorie. learning target. rules. Aim of this section is to learn how to use modal verbs.
learning target Aim of this section is to learn how to use modal verbs. German Ich muss nach Hause gehen. Er sollte das Buch lesen. Wir können das Visum bekommen. English I must go home. He should read
Machine Learning for natural language processing
Machine Learning for natural language processing Introduction Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 13 Introduction Goal of machine learning: Automatically learn how to
Checklist Use this checklist to find out how much English you already know. Grundstufe 1 (Common European Framework: A1 Level)
Der XL Test: Was können Sie schon? Schätzen Sie Ihre Sprachkenntnisse selbst ein! Sprache: Englisch Mit der folgenden e haben Sie die Möglichkeit, Ihre Fremdsprachenkenntnisse selbst einzuschätzen. Die
Dutch Parallel Corpus
Dutch Parallel Corpus Lieve Macken [email protected] LT 3, Language and Translation Technology Team Faculty of Applied Language Studies University College Ghent November 29th 2011 Lieve Macken (LT
Coffee Break German. Lesson 09. Study Notes. Coffee Break German: Lesson 09 - Notes page 1 of 17
Coffee Break German Lesson 09 Study Notes Coffee Break German: Lesson 09 - Notes page 1 of 17 LESSON NOTES ICH SPRECHE EIN BISSCHEN DEUTSCH In this lesson you will learn how to deal with language problems
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche To cite this version: Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet,
Insights into Six Decades of Scientific Practice
DTA-/CLARIN-D-Konferenz Historische Textkorpora für die Geistes- und Sozialwissenschaften Title Insights into Six Decades of Scientific Practice Speaker Coauthors Gerhard Heyer, NLP chair ([email protected])
LEJ Langenscheidt Berlin München Wien Zürich New York
Langenscheidt Deutsch in 30 Tagen German in 30 days Von Angelika G. Beck LEJ Langenscheidt Berlin München Wien Zürich New York I Contents Introduction Spelling and pronunciation Lesson 1 Im Flugzeug On
Coffee Break German. Lesson 03. Study Notes. Coffee Break German: Lesson 03 - Notes page 1 of 15
Coffee Break German Lesson 03 Study Notes Coffee Break German: Lesson 03 - Notes page 1 of 15 LESSON NOTES ICH KOMME AUS DEUTSCHLAND. UND SIE? In this lesson of Coffee Break German we will learn to talk
German for beginners in 7 lessons
Olena Shypilova German for beginners in 7 lessons Study course 2012 German for beginners in 7 Lessons Thank you for choosing and joining our on-line German course. The course consists of 7 lessons. Due
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer Timur Gilmanov, Olga Scrivner, Sandra Kübler Indiana University
Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013
Markus Dickinson Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013 1 / 34 Basic text analysis Before any sophisticated analysis, we want ways to get a sense of text data
TEANLIS - Text Analysis for Literary Scholars
TEANLIS - Text Analysis for Literary Scholars Andreas Müller 1,3, Markus John 2,4, Jonas Kuhn 1,3 (1) Institut für Maschinelle Sprachverarbeitung Universität Stuttgart (2) Institut für Visualisierung und
A CHINESE SPEECH DATA WAREHOUSE
A CHINESE SPEECH DATA WAREHOUSE LUK Wing-Pong, Robert and CHENG Chung-Keng Department of Computing, Hong Kong Polytechnic University Tel: 2766 5143, FAX: 2774 0842, E-mail: {csrluk,cskcheng}@comp.polyu.edu.hk
CLARIN project DiscAn :
CLARIN project DiscAn : Towards a Discourse Annotation system for Dutch language corpora Ted Sanders Kirsten Vis Utrecht Institute of Linguistics Utrecht University Daan Broeder TLA Max-Planck Institute
Selected Papers from the CLARIN 2014 Conference
Selected Papers from the CLARIN 2014 Conference October 24-25, 2014 In Soesterberg (the Netherlands) Editor: Jan Odijk Linköping Electronic Conference Proceedings Selected Papers from the CLARIN 2014 Conference
QCF Qualifications in Languages. German. Level 1
QCF Qualifications in Languages German For teaching from September 2012 SPOKEN GERMAN: Communicating Personal Information Assessment Criteria 1.1 and 1.2 Task: Meeting New People CC- BY-SA 2007 Miraceti
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
Module 6 Other OCR engines: ABBYY, Tesseract
Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 1 / 20 Module 6 Other OCR engines: ABBYY, Tesseract Uwe Springmann Centrum für Informations- und Sprachverarbeitung (CIS) Ludwig-Maximilians-Universität
Topological Field Chunking in German
Topological Field Chunking in German Jorn Veenstra, Frank H. Müller, Tylman Ule [veenstra,fhm,ule]@sfs.uni-tuebingen.de ESSLLI Summerschool Workshop on Machine Learning Approaches in Computational Linguistics
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Database-Supported XML Processors
Database-Supported XML Processors Prof. Dr. Torsten Grust [email protected] Winter 2008/2009 Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 1 Part I Preliminaries Torsten
NATIVE ADVERTISING, CONTENT MARKETING & CO. AUFBRUCH IN EIN NEUES GOLDENES ZEITALTER DES MARKETINGS?
NATIVE ADVERTISING, CONTENT MARKETING & CO. AUFBRUCH IN EIN NEUES GOLDENES ZEITALTER DES MARKETINGS? 2014 in Frankfurt am Main DATUM 11.11.2014 SEITE 2 Christian Paul Stobbe Director Strategy Düsseldorf
Building and Annotating Corpora of Computer-Mediated Communication: Issues and Challenges at the Interface of Corpus and Computational Linguistics
Volume 29 Number 2 2014 ISSN 2190-6858 Journal for Language Technology and Computational Linguistics Building and Annotating Corpora of Computer-Mediated Communication: Issues and Challenges at the Interface
Annotation and Evaluation of Swedish Multiword Named Entities
Annotation and Evaluation of Swedish Multiword Named Entities DIMITRIOS KOKKINAKIS Department of Swedish, the Swedish Language Bank University of Gothenburg Sweden [email protected] Introduction
German Language Resource Packet
German has three features of word order than do not exist in English: 1. The main verb must be the second element in the independent clause. This often requires an inversion of subject and verb. For example:
Year 7. Home Learning Booklet. French and German Skills
Year 7 Home Learning Booklet French and German Skills Name: Tutor Group: Teacher: Given out: 1 st July Hand in: 8 th July Parent/Carer Comment: Staff Comment: Target: This homework booklet practises some
Product Quality and Environmental Standards: The Effect of an International Environmental Agreement on Tropical Timber Trade
Please scroll down for the English version Sehr geehrte Abonnentinnen und Abonnenten, wir freuen uns, Sie per Newsletter über die neuesten Entwicklungen des FIW-Projekts informieren zu dürfen. Dieses Mal
The ALeSKo learner corpus: Design annotation quantitative. [short title for running heads: The ALeSKo learner corpus]
1 The ALeSKo learner corpus: Design annotation quantitative analyses [short title for running heads: The ALeSKo learner corpus] Heike Zinsmeister University of Konstanz Linguistics Department Box 185 78457
Statistical Machine Translation
Statistical Machine Translation What works and what does not Andreas Maletti Universität Stuttgart [email protected] Stuttgart May 14, 2013 Statistical Machine Translation A. Maletti 1 Main
Language Technology II Language-Based Interaction Dialogue design, usability,evaluation. Word-error Rate. Basic Architecture of a Dialog System (3)
Language Technology II Language-Based Interaction Dialogue design, usability,evaluation Manfred Pinkal Ivana Kruijff-Korbayová Course website: www.coli.uni-saarland.de/courses/late2 Basic Architecture
A year abroad. Aufgabenbeispiel für GY/RS. Teil I (Fundamentum) Leseverstehen: Aufgabenbeispiel für Gymnasium/Realschule
Aufgabenbeispiel für GY/RS Textvorlage und Aufgabenapparat Teil I (Fundamentum) Leseverstehen: Aufgabenbeispiel für Gymnasium/Realschule A year abroad 5 10 15 20 25 30 35 The following article from 2005
Successful Collaboration in Agile Software Development Teams
Successful Collaboration in Agile Software Development Teams Martin Kropp, Magdalena Mateescu University of Applied Sciences Northwestern Switzerland School of Engineering & School of Applied Psychology
Satzstellung. Satzstellung Theorie. learning target. rules
Satzstellung learning target Aim of this topic is to explain how to arrange the different parts of a sentence in the correct order. I must admit it took quite a long time to handle this topic and find
Das Verb to be im Präsens
Das Verb to be im Präsens 1. Person Sg. I am ich bin 2. Person Sg./Pl. 1. Person Pl. you we are du / ihr / Sie wir sind 3. Person Pl. they sie (Pl) 3. Person Sg. he she is er sie ist it es Imperativ: Be
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE A.J.P.M.P. Jayaweera #1, N.G.J. Dias *2 # Virtusa Pvt. Ltd. No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka * Department of Statistics
Technical Guidelines. for Power Generating Units. Part 7: Operation and maintenance of power plants for renewable energy Category D3 Attachment A:
Technical Guidelines for Power Generating Units Part 7: Operation and maintenance of power plants for renewable energy Category D3 Attachment A: Global Service Protocol (GSP) - Attachment A: XML schema
AP GERMAN LANGUAGE AND CULTURE EXAM 2015 SCORING GUIDELINES
AP GERMAN LANGUAGE AND CULTURE EXAM 2015 SCORING GUIDELINES Identical to Scoring Guidelines used for French, Italian, and Spanish Language and Culture Exams Interpersonal Writing: E-mail Reply 5: STRONG
Software / FileMaker / Plug-Ins Mailit 6 for FileMaker 10-13
Software / FileMaker / Plug-Ins Mailit 6 for FileMaker 10-13 Seite 1 / 5 Mailit 6 for FileMaker 10-13 The Ultimate Email Plug-In Integrate full email capability into your FileMaker 10-13 solutions with
Declarative Parsing and Annotation of Electronic Dictionaries
Declarative Parsing and Annotation of Electronic Dictionaries Christian Schneiker 1, Dietmar Seipel 1, Werner Wegstein 2, and Klaus Prätor 3 1 Department of Computer Science {schneiker seipel}@informatik.uni-wuerzburg.de
SharePoint Community Tools fürs Web 2.0
SharePoint Convention 2009 SharePoint Community Tools fürs Web 2.0 Michael Greth [email protected] Michael Greth Trainer, Consultant für SharePoint Microsoft MVP für Office SharePoint Server SharePointCommunity.de
TILA Research Results on Telecollaboration 1 Chapter 3
TILA Research Results on Telecollaboration 1 Chapter 3 TELECOLLABORATION FOR INTERCULTURAL FOREIGN LANGUAGE CONVERSATIONS IN SECONDARY SCHOOL CONTEXTS: TASK DESIGN AND PEDAGOGIC IMPLEMENTATION Petra Hoffstaedter
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
Master-Programm Deutsch als Fremdsprache (Master of Arts Program in German as a Foreign Language) an der Ramkhamhaeng Universität/Bangkok
Master-Programm Deutsch als Fremdsprache (Master of Arts Program in German as a Foreign Language) an der Ramkhamhaeng Universität/Bangkok Curriculum 2008 Man kann zwischen zwei Schwerpunkten wählen: Interkulturelle
