Using German corpora for linguistic purposes. Dr. Kathrin Steyer Institut für Deutsche Sprache, Mannheim





Introduction
This talk gives a first impression of the complex field of German corpora and methods of corpus analysis. Before you start working with corpora, be aware of what a method can and cannot accomplish.

Introduction
I often notice that overly complicated methods are used where simply collecting and counting instances would have been enough. Large collections of data and powerful automatic tools sometimes lead to an overvaluation of quantitative data.

Introduction Sometimes, the allure of numbers and frequencies leads to methodological laziness. Even today, the quality of linguistic interpretation is the most important factor regarding the informative value of the analysis. Corpus linguistics has not diminished the importance of the old cultural technique of reading and interpreting texts.

Introduction
Today, I will highlight some ways in which corpora and tools can help us linguists obtain a high-quality prestructuring of data.
This is particularly useful for:
examining high-frequency phenomena that are important for language use
identifying phenomena that are not obvious to us, e.g. hidden structures and patterns

Introduction
The focus is not on corpora or tools that require expert knowledge or have to be downloaded; those are primarily used for automatic natural language processing, e.g. Wortschatz Leipzig, IMS Open Corpus Workbench (Stuttgart) or TIGER (Berlin).
Instead: corpora that are available online and free of charge for the "common linguist".

German Introductions to Corpus Linguistics
Lemnitzer, Lothar/Zinsmeister, Heike (2010): Korpuslinguistik. Eine Einführung. 2., durchgesehene und aktualisierte Aufl. (= Narr Studienbücher). Tübingen.
Perkuhn, Rainer/Keibel, Holger/Kupietz, Marc (2012): Korpuslinguistik. (= UTB 3433). Paderborn.

German Corpus Linguistics Website Noah Bubenhofer (2006-2011): Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge. www.bubenhofer.com/korpuslinguistik/kurs/

A Short History of German Corpora
Institute for German Language (IDS): a pioneer in the German-speaking area since the mid-1960s (!)
Compilation of electronic text databases (-> today: German Reference Corpus DeReKo)
Development of COSMAS I, the first platform for corpus analysis in the German-speaking area (early 1990s to 2003)

A Short History of German Corpora
2000 to 2003: Core corpus of the Digital Dictionary of the 20th Century (Digitales Wörterbuch des 20. Jahrhunderts) at the Berlin-Brandenburgische Akademie der Wissenschaften; sponsored by the Deutsche Forschungsgemeinschaft (DFG)
Since 2009: merged into the C4 corpus: DWDS; Schweizer Textkorpus (Switzerland); Austrian Academic Corpus; Korpus Südtirol (South Tyrol); 80 million word tokens

Overview
1. German specialized corpora: examples
2. German general reference corpora
   1. DWDS
   2. DeReKo
3. Methodological approaches
   1. Consulting the corpus
   2. Analysing the corpus: statistical collocation analysis
4. Corpora and lexical resources

German Specialized Corpora
Spoken language: Database (DGD2), Archive Gesprochenes Deutsch (Spoken German) (IDS); discourse analysis, dialectology
Dortmunder Chatkorpus

German Specialized Corpora
Annotation: e.g. morpho-syntactically annotated corpora; example: TIGER-Korpus (IMS Stuttgart)
Language learning: learner corpora, error-annotated corpora; example: FALKO (HU Berlin)
Literature: Project Gutenberg; about 36,000 free ebooks (online)

Specialized corpora at the IDS
Author corpora: Goethe corpus
Dialects: Zwirner corpus, including a corpus of vernaculars of the former Eastern territories
Genre: parliamentary debates, biographical fiction
Historical period: Wendekorpus (1989/90), about 3.3 million word tokens (articles, leaflets, flyers, parliamentary proceedings, speeches, declarations etc.)
Medium: Wikipedia corpus

German General Reference Corpora

German General Reference Corpora
Not compiled for a specific use or for answering specific research questions
As general as possible in order to be useful for various language studies
DWDS and DeReKo

DWDS corpus: http://www.dwds.de/
In total: 2.5 billion word tokens, of which 1.8 billion are publicly accessible (online and free) (several corpora)
Core corpus: approx. 100 million word tokens
Balanced with respect to time and genre (literature, journalistic prose, scientific texts, specialized texts (adverts, manuals etc.), spoken)
Spans the 20th century
Integrated with the DWDS portal (dictionaries etc.)

The German Reference Corpus (DeReKo) and COSMAS II Institut für Deutsche Sprache, Mannheim (IDS) www.ids-mannheim.de

The German Reference Corpus DeReKo
http://www1.idsmannheim.de/kl/projekte/korpora/
6.1 billion word tokens (status as of 19.03.2013)
Contains written German-language texts of the present and recent past
The largest "primordial sample of contemporary German" worldwide
Online and free, registration required (copyright)
List of corpora

The German Reference Corpus DeReKo
Contains only copyrighted material
Dynamic corpus (continually updated)
Option to create personal subcorpora with COSMAS II, which can be tailored towards specific research questions

The German Reference Corpus at the IDS
With over 5.4 billion words (as of 29.02.2012), the world's largest linguistically motivated collection of electronic corpora of written German-language texts from the present and the recent past
Fiction, scientific and popular-science texts, a large number of newspaper texts, and a wide range of other text types
-> Analysis system COSMAS II

COSMAS II: Corpus Search, Management and Analysis System
Not a web search engine
Language independent
Free online access since 1993
Approx. 30,000 registered users from over 100 countries

Search window KWIC Full text

Result presentation
sources / corpora
chronological
alphabetical (successor/predecessor of the search object)
randomized sorting
text genres
topics
collocations
Export of results
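
To make the KWIC (keyword in context) view concrete, here is a minimal Python sketch. It is not how COSMAS II builds its result lists; the function name `kwic`, the `width` parameter and the sample text are assumptions made only for this illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(keyword) + r"\b", text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that all keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

# Invented sample text, only for demonstration.
sample = "Übung macht den Meister. Ohne Übung geht es nicht."
hits = kwic(sample, "Übung", width=15)
for line in hits:
    print(line)
```

Sorting the returned lines by their right or left context would approximate the alphabetical "successor/predecessor" ordering mentioned on the slide.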

Analytical approaches

Paradigms of corpus analysis
Looking for answers to my questions in the corpus -> validation of a priori knowledge ('consulting')
Finding new research questions in the corpus and interpreting those -> best case: generating new knowledge ('analysing')

Consulting the corpus
Do specific language elements (e.g. morphemes, lexemes, multi-word units) occur at all and, if they do, how often?
Which usage-based aspects of meaning can be identified?
In which situations are they used?
What is the typical base form in the corpus?
Which variations can be found?

Consulting the corpus
Discourse: Globalisierung bedeutet ('globalization means') (Teubert 2006)
Text type (birthday texts; advertisements): geistige Frische
Regional differences: Samstag vs. Sonnabend (Germany, Austria, Switzerland)

Consulting the corpus (corpus-based)
Regional peculiarity: Grumbeere?
auf freiem Fuß anzeigen?
Spelling: Blind date or Blind Date or?
Discourse: Besserwessi

Consulting the corpus
Samstag is used in all German-speaking areas
Sonnabend is used almost exclusively in Germany
Chronological (e.g. new lexemes, multi-word units): voll krass

Search strategies - Example
Exclusionary searches
Excluding hits that are not relevant
Verifying stability and variance
S: Übung macht ART WITHOUT Meister
Query: ((&Übung /+w1 &machen) /+w1 (den ODER die ODER das)) %s0 &Meister
S: macht den Meister WITHOUT Übung
Query: (&machen /+w2 &Meister) %s0 &Übung
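
The logic of the two WITHOUT searches can be imitated on a small list of sentences. The following Python sketch is only an approximation under stated assumptions: it matches surface forms with regular expressions instead of lemmas, it treats each list item as one sentence, and the names `exclusionary_search`, `UEBUNG_MACHT_ART` and `MACHT_MEISTER` are hypothetical. It does not reproduce COSMAS II itself.

```python
import re

# "Übung macht" directly followed by an article (surface forms only, no lemmatisation).
UEBUNG_MACHT_ART = re.compile(r"\bÜbung\s+macht\s+(?:den|die|das)\b")
# "macht" followed by "Meister" with at most one word in between.
MACHT_MEISTER = re.compile(r"\bmacht\s+(?:\w+\s+)?Meister\b")
MEISTER = re.compile(r"\bMeister\b")
UEBUNG = re.compile(r"\bÜbung\b")

def exclusionary_search(sentences, include, exclude):
    """Keep sentences that match `include` but contain no match for `exclude`."""
    return [s for s in sentences if include.search(s) and not exclude.search(s)]

# Toy sentences taken from the pattern examples in this talk.
sentences = [
    "Übung macht den Meister.",
    "Übung macht den Feuerwehrmann.",
    "Übung macht die Meisterin.",
    "Technik macht den Meister.",
    "Die Praxis macht den Meister zu Schulbeginn.",
]

variants = exclusionary_search(sentences, UEBUNG_MACHT_ART, MEISTER)
fillers = exclusionary_search(sentences, MACHT_MEISTER, UEBUNG)
print(variants)
print(fillers)
```

Excluding the canonical form in this way leaves only the variants, which is what makes the stability and variance of the proverb visible.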

Patterns: Übung macht den X
M11  Übung macht den Kegelmeister
M99  Übung macht den Handball-Meister
M99  Übung macht auch hier den Zaubermeister.
RHZ11  Übung macht die Meisterin
A97  Übung macht Radioprediger
A00  Übung macht den Schützen
A09  Übung macht den Feuerwehrmann
F99  Übung macht den Gourmet

Patterns: X macht den Meister
B06  Technik macht den Meister Tipps für Anfänger
B07  Energie macht den Meister
B07  Vorsicht macht den Meister.
BVZ07  Die Praxis macht den Meister zu Schulbeginn
E99  Doch erst Playoff macht den Meister.
M00  Ob Profi oder Schnuppersportler - Training macht den Meister.
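
Once such hits have been exported, the fillers of the open slot X can be collected and counted automatically. The sketch below shows one possible way to do this in Python; the regular expressions and the function name `slot_fillers` are assumptions for illustration, and variants without an article (such as "Übung macht Radioprediger") are deliberately not covered.

```python
import re
from collections import Counter

# "Übung macht", up to two intervening words, an article, then the slot filler.
X_AFTER = re.compile(r"Übung macht (?:\w+ ){0,2}?(?:den|die|das) ([\w-]+)")
# Slot filler directly before "macht den Meister".
X_BEFORE = re.compile(r"([\w-]+) macht den Meister")

def slot_fillers(lines, pattern):
    """Count the words captured in the open slot of `pattern`."""
    counts = Counter()
    for line in lines:
        for m in pattern.finditer(line):
            counts[m.group(1)] += 1
    return counts

after_lines = [
    "Übung macht den Kegelmeister",
    "Übung macht den Handball-Meister",
    "Übung macht auch hier den Zaubermeister.",
    "Übung macht die Meisterin",
    "Übung macht den Gourmet",
]
before_lines = [
    "Technik macht den Meister Tipps für Anfänger",
    "Die Praxis macht den Meister zu Schulbeginn",
    "Doch erst Playoff macht den Meister.",
    "Ob Profi oder Schnuppersportler - Training macht den Meister.",
]

after = slot_fillers(after_lines, X_AFTER)
before = slot_fillers(before_lines, X_BEFORE)
print(sorted(after))
print(sorted(before))
```

The counts only prestructure the data; deciding which fillers form a productive pattern remains a matter of linguistic interpretation.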

Other phenomena: Word formation
Productivity in word formation: *mentalität
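
A wildcard search such as *mentalität returns all word forms ending in that element. As a rough sketch of what can be done with such a result list, the Python code below counts types, tokens and hapax legomena (types that occur only once). The token list is an invented toy sample, and the function name `suffix_types` is hypothetical.

```python
import re
from collections import Counter

def suffix_types(tokens, suffix):
    """Count word forms that end in `suffix` and have at least one character before it."""
    pattern = re.compile(r"\w[\w-]*" + re.escape(suffix), re.IGNORECASE)
    return Counter(t for t in tokens if pattern.fullmatch(t))

# Invented toy token list, not corpus data.
tokens = [
    "Vollkaskomentalität", "Wegwerfmentalität", "Mentalität",
    "Wegwerfmentalität", "Selbstbedienungsmentalität", "Haus",
]

counts = suffix_types(tokens, "mentalität")
n_types = len(counts)
n_tokens = sum(counts.values())
hapaxes = sorted(w for w, c in counts.items() if c == 1)
print(n_types, n_tokens, hapaxes)
```

A high share of hapax legomena among the tokens is often read as a rough indicator of productivity, but that reading is an interpretive step outside the code.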

Other phenomena: Grammar
Search in a morpho-syntactically annotated corpus
Relatively small in comparison with the whole corpus archive
Adjectives - Kopf (in a subcorpus)
All dative nouns followed by a dative relative pronoun within a span of at most three tokens
Query: MORPH(NOU dat) /+w3 MORPH(PRN rel dat)
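
What the MORPH query asks for can be sketched over a small tagged token list. The following Python code is an illustration only: the `Token` structure, the tag names and the decision to count punctuation as a token are all assumptions, not a description of the annotation used in the IDS corpora.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str
    pos: str          # hypothetical tag names such as "NOU" or "PRN"
    feats: frozenset  # hypothetical features such as {"dat"} or {"rel", "dat"}

def find_pairs(tokens, max_dist=3):
    """Find dative nouns followed by a dative relative pronoun within `max_dist` tokens."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.pos == "NOU" and "dat" in tok.feats:
            for j in range(i + 1, min(i + 1 + max_dist, len(tokens))):
                cand = tokens[j]
                if cand.pos == "PRN" and {"rel", "dat"} <= cand.feats:
                    hits.append((tok.form, cand.form, j - i))
                    break  # report only the nearest pronoun per noun
    return hits

# Hand-tagged toy sentence: "mit dem Mann, dem ich half"
sentence = [
    Token("mit", "PRP", frozenset()),
    Token("dem", "ART", frozenset({"dat"})),
    Token("Mann", "NOU", frozenset({"dat"})),
    Token(",", "PUNCT", frozenset()),
    Token("dem", "PRN", frozenset({"rel", "dat"})),
    Token("ich", "PRN", frozenset({"pers", "nom"})),
    Token("half", "VRB", frozenset()),
]

pairs = find_pairs(sentence)
print(pairs)
```

The search has no lexical anchor at all, which is why it depends entirely on the annotation layer.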

Grammar Phenomena
A plea for searching non-annotated corpora, even for grammatical research questions
Completely abstract constructions are not searchable; a lexical anchor is necessary
BUT: larger corpus size can lead to surprising results
Example: all when without comma

Drowning in a flood of mass data?
BUT: the bigger the data set, the more overwhelming it is for humans
Example: Kopf

Collocation analysis at the IDS
Cyril Belica: Statistische Kollokationsanalyse und Clustering. Korpuslinguistische Analysemethode. 1995 Institut für Deutsche Sprache, Mannheim.
Tutorial 2004: Short introduction to collocation analysis http://www.ids-mannheim.de/kl/misc/tutorial.html
Cp. Perkuhn/Keibel/Kupietz (2012)

Part 2: Practical exercises

Collocation analysis at the IDS
Focusses on lexical cooccurrences
Dynamically computed on the latest version of the corpus
Flexible adjustment of parameters (e.g. span and position, granularity, function words y/n)
Computes not only word-collocate pairs, but also hierarchical clusters and common syntagmatic patterns
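
The basic idea behind a statistical collocation analysis can be sketched in a few lines of Python. This is not the IDS implementation: the slides do not name the association measure, so the log-likelihood ratio is used here as one common choice, and the function names, the `span` parameter and the toy token list are assumptions for illustration. Clustering and syntagmatic patterns are not covered.

```python
import math
from collections import Counter

def llr(a, b, c, d):
    """Log-likelihood ratio (G2) for a 2x2 contingency table."""
    n = a + b + c + d
    total = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        if obs > 0:
            total += obs * math.log(obs * n / (row * col))
    return 2.0 * total

def collocates(tokens, node, span=2):
    """Rank words that are over-represented within `span` tokens of `node`."""
    node_pos = {i for i, t in enumerate(tokens) if t == node}
    window = set()
    for i in node_pos:
        for j in range(max(0, i - span), min(len(tokens), i + span + 1)):
            if j not in node_pos:
                window.add(j)  # a set avoids double counting in overlapping windows
    inside = Counter(tokens[j] for j in window)
    overall = Counter(t for i, t in enumerate(tokens) if i not in node_pos)
    n, w = sum(overall.values()), len(window)
    scores = {}
    for word, a in inside.items():
        c = overall[word] - a          # occurrences outside the windows
        if a * n > overall[word] * w:  # keep only over-represented words
            scores[word] = llr(a, w - a, c, n - w - c)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy text, not corpus data.
tokens = ("Hals über Kopf rannte er weg . er schüttelte den Kopf . "
          "sie rannte Hals über Kopf davon . er ging weg .").split()
ranked = collocates(tokens, "Kopf", span=2)
print(ranked[:3])
```

The `span` argument corresponds to the adjustable context window mentioned above; on a real corpus the ranking is only the starting point for reading the underlying KWIC lines.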

Collocation cluster CA for Kopf

Interpreting Clusters
Collocation clusters are only indicators for the contexts on which they are based
Syntagmatic perspective is most important
KWIC cluster
Full text cluster

Collocation analysis at the IDS Collocations Phrasemes fixed syntagmatic structures fixed context patterns (access to meaning and common usage)

You shall know a word by the company it keeps (Firth 1957)

Usage clusters: semantic
'injury by external force': Kugel / gegen die Wand stoßen/geschlagen / Platzwunde am Kopf / verletzt / geschossen / an die Bande prallen / abgeschlagenen / Brustverletzung / Beule
'body part': Hals / Nacken / Bauch / Oberkörper / Arme
'symptoms of illness': Gliederschmerzen / heiß

Usage clusters: phrasemes
'emotional state': mit hängenden Köpfen ('dejected') / mit kühlem Kopf ('level-headed') / mit hochrotem Kopf ('angry', 'embarrassed') / mit gesenktem Kopf ('abashed')

Collocations: collocation patterns
Mutual lexical fixedness: Hals über Kopf ('rushed') (*X über Kopf; *Hals über X)
Semantically restricted usage: mit hochrotem Kopf; CA hochrot -> hochrot only with body parts (prototypical: Kopf)
Productive collocation patterns: strategischer Kopf / führende / kreative / beste Köpfe ('leader', 'mastermind')

Context patterns
Pragmatic
Orality / colloquial speech in the corpus: voll krass
Usage of word classes, formulae, particles, sentence adverbs etc. Example: ernsthaft
Discourse: Globalisierung

German collocation resources
Pro: fast access
Contra: no dynamic customization possible
DWDS word profiles
Collocations in Wortschatz Leipzig
IDS Collocation Database CCDB (Kookkurrenzdatenbank)
Pre-analysed profiles of 220,000 lemmas + KWIC
Semantic proximity by comparing CA profiles (e.g. anscheinend vs. scheinbar)
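
Comparing two collocation profiles can be pictured as comparing two weighted word lists. The slides do not say how the CCDB measures proximity, so the Python sketch below uses cosine similarity purely as an assumed stand-in; the profile contents are invented placeholder values, not data from any real word.

```python
import math

def cosine(p, q):
    """Cosine similarity between two profiles given as {collocate: score} dicts."""
    shared = set(p) & set(q)
    dot = sum(p[k] * q[k] for k in shared)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Invented placeholder profiles; "c1" to "c5" stand for arbitrary collocates.
profile_a = {"c1": 4.0, "c2": 2.0, "c3": 1.0}
profile_b = {"c1": 3.0, "c2": 1.0, "c4": 2.0}
profile_c = {"c5": 5.0}

sim_ab = cosine(profile_a, profile_b)
sim_ac = cosine(profile_a, profile_c)
print(round(sim_ab, 3), sim_ac)
```

Two words with largely overlapping collocates score close to 1, while words without shared collocates score 0, which is the intuition behind reading profile similarity as semantic proximity.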

Collocation analysis, i.e. clustering typical contexts of usage, is an analytical approach that is central to all kinds of linguistic research questions if you are interested in "language in use" (this can also be "syntax in use").

IDS linguistic applications
Corpus-based grammar (grammis)
Lexicon-grammar interface: valency, argument structure and construction grammar (DeReKo, IMS Workbench, other)
Spoken language: "Variation des gesprochenen Deutsch: Standardsprache Alltagssprache"

IDS linguistic applications of CA
Corpus-based and corpus-driven lexicology and lexicography
OWID (e.g. elexiko; dictionary of modern German proverbs)
Multilingual Proverb-Online-Platform
Fields of lexical patterns and phraseme constructions
-> Qualitative linguistic interpretation of collocation and syntagmatic profiles

Outlook

Integrative Platforms
Authentic corpus data
Qualitative descriptions
Lexical resources (e.g. collocation profiles and networks)
Web (DWDS; OWID)

KorAP
KorAP: the next-generation corpus analysis platform of the Institute for German Language
Replaces COSMAS II (but features will be reproduced)
Extends the possibilities of individual corpus design (e.g. by topic, by text type)
Several levels of linguistic annotation
Basic and extended search functionality; faster

Thank you for your attention!