Using German corpora for linguistic purposes Dr. Kathrin Steyer Institut für Deutsche Sprache, Mannheim
Introduction This talk will give a first impression of the complex field of German corpora and methods of corpus analysis. Before starting your work with corpora, be aware of what a method can and cannot accomplish.
Introduction Often I notice that overly complicated methods are used where simply collecting and counting instances would have been enough. Large collections of data and powerful automatic tools sometimes lead to an overvaluation of quantitative data.
Introduction Sometimes, the allure of numbers and frequencies leads to methodological laziness. Even today, the quality of linguistic interpretation is the most important factor regarding the informative value of the analysis. Corpus linguistics has not diminished the importance of the old cultural technique of reading and interpreting texts.
Introduction Today, I will highlight some ways in which corpora and tools can help us linguists obtain a high-quality prestructuring of data. This is particularly useful for examining high-frequency phenomena that are important for language use, and for identifying phenomena that are not obvious to us, e.g. hidden structures and patterns.
Introduction The focus is not on corpora or tools that need expert knowledge or have to be downloaded; those are primarily used for automatic natural language processing, e.g. Wortschatz Leipzig, the IMS Open Corpus Workbench (Stuttgart) or TIGER (Berlin). Instead: corpora that are available online and free of charge for the "common linguist".
German Introductions to Corpus Linguistics Lemnitzer, Lothar/Zinsmeister, Heike (2010): Korpuslinguistik. Eine Einführung. 2., durchgesehene und aktualisierte Aufl. (= Narr Studienbücher). Tübingen Perkuhn, Rainer/Keibel, Holger/Kupietz, Marc (2012): Korpuslinguistik. (=UTB 3433) Paderborn.
German Corpus Linguistics Website Noah Bubenhofer (2006-2011): Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge. www.bubenhofer.com/korpuslinguistik/kurs/
A Short History of German Corpora Institute for German Language (Institut für Deutsche Sprache, IDS): a pioneer in the German-speaking area since the mid-1960s (!). Compilation of electronic text databases (-> today: German Reference Corpus DeReKo). Development of COSMAS I, the first platform for corpus analysis in the German-speaking area (early 1990s to 2003).
A Short History of German Corpora 2000 to 2003: core corpus of the Digital Dictionary of the 20th century (Digitales Wörterbuch des 20. Jahrhunderts) at the Berlin-Brandenburgische Akademie der Wissenschaften, sponsored by the Deutsche Forschungsgemeinschaft (DFG). Since 2009 merged into the C4 corpus: DWDS; Schweizer Textkorpus (Switzerland); Austrian Academic Corpus; Korpus Südtirol (South Tyrol); 80 million word tokens.
Overview
1. German specialized corpora: examples
2. German general reference corpora
   1. DWDS
   2. DeReKo
3. Methodological approaches
   1. Consulting the corpus
   2. Analysing the corpus: statistical collocation analysis
4. Corpora and lexical resources
German Specialized Corpora Spoken language: database DGD2 of the Archiv für Gesprochenes Deutsch (Archive for Spoken German, IDS); discourse analysis, dialectology; Dortmunder Chatkorpus
German Specialized Corpora Annotation: e.g. morpho-syntactically annotated corpora; example: TIGER-Korpus (IMS Stuttgart). Language learning: learner corpora, error-annotated corpora; example: FALKO (HU Berlin). Literature: Project Gutenberg; about 36,000 free ebooks (online).
Specialized corpora at the IDS Author corpora: Goethe corpus. Dialects: Zwirner corpus, including a corpus of vernaculars of the former Eastern territories. Genre: parliamentary debates, biographical fiction. Historical period: Wendekorpus (1989/90), about 3.3 million word tokens (articles, leaflets, flyers, parliamentary proceedings, speeches, declarations etc.). Medium: Wikipedia corpus.
German General Reference Corpora
German General Reference Corpora Not compiled for a specific use or for answering specific research questions As general as possible in order to be useful for various language studies DWDS and DeReKo
DWDS corpus: http://www.dwds.de/ In total 2.5 billion word tokens, of which 1.8 billion are publicly accessible (online and free; several corpora). Core corpus: approx. 100 million word tokens. Balanced with respect to time and genre (literature, journalistic prose, scientific texts, specialized texts (adverts, manuals etc.), spoken). Spans the 20th century. Integrated with the DWDS portal (dictionaries etc.).
The German Reference Corpus (DeReKo) and COSMAS II Institut für Deutsche Sprache, Mannheim (IDS) www.ids-mannheim.de
The German Reference Corpus DeReKo http://www1.ids-mannheim.de/kl/projekte/korpora/ 6.1 billion word tokens (as of 19 March 2013). Contains written German texts of the present and the recent past. The largest "primordial sample of contemporary German" worldwide. Online and free, registration required (copyright). List of corpora
The German Reference Corpus DeReKo Contains only copyrighted material Dynamic corpus (continually updated) Option to create personal subcorpora with COSMAS II which can be tailored towards specific research questions
The German Reference Corpus at the IDS With over 5.4 billion words (as of 29 February 2012), the world's largest linguistically motivated collection of electronic corpora of written German texts from the present and the recent past. Fiction, scientific and popular-science texts, a large number of newspaper texts and a broad range of other text types. -> Analysis system COSMAS II
COSMAS II Corpus Search, Management and Analysis System. Not a web search engine. Language independent. Free online access since 1993. Approx. 30,000 registered users from over 100 countries.
Search window KWIC Full text
Result presentation By sources / corpora; chronological; alphabetical (successor/predecessor of the search object); randomized sorting; text genres; topics; collocations; export of results
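A KWIC (keyword in context) view with alphabetical sorting by the successor of the search object can be sketched as follows. This is only an illustration on invented sentences; the function names `kwic` and `sort_by_successor` are hypothetical and do not reflect how COSMAS II is implemented.

```python
import re

def kwic(texts, keyword, width=3):
    """Return (left, hit, right) context triples for each keyword match."""
    rows = []
    for text in texts:
        tokens = re.findall(r"\w+", text)
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, tok, right))
    return rows

def sort_by_successor(rows):
    """Sort alphabetically by the right-hand context (the successor)."""
    return sorted(rows, key=lambda row: row[2].lower())

# Invented sample sentences, not corpus data.
TEXTS = [
    "Er schüttelte den Kopf und ging.",
    "Sie behielt einen kühlen Kopf bei der Prüfung.",
    "Der Kopf der Bande wurde gefasst.",
]

for left, hit, right in sort_by_successor(kwic(TEXTS, "Kopf")):
    print(f"{left:>25} [{hit}] {right}")
```

Sorting by the predecessor instead would only require a different sort key (the last word of the left context).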
Analytical approaches
Paradigms of corpus analysis Looking for answers to my questions in the corpus -> validation of a priori knowledge ('consulting') Finding new research questions in the corpus and interpreting those -> best case: generating new knowledge ('analysing')
Consulting the corpus Do specific language elements (e.g. morphemes, lexemes, multi-word units) occur at all and if they do, how often? Which usage based aspects of meaning can be identified? In which situations are they used? What is the typical base form in the corpus? Which variations can be found?
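The first question (does an element occur, and how often, in which variants?) often needs nothing more than collecting and counting. A minimal sketch, assuming a hand-written list of inflected forms and invented sentences; a real corpus tool would use lemmatization instead of the hypothetical `FORMS` list.

```python
import re
from collections import Counter

def count_forms(texts, forms):
    """Count how often each listed word form occurs (case-sensitive)."""
    wanted = set(forms)
    counts = Counter()
    for text in texts:
        for token in re.findall(r"\w+", text):
            if token in wanted:
                counts[token] += 1
    return counts

# Invented sample sentences, not corpus data.
TEXTS = [
    "Übung macht den Meister.",
    "Die Meister ihres Fachs trafen sich.",
    "Er sprach mit dem Meister über die Arbeit des Meisters.",
]
# Hand-listed forms of the lexeme 'Meister' (an assumption for illustration).
FORMS = ["Meister", "Meisters", "Meistern"]

counts = count_forms(TEXTS, FORMS)
for form in FORMS:
    print(form, counts[form])
```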
Consulting the corpus Discourse: Globalisierung bedeutet ('globalization means') (Teubert 2006). Text type (birthday texts; advertisements): geistige Frische. Regional differences: Samstag vs. Sonnabend (Germany, Austria, Switzerland).
Consulting the corpus (corpus-based) Regional peculiarity: Grumbeere? auf freiem Fuß anzeigen? Spelling? Blind date or Blind Date or? Discourse? Besserwessi
Consulting the corpus Samstag is used in all German-speaking areas; Sonnabend is used almost exclusively in Germany. Chronological (e.g. new lexemes, multi-word units): voll krass
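A regional comparison of this kind comes down to relative frequencies per subcorpus, because the subcorpora differ in size. The sketch below assumes documents labelled with a region code; the sentences and the helper `relative_frequencies` are invented for illustration and are not the data behind the statement above.

```python
import re
from collections import Counter, defaultdict

def relative_frequencies(docs, words):
    """Occurrences of each word per 1,000 tokens, grouped by region label."""
    hits = defaultdict(Counter)
    totals = Counter()
    for region, text in docs:
        tokens = re.findall(r"\w+", text)
        totals[region] += len(tokens)
        for token in tokens:
            if token in words:
                hits[region][token] += 1
    return {
        region: {w: 1000 * hits[region][w] / totals[region] for w in words}
        for region in totals
        if totals[region] > 0
    }

# Invented toy documents with hypothetical region labels.
DOCS = [
    ("DE", "Am Sonnabend ist Markt und am Samstag auch"),
    ("AT", "Am Samstag ist Markt"),
    ("CH", "Der Markt ist am Samstag offen"),
]

freqs = relative_frequencies(DOCS, ["Samstag", "Sonnabend"])
for region, values in freqs.items():
    print(region, values)
```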
Search strategies - Example
Exclusionary searches: excluding hits that are not relevant; verifying stability and variance
S: Übung macht ART WITHOUT Meister
Query: ((&Übung /+w1 &machen) /+w1 (den ODER die ODER das)) %s0 &Meister
S: macht den Meister WITHOUT Übung
Query: (&machen /+w2 &Meister) %s0 &Übung
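The logic of an exclusionary search (keep a hit only if the excluded element is absent from the same sentence) can be emulated outside the query language. The following sketch uses regular expressions on invented sentences; it matches exact word forms rather than lemmas, so it only approximates the first query above.

```python
import re

# "Übung macht" followed by a definite article.
INCLUDE = re.compile(r"\bÜbung macht (?:den|die|das)\b")
# The word form to exclude from the same sentence.
EXCLUDE = re.compile(r"\bMeister\b")

def search_without(sentences, include, exclude):
    """Keep sentences that match `include` but contain no match for `exclude`."""
    return [s for s in sentences
            if include.search(s) and not exclude.search(s)]

# Invented sample sentences, not corpus data.
SENTENCES = [
    "Übung macht den Meister.",
    "Übung macht den Feuerwehrmann.",
    "Übung macht die Meisterin.",
    "Training macht den Meister.",
]

for sentence in search_without(SENTENCES, INCLUDE, EXCLUDE):
    print(sentence)
```

Because the exclusion pattern is anchored at word boundaries, "Meisterin" is not excluded and remains among the hits as a variant.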
Patterns: Übung macht den X
M11 Übung macht den Kegelmeister
M99 Übung macht den Handball-Meister
M99 Übung macht auch hier den Zaubermeister.
RHZ11 Übung macht die Meisterin
A97 Übung macht Radioprediger
A00 Übung macht den Schützen
A09 Übung macht den Feuerwehrmann
F99 Übung macht den Gourmet
Patterns: X macht den Meister
B06 Technik macht den Meister Tipps für Anfänger
B07 Energie macht den Meister
B07 Vorsicht macht den Meister.
BVZ07 Die Praxis macht den Meister zu Schulbeginn
E99 Doch erst Playoff macht den Meister.
M00 Ob Profi oder Schnuppersportler - Training macht den Meister.
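Once such hits are exported, the fillers of the variable slot X can be collected automatically. The sketch below applies two regular expressions to some of the example lines from the two pattern slides; the patterns and the helper `slot_fillers` are assumptions for illustration, and variants without an article (such as "Übung macht Radioprediger") are deliberately not covered.

```python
import re
from collections import Counter

# "Übung macht (up to two intervening words) ART X": capture the slot filler X.
SLOT_AFTER = re.compile(r"\bÜbung macht (?:\w+ ){0,2}?(?:den|die|das) ([\w-]+)")
# "X macht den Meister": capture the word directly before "macht".
SLOT_BEFORE = re.compile(r"([\w-]+) macht den Meister\b")

def slot_fillers(lines, pattern):
    """Count the strings captured by group 1 of `pattern` across all lines."""
    counts = Counter()
    for line in lines:
        for match in pattern.finditer(line):
            counts[match.group(1)] += 1
    return counts

AFTER_LINES = [
    "Übung macht den Kegelmeister",
    "Übung macht den Handball-Meister",
    "Übung macht auch hier den Zaubermeister.",
    "Übung macht die Meisterin",
    "Übung macht den Schützen",
]
BEFORE_LINES = [
    "Technik macht den Meister",
    "Energie macht den Meister",
    "Vorsicht macht den Meister.",
    "Die Praxis macht den Meister zu Schulbeginn",
    "Ob Profi oder Schnuppersportler - Training macht den Meister.",
]

print(list(slot_fillers(AFTER_LINES, SLOT_AFTER)))
print(list(slot_fillers(BEFORE_LINES, SLOT_BEFORE)))
```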
Other phenomena: Word formation Productivity in word formation *mentalität
Other phenomena: Grammar Search in a morpho-syntactically annotated corpus Relatively small in comparison with the whole corpus archive Adjektive - Kopf (in a subcorpus) All dative nouns followed by a dative relative pronoun within a span of three tokens maximum Query: MORPH(NOU dat) /+w3 MORPH(PRN rel dat)
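What the distance query above does can be sketched on a hand-annotated token list: find a token matching the first condition, then look for a token matching the second condition within the next three tokens. The tag labels follow the query on the slide; the `Token` structure, the annotation of the toy phrase and the decision to count punctuation as a token are assumptions.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str
    pos: str             # e.g. "NOU", "PRN" (labels as in the slide's query)
    features: frozenset  # e.g. {"dat"} or {"rel", "dat"}

def find_within_span(tokens, first, second, max_distance=3):
    """Yield (i, j) where tokens[i] satisfies `first` and tokens[j] satisfies
    `second`, with j following i by at most `max_distance` tokens."""
    for i, tok in enumerate(tokens):
        if not first(tok):
            continue
        for j in range(i + 1, min(i + 1 + max_distance, len(tokens))):
            if second(tokens[j]):
                yield i, j

def is_dative_noun(tok):
    return tok.pos == "NOU" and "dat" in tok.features

def is_dative_relative_pronoun(tok):
    return tok.pos == "PRN" and {"rel", "dat"} <= tok.features

# Hand-annotated toy phrase: "mit dem Freund, dem ich vertraue".
PHRASE = [
    Token("mit", "PRP", frozenset()),
    Token("dem", "ART", frozenset({"dat"})),
    Token("Freund", "NOU", frozenset({"dat"})),
    Token(",", "PUN", frozenset()),
    Token("dem", "PRN", frozenset({"rel", "dat"})),
    Token("ich", "PRN", frozenset({"pers", "nom"})),
    Token("vertraue", "VRB", frozenset()),
]

for i, j in find_within_span(PHRASE, is_dative_noun, is_dative_relative_pronoun):
    print(PHRASE[i].form, "...", PHRASE[j].form)
```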
Grammar Phenomena Plea for search in non-annotated corpora, even for grammatical research questions Completely abstract constructions not searchable, lexical anchor necessary BUT: Larger corpus size can lead to surprising results Example: all when without comma
Drowning in a flood of mass data? BUT The bigger the data set, the more overwhelming for humans Example Kopf
Collocation analysis at the IDS Cyril Belica: Statistische Kollokationsanalyse und Clustering. Korpuslinguistische Analysemethode. 1995 Institut für Deutsche Sprache, Mannheim. Tutorial 2004: Short introduction to collocation analysis http://www.ids-mannheim.de/kl/misc/tutorial.html Cf. Perkuhn/Keibel/Kupietz (2012)
Part 2 Practical exercises
Collocation analysis at the IDS Focuses on lexical co-occurrences. Dynamically computed on the latest version of the corpus. Flexible adjustment of parameters (e.g. span and position, granularity, function words y/n). Computes not only word-collocate pairs, but also hierarchical clusters and common syntagmatic patterns.
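To make the idea of a statistical collocation analysis with an adjustable span concrete, here is a minimal sketch. It uses the log-likelihood ratio as one common association measure; the slide does not state which measure the IDS tool uses, so this choice, the function names and the toy token sequence are all assumptions, and clustering is not covered.

```python
import math
from collections import Counter

def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2

def collocates(tokens, node, left=2, right=2):
    """Rank words that are over-represented within the span around `node`."""
    window = set()
    for i, tok in enumerate(tokens):
        if tok == node:
            start, end = max(0, i - left), min(len(tokens), i + right + 1)
            window.update(p for p in range(start, end) if tokens[p] != node)
    inside = Counter(tokens[p] for p in window)
    outside = Counter(t for p, t in enumerate(tokens)
                      if p not in window and t != node)
    n_in, n_out = sum(inside.values()), sum(outside.values())
    scores = {}
    for word, k11 in inside.items():
        k21 = outside[word]
        # Keep only words that are relatively more frequent near the node.
        if k11 * n_out > k21 * n_in:
            scores[word] = log_likelihood(k11, n_in - k11, k21, n_out - k21)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented toy token sequence, not corpus data.
TOKENS = ("hals über kopf rannte er los . er schüttelte den kopf . "
          "sie rannte hals über kopf davon . der hund rannte los .").split()

for word, score in collocates(TOKENS, "kopf", left=2, right=0):
    print(f"{word:12} {score:.2f}")
```

Changing `left` and `right` corresponds to adjusting span and position; on real data, the ranking is only a starting point for reading the underlying KWIC lines.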
Collocation cluster CA for Kopf
Interpreting Clusters Collocation clusters are only indicators for the contexts on which they are based Syntagmatic perspective is most important KWIC cluster Full text cluster
Collocation analysis at the IDS Collocations Phrasemes fixed syntagmatic structures fixed context patterns (access to meaning and common usage)
You shall know a word by the company it keeps (Firth 1957)
Usage clusters: semantic 'injury by external force' Kugel / gegen die Wand stoßen/geschlagen / Platzwunde am Kopf / verletzt / geschossen / an die Bande prallen / abgeschlagenen / Brustverletzung / Beule 'body part' Hals / Nacken / Bauch / Oberkörper / Arme 'symptoms of illness' Gliederschmerzen heiß
Usage clusters: phrasemes 'emotional state' mit hängenden Köpfen ('dejected') / mit kühlem Kopf ('level-headed') / mit hochrotem Kopf ('angry', 'embarrassed') / mit gesenktem Kopf ('abashed')
Collocations / collocation patterns Mutual lexical fixedness: Hals über Kopf ('rushed') (*X über Kopf; *Hals über X). Semantically restricted usage: mit hochrotem Kopf; CA hochrot -> hochrot only with body parts (prototypical: Kopf). Productive collocation patterns: strategischer Kopf / führende / kreative / beste Köpfe ('leader', 'mastermind')
Context patterns Pragmatic Orality / colloquial speech in the corpus voll krass Usage of word classes, formulae, particles, sentence adverbs etc. Example: ernsthaft Discourse: Globalisierung
German collocation resources Pro: fast access. Contra: no dynamic customization possible. DWDS word profiles. Collocations in Wortschatz Leipzig. IDS Collocation Database CCDB (Kookkurrenzdatenbank): pre-analysed profiles of 220,000 lemmas + KWIC; semantic proximity by comparing CA profiles (e.g. anscheinend vs. scheinbar)
Collocation analysis Clustering typical contexts of usage is an analytical approach that is central for all kinds of linguistic research questions, if you are interested in "language in use" (this can also be "syntax in use")
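Comparing two collocation profiles can be illustrated with cosine similarity over collocate weights. This is a sketch of the general idea only: the slide does not say which measure the CCDB uses, and the profiles below contain invented collocates and weights, not CCDB data.

```python
import math

def cosine(profile_a, profile_b):
    """Cosine similarity of two {collocate: weight} profiles."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[w] * profile_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical profiles with made-up weights, for illustration only.
PROFILE_ANSCHEINEND = {"offenbar": 3.0, "wohl": 4.0}
PROFILE_SCHEINBAR = {"wohl": 4.0, "mühelos": 3.0}

print(round(cosine(PROFILE_ANSCHEINEND, PROFILE_SCHEINBAR), 2))
```

A value near 1 would indicate largely overlapping contexts of usage, a value near 0 hardly any overlap.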
IDS linguistic applications Corpus-based grammar (grammis). Lexicon-grammar interface: valency, argument structure and construction grammar; DeReKo, IMS Workbench, other. Spoken language: Variation des gesprochenen Deutsch: Standardsprache, Alltagssprache
IDS linguistic applications of CA Corpus-based and corpus-driven lexicology and lexicography: OWID (e.g. elexiko; dictionary of modern German proverbs). Multilingual Proverb-Online-Platform. Fields of lexical patterns and phraseme constructions -> qualitative linguistic interpretation of collocation and syntagmatic profiles
Outlook
Integrative Platforms Authentic corpus data Qualitative Descriptions Lexical resources (e.g. collocation profiles and networks) Web (DWDS; OWID)
KorAP KorAP: the next-generation corpus analysis platform of the Institute for German Language. Replaces COSMAS II (but features will be reproduced). Extends the possibilities of individual corpus design (e.g. by topic, by text type). Several levels of linguistic annotation. Basic and extended search functionality; faster.
Thank you for your attention!