Content
1. Empirical linguistics
2. Text corpora and corpus linguistics
3. Concordances
   3.1 Corpus analysis software I: AntConc
   3.2 KWICs and concordances
   3.3 Corpus analysis software II: COSMAS II
4. Application I: The German progressive
5. Part-of-speech tagging
6. Frequency analysis
7. Application II: Compounds
8. Co-occurrence analysis
9. Application III: Word senses in lexicography
10. Keyword analysis
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Slide 1]

3.1 Corpus analysis software I: AntConc
AntConc
Developer: Laurence Anthony, Faculty of Science and Engineering, Waseda University, Japan.
Version: 3.2.1w (Windows), released March 10th, 2007.
Search: offline.
Software: installed on a local computer.
Access: free download.
Corpora: own (txt files).
Languages: all (Unicode), e.g., German, English, Romanian, Mongolian.
URL: http://www.antlab.sci.waseda.ac.jp/antconc_index.html.
[Slide 2]

3.1 Software I: AntConc
Functions shown: co-occurrence analysis, frequencies / word list, keyword analysis, cluster analysis, concordances
[Slide 3]

3.1 Software I: AntConc
Can be recommended for smaller corpora (up to 20 million running words).
Strengths: sorted concordances, word lists, cluster analyses, keyword analyses.
Less useful for co-occurrence analyses (too slow; larger corpora are needed).
[Slide 4]

3.2 KWICs and concordances
Concordances
Concordance: A concordance is a collection of cotexts of a particular key word. Cotexts of a specified length (in letters, words, or sentences) around a key word are extracted from a corpus and arranged with the key word in the center.
Lemnitzer, Lothar and Heike Zinsmeister. Korpuslinguistik. Eine Einführung. Tübingen: Narr, 2006, pp. 196f.
KWIC: A KWIC ("key word in context") is a single cotext of a particular key word.
[Slide 5]

3.2 KWICs and concordances
Search: concordances for helps in part of the English corpus of the Leipzig Corpora Collection (newspapers).
Search term (here: helps)
Sort (here: alphabetically by the word to the right of the search term)
Cotext (here: 200 characters)
Hits (here: 56)
[Slide 6]
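
The definition of a concordance and the sorted search for helps can be illustrated with a short Python sketch. It is not how AntConc works internally; the function names, the sample sentence, and the context width are assumptions chosen for illustration. Each hit is one KWIC, and the list of hits, sorted by the first word to the right of the key word, is the concordance.

```python
import re

def kwic(text, keyword, width=30):
    """Collect one (left, key, right) cotext per occurrence of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

def sort_by_right_word(hits):
    """Sort alphabetically by the first word to the right of the key word."""
    def first_right_word(hit):
        words = hit[2].split()
        return words[0].lower() if words else ""
    return sorted(hits, key=first_right_word)

def format_concordance(hits, width=30):
    """Align all cotexts so that the key word sits in the center column."""
    return [f"{left:>{width}} {key} {right:<{width}}" for left, key, right in hits]

# Invented sample text, not taken from the Leipzig corpus.
sample = "Exercise helps the heart. A good map helps a lot. It helps people relax."
concordance = sort_by_right_word(kwic(sample, "helps", width=20))
for line in format_concordance(concordance, width=20):
    print(line)
```

The cotext length here is measured in characters, as in the AntConc example (200 characters); measuring it in words or sentences would only change how left and right are cut.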

3.2 KWICs and concordances
Export of results as a txt file
[Slide 7]

3.2 KWICs and concordances
Search: concordances for depăşeşte in a small collection of Romanian texts (Unicode)
Reset language settings to Unicode (UTF-8) in Global Settings / Language Encoding
[Slide 8]
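
The same two points, reading Unicode text correctly and exporting results to a txt file, also apply when a corpus is processed with a script. The following Python sketch is only an analogy to the AntConc settings: the file names and the two Romanian sample sentences are invented, and the encoding is passed explicitly instead of being set in a dialog.

```python
import tempfile
from pathlib import Path

def find_lines(path, word, encoding="utf-8"):
    """Return all lines of a text file that contain the given word form."""
    text = Path(path).read_text(encoding=encoding)
    return [line for line in text.splitlines() if word in line]

def export_hits(hits, out_path):
    """Write the hits to a UTF-8 encoded txt file, one hit per line."""
    Path(out_path).write_text("\n".join(hits) + "\n", encoding="utf-8")

# Build a tiny invented corpus file in a temporary directory.
workdir = Path(tempfile.mkdtemp())
corpus_file = workdir / "corpus_ro.txt"
corpus_file.write_text("Costul depăşeşte bugetul.\nAltă propoziţie.\n", encoding="utf-8")

hits = find_lines(corpus_file, "depăşeşte")
export_hits(hits, workdir / "results.txt")
print(hits)
```

If the file were read with a different encoding than the one it was saved in, the search for depăşeşte would typically find nothing or fail with a decoding error, which is why the encoding has to be set before searching.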

3.3 Corpus analysis software II: COSMAS II
COSMAS II is the corpus analysis system of the Institut für Deutsche Sprache. It comes in two versions:
COSMAS II Client for Windows
COSMAS II WWW interface
The WWW interface has fewer functions than the client.
Both access the same corpora.
The search is carried out online in both versions.
[Slide 9]

COSMAS II (Windows Client)
Developer: Institut für Deutsche Sprache.
Version: 3.61 (Windows).
Search: online.
Software: local installation.
Access: free download of analysis software; registration necessary.
Corpora: DeReKo (corpora of the IDS).
Languages: German (3.4 bn. running words).
URL: http://www.ids-mannheim.de/cosmas2/install/.
[Slide 10]

After program start: load corpora
[Slide 11]

Search option I: line-based
Step 1: Formulate search request
Search expression, here: &behaupten /+w2 (dass oder daß)
[Search for records of the lexeme behaupten (&behaupten), up to 2 words apart (/+w2) from the word form dass or the word form daß (dass oder daß)]
[Slide 12]
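
What the search expression asks for can be approximated on a plain token list. The Python sketch below is not COSMAS II code: the list of inflected forms is a hypothetical, incomplete stand-in for the lemma operator &, the sample text is invented, and /+w2 is read as "the right-hand operand follows within at most 2 word positions", ignoring sentence boundaries.

```python
import re

# Hypothetical, incomplete stand-in for &behaupten; COSMAS II expands
# the lemma with its own lemmatizer and expansion list.
BEHAUPTEN_FORMS = {"behaupten", "behaupte", "behauptest", "behauptet",
                   "behauptete", "behaupteten"}
DASS_FORMS = {"dass", "daß"}  # corresponds to (dass oder daß)

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def followed_within(tokens, left_forms, right_forms, max_dist):
    """Return (i, j) pairs where tokens[i] is a left form and tokens[j]
    is a right form that follows at a distance of 1..max_dist words."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok not in left_forms:
            continue
        last = min(i + max_dist, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if tokens[j] in right_forms:
                hits.append((i, j))
    return hits

# Invented sample text.
sample = ("Sie behauptet, dass er lügt. "
          "Er behauptete gestern noch, daß alles stimmt.")
tokens = tokenize(sample)
print(followed_within(tokens, BEHAUPTEN_FORMS, DASS_FORMS, 2))
```

With a distance of 2 only the first sentence is found; in the second sentence daß is three words away from behauptete, so it would only be matched by /+w3.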

Step 2: Determine search and lemmatization options
Search options (treatment of upper case, frequency information, sort options, limit on hits).
Lemmatization options (the Grundformenoperator, i.e. the lemma operator, supports the search for inflected forms, compounds, etc.).
[Slide 13]

Step 3: Choose word forms from expansion list
Selection of word forms
[Slide 14]

Step 4: Confirm intermediate statistics of search request
Number of hits for search expression (here: 15904)
Move to display of records
[Slide 15]

Step 5: Request KWICs (menu: Ansicht, "View")
Display (here: Korpusansicht, "corpus view")
Change display (here: request KWICs)
[Slide 16]

Step 6: Request full text
Full text option
[Slide 17]

Result
[Slide 18]

Corpus analysis at the IDS: COSMAS II
Search option II: template-based
Step 1: Formulate search request
Search expression, here: &behaupten /+w2 (dass oder daß)
[Templates can be moved from the left column into the center.]
Further steps: as with the line-based request
[Slide 19]

COSMAS II (WWW interface)
Developer: Institut für Deutsche Sprache.
Version: 1.21.
Search: online.
Software: online.
Access: free; registration necessary.
Corpora: Deutsches Referenzkorpus (IDS corpora).
Languages: German (3.4 bn. running words).
URL: https://cosmas2.ids-mannheim.de/cosmas2-web/.
[Slide 20]

After program start: load corpora
[Slide 21]

After program start: load corpora
[Slide 22]

After program start: load corpora
[Slide 23]

Only search option: line-based
Step 1: Formulate search request
Search expression, here: &behaupten /+w2 (dass oder daß)
[Slide 24]

Step 2 (optional): Determine search and lemmatization options (as with the client)
Options
[Slide 25]

Step 3 (optional): Choose word forms from expansion list
Open expansion lists
Step 4: Display results
Results
[Slide 26]

Step 5: Choose type of KWIC display
Number of hits for search expression (here: 15904)
Options for the display of results (by month, by year, by decade, ...)
KWIC display
[Slide 27]

Step 6: Request full text
Full text option
[Slide 28]

Result
[Slide 29]

Syntax of search language
Some examples

Function | Example | Search target: records with ...
Lemma search | &spielen | any word forms of the lexeme spielen
Word form search | spielte | the word form spielte
Word chain search | &spielen /+w1 &Domino | word chains consisting of any word form of spielen followed by any word form of Domino
Word chain search | spiele /+w1 &Domino | word chains consisting of the word form spiele followed by any word form of Domino
Word part search | *spiel | a word form ending in spiel
Distance search | &spielen /+w3 &Domino | word chains consisting of any word form of spielen followed, at a distance of up to 3 words, by any word form of Domino
AND search | Domino /s0 Schach | both the word form Domino and the word form Schach
Search with tags | (not preserved) | word chains consisting of any word form of haben followed by an infinitive and the word form können
[Slide 30]
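
Two of the operators in the table, the word part search *spiel and the AND search Domino /s0 Schach, can be imitated in a few lines of Python. This is a sketch under assumptions, not the COSMAS II implementation: matching is made case-insensitive here, /s0 is read as "both word forms occur in the same sentence", sentences are split naively at punctuation, and the sample text is invented.

```python
import re
from fnmatch import fnmatchcase

def tokenize(text):
    """Split a text into word tokens, keeping the original case."""
    return re.findall(r"\w+", text)

def wildcard_search(text, pattern):
    """Return tokens matching a pattern such as '*spiel' (case-insensitive)."""
    return [tok for tok in tokenize(text)
            if fnmatchcase(tok.lower(), pattern.lower())]

def same_sentence(text, form_a, form_b):
    """Return the sentences that contain both word forms."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if {form_a, form_b} <= set(tokenize(s))]

# Invented sample text.
sample = ("Wir spielen Domino und Schach. "
          "Das Schachspiel dauert lange. "
          "Domino ist ein Spiel.")
print(wildcard_search(sample, "*spiel"))
print(same_sentence(sample, "Domino", "Schach"))
```

The wildcard finds Schachspiel and Spiel but not spielen, because the pattern requires the word form to end in spiel; the sentence search returns only the first sentence, where both word forms occur together.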

Example of a search in COSMAS II
Looking for: dass-clauses as sentential subjects of the verb helfen ("to help").
Assumption: Sentential subjects with helfen often occur in constructions like <[ ] es [ ] hilft, dass/daß>.
Search: (es /+w3 &helfen) /+w1 (dass oder daß)
Examples
T04 Der SPD hat es nicht geholfen, dass der Sympathieträger und
B99 Uns könne es nur helfen, dass wir so früh den Weg zu
B02 Vielleicht hat es Metzelder geholfen, dass die Kollegen seinen
E96 Da wird es auch nicht helfen, dass der Publikumsrat
E99 Mir hat es viel geholfen, dass ich Kabuki-Theater
N98 "Uns könnte es helfen, daß gleichzeitig Landtagswahl ist",
P93 Saddam Hussein könnte es helfen, daß Zulieferstaaten... eine volle
P98 "Wenn es Saddam hilft, daß Unscom von Diplomaten
R99 Was kann es nun helfen, daß inzwischen 13 der 15
[Slide 31]
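
The nesting in this search expression can be made explicit with small composable matchers. The following Python sketch is an approximation, not COSMAS II itself: the list of forms of helfen is a hypothetical, incomplete stand-in for &helfen, and the distance of the outer operator is assumed to be counted from the last word of the bracketed left-hand match. The test sentences are the example records quoted above.

```python
import re

# Hypothetical, incomplete stand-in for the lemma operator &helfen.
HELFEN_FORMS = {"helfen", "helfe", "hilfst", "hilft", "helft",
                "half", "halfen", "geholfen"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def word(forms):
    """Matcher for single tokens; returns spans (start, end), start == end."""
    def match(tokens):
        return [(i, i) for i, tok in enumerate(tokens) if tok in forms]
    return match

def followed_within(left, right, max_dist):
    """Matcher for 'left /+wN right': the right match starts 1..N words
    after the end of the left match (an assumption about nested spans)."""
    def match(tokens):
        right_spans = right(tokens)
        return [(l_start, r_end)
                for l_start, l_end in left(tokens)
                for r_start, r_end in right_spans
                if 1 <= r_start - l_end <= max_dist]
    return match

# (es /+w3 &helfen) /+w1 (dass oder daß)
query = followed_within(
    followed_within(word({"es"}), word(HELFEN_FORMS), 3),
    word({"dass", "daß"}),
    1,
)

records = [
    "Der SPD hat es nicht geholfen, dass der Sympathieträger und",
    "Wenn es Saddam hilft, daß Unscom von Diplomaten",
]
for record in records:
    print(query(tokenize(record)), record)
```

Because each matcher returns spans, the result of the inner expression can serve directly as the left operand of the outer one, which mirrors the role of the brackets in the search language.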