Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008

Content
1. Empirical linguistics
2. Text corpora and corpus linguistics
3. Concordances
4. Application I: The German progressive
5. Part-of-speech tagging
6. Frequency analysis
7. Application II: Compounds
8. Co-occurrence analysis
   8.1 Cluster analysis
   8.2 Co-occurrence
   8.3 CCDB & IDS co-occurrence analysis
   8.4 Searching for collocations
9. Application III: Word senses in lexicography
10. Keyword analysis

Word group analysis

8.1 Cluster analysis

Cluster
A cluster is a chain of linguistic entities. In "er sprach vor einem großen Publikum" ("he spoke before a large audience"), spr is a consonant cluster consisting of 3 consonants, and sprach vor einem is a word cluster consisting of 3 words.

n-gram
An n-gram is a sequence of n linguistic elements of the same type (Kunze & Lemnitzer 2007: 190). A 4-gram of words is a sequence of 4 words. An n-gram is the same as an n-cluster. The term n-gram is used in particular when all n-clusters are extracted from a corpus.

Kunze, Claudia and Lothar Lemnitzer. Computerlexikographie. Eine Einführung. Tübingen: Narr [E-Book], 2007, p. 190.
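The definition of an n-gram can be illustrated with a minimal Python sketch. It assumes naive whitespace tokenization, and the function name ngrams is chosen here for illustration only; it is not taken from any of the tools discussed in the workshop.

```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-grams (as tuples) of a sequence of items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Example sentence from the slide, split naively on whitespace.
tokens = "er sprach vor einem großen Publikum".split()

# Word clusters of size 3 (word trigrams).
word_trigrams = ngrams(tokens, 3)

# The same function works on characters: 'spr' is the first 3-gram of 'sprach'.
char_trigrams = ngrams("sprach", 3)

# Extracting all bigrams and counting them yields a frequency list.
bigram_counts = Counter(ngrams(tokens, 2))

for gram in word_trigrams:
    print(" ".join(gram))
```

Applied to a whole corpus instead of one sentence, the counted bigrams correspond to the kind of ranked cluster list that a concordance program displays.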

Search: clusters of 2 words ending in off, in part of the English corpus of the LCC

Annotations on the screenshot:
  Search term (here: off)
  Search term position (here: on the right)
  Size of cluster (here: clusters of two words)
  Frequency condition (here: at least three tokens)
  Sort (here: according to frequency of the cluster)
  List of bigrams with rank and frequency

8.2 Co-occurrence

Co-occurrence
In a general sense, the term co-occurrence refers to the occurrence of two expressions close to each other. In a more specific sense, the term co-occurrence is used when the two expressions occur together more often than would be expected if all words were distributed by chance.

Co-occurrence analysis: the basic idea
1) Assumption: In a certain corpus, word X occurs 1,000 times, word Y 100 times, and word Z 10 times.
2) Probability: If all words were distributed by chance, the combination XY would be ten times as likely as the combination XZ, so XY should occur ten times as often as XZ.
3) Observation: Actually, XZ occurs about as often as XY.
4) Conclusion: There is a close linguistic connection between X and Z (closer than expected by chance).

Kunze, Claudia and Lothar Lemnitzer. Computerlexikographie. Eine Einführung. Tübingen: Narr [E-Book], 2007, p. 391f.
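The four steps of the basic idea can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: the corpus size of 1,000,000 tokens and the observed pair counts are hypothetical (the slide gives neither), and the expected count is approximated as f(A) * f(B) / N, which treats the words as independently distributed.

```python
def expected_cooccurrences(freq_a, freq_b, corpus_size):
    """Approximate number of co-occurrences expected under chance distribution."""
    return freq_a * freq_b / corpus_size


corpus_size = 1_000_000                       # assumption, not from the slide
freq = {"X": 1000, "Y": 100, "Z": 10}         # frequencies from step 1
observed = {("X", "Y"): 20, ("X", "Z"): 20}   # hypothetical: "about as often"

results = {}
for (a, b), obs in observed.items():
    exp = expected_cooccurrences(freq[a], freq[b], corpus_size)
    results[a + b] = {"expected": exp, "observed": obs, "ratio": obs / exp}
    print(f"{a}{b}: expected {exp:.2f}, observed {obs}, "
          f"observed/expected {obs / exp:.0f}")
```

With equal observed counts, XZ exceeds its expectation by a much larger factor than XY does, which is the sense in which the connection between X and Z is closer than chance would predict.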

Search: co-occurrences for just in part of the English corpus of the LCC

Annotations on the screenshot:
  Search term (here: just)
  Definition of search context (here: up to 2 words after the search term)
  Frequency condition (here: at least 10 tokens)
  Sort (here: according to significance of co-occurrence)
  List of co-occurrence partner words with rank, frequency, and significance measure

8.3 CCDB & IDS co-occurrence analysis

Co-occurrence analysis at the IDS
Access:
  via COSMAS II WWW interface
  via COSMAS II client
  via CCDB (co-occurrence database)

WWW interface and client: Co-occurrences are computed online (takes some time); several options for fine-tuning the analysis are available.
CCDB: Results of co-occurrence analyses are stored (fast access); no fine-tuning of the analysis; automatic comparison of collocation profiles is available.

Source: Belica, Cyril: Kookkurrenzdatenbank CCDB. Eine korpuslinguistische Denk- und Experimentierplattform für die Erforschung und theoretische Begründung von systemisch-strukturellen Eigenschaften von Kohäsionsrelationen zwischen den Konstituenten des Sprachgebrauchs. 2001-2007 Institut für Deutsche Sprache, Mannheim.

Application example II: Co-occurrences of bestehen

Question: co-occurrences for bestehen (in particular governed prepositions).

Co-occurrence analysis for bestehen in the CCDB (setting: do not ignore function words)

Annotations on the screenshot:
  Primary co-occurrence partner of bestehen (here: aus)
  Strength of the connection (here: 40683)
  Secondary co-occurrence partners of bestehen + aus, here: aus Mitgliedern / Teilen / Ortsteilen bestehen (consist of members / parts / suburbs)
  Typical syntagmatic patterns in which the words co-occur, e.g. besteht aus [ ] [zwei drei] Teilen (consists of [ ] [two three] parts)

8.3 CCDB & IDS co-occurrence analysis

Results (among others)
aus:
  besteht [ ] aus (consists of [ ])
  besteht [ ] aus [ ] Mitgliedern (consists [ ] of [ ] members)
darin:
  besteht [ ] darin, dass (is [ ] that)
  die Schwierigkeit [ ] besteht [ ] darin, dass (the difficulty [ ] is [ ] that)
darauf:
  besteht [ ] darauf, dass (insists [ ] that)
  er bestand [ ] darauf, dass (he insisted [ ] that)
worin:
  worin [ ] besteht
  worin [ ] besteht der Unterschied zwischen (what [ ] is the difference between)

Governed prepositions: auf, aus, in
The prepositions auf and in occur in particular with prepositional complement clauses (darauf, dass; darin, dass).
The preposition in often occurs in interrogative sentences (worin).

8.4 Searching for collocations

Exploration of collocations and fixed expressions
Article from a German-Mongolian dictionary (preliminary version). Example: 20 Flaschen à 8 Euro (20 bottles at 8 euros each)

Task: Find relevant collocations and fixed expressions containing à.
Procedure:
1) Retrieve concordances from a smaller corpus (AntConc with part of the German corpus from the Leipzig Corpus Collection).
2) Carry out a co-occurrence analysis (CCDB, Deutsches Referenzkorpus).
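Step 1 of the procedure, retrieving concordances, can be sketched as a simple keyword-in-context (KWIC) search. This is a minimal illustration, not a description of how AntConc works internally; the function name kwic, the context width, and the sample text are hypothetical (only the phrase 20 Flaschen à 8 Euro comes from the dictionary article).

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical mini-corpus, tokenized naively on whitespace.
text = ("Er kaufte 20 Flaschen à 8 Euro und kochte à la Oma ; "
        "peu à peu wurde es besser")
hits = kwic(text.split(), "à")

# Print the hits with the keyword aligned in one column.
for left, keyword, right in hits:
    print(f"{left:>25}  {keyword}  {right}")
```

Reading down the aligned keyword column makes recurring neighbours such as la or peu easy to spot before any statistical analysis is run.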

8.4 Searching for collocations

Concordances for à in a 1-million-RW (running words) selection of the German corpus within the LCC
  Fixed expression à la (after the fashion of): 5 out of 10 hits
  Fixed expression peu à peu (bit by bit): 1 out of 10 hits

Co-occurrence analysis on the basis of the Deutsches Referenzkorpus (2 bn. RW); COSMAS II WWW interface
  la is the most significant co-occurrence partner of à (log-likelihood ratio: 135300)
  peu is the second most significant co-occurrence partner of à (log-likelihood ratio: 15974)

Both collocations, à la and peu à peu, are missing from the dictionary.
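The log-likelihood ratios reported above are significance scores for a word pair. The following sketch shows one common formulation, the G² statistic over a 2x2 contingency table; it is an assumption that this matches the general idea, and the exact computation in COSMAS II may differ. All counts in the example are hypothetical and are not drawn from the Deutsches Referenzkorpus.

```python
import math


def log_likelihood_ratio(o11, o12, o21, o22):
    """G2 statistic for a 2x2 table of observed counts.

    o11: contexts containing both words
    o12: contexts with the search term but without the partner
    o21: contexts with the partner but without the search term
    o22: contexts containing neither word
    """
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with count 0 contributes nothing
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


# Hypothetical tables with identical marginal totals (100 / 900).
chance = log_likelihood_ratio(10, 90, 90, 810)   # observed equals expected
weak = log_likelihood_ratio(20, 80, 80, 820)     # pair somewhat over-represented
strong = log_likelihood_ratio(50, 50, 50, 850)   # pair strongly over-represented

print(chance, weak, strong)
```

The score is zero when the pair occurs exactly as often as chance predicts and grows as the observed counts move away from the expected ones, which is why sorting by it ranks partners such as la above peu.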

VICOMTE Kookkurrenzexplorer (co-occurrence explorer)
i) Primary and secondary co-occurrence partners are shown as a diagram
ii) Co-occurrence partners can be annotated

iii) Co-occurrence partners can be grouped
iv) Refinement of the description

Perkuhn, Rainer: Systematic Exploration of Collocation Profiles. In: Proceedings of 4th Corpus Linguistics 2007, Birmingham. http://corpus.bham.ac.uk/corplingproceedings07/paper/132_paper.pdf