Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Slide 1]



Transcription:

[Slide 1] Content

1. Empirical linguistics
2. Text corpora and corpus linguistics
3. Concordances
4. Application I: The German progressive
5. Part-of-speech tagging
6. Frequency analysis
7. Application II: Compounds
8. Co-occurrence analysis
   8.1 Cluster analysis
   8.2 Co-occurrence
   8.3 CCDB & IDS co-occurrence analysis
   8.4 Searching for collocations
9. Application III: Word senses in lexicography
10. Keyword analysis

[Slide 2] Word group analysis

8.1 Cluster analysis

Cluster: A cluster is a chain of linguistic entities. In er sprach vor einem großen Publikum ("he spoke before a large audience"), spr is a consonant cluster consisting of 3 consonants, and sprach vor einem is a word cluster consisting of 3 words.

n-gram: An n-gram is a sequence of n linguistic elements of the same type (Kunze & Lemnitzer 2007: 190). A 4-gram of words is a sequence of 4 words. An n-gram is the same as an n-cluster. The term n-gram is used in particular if all n-clusters are extracted from a corpus.

Kunze, Claudia and Lothar Lemnitzer. Computerlexikographie. Eine Einführung. Tübingen: Narr [E-Book], 2007, p. 190.
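The definition of an n-gram can be made concrete with a short Python sketch. It is only an illustration, not a tool used in the talk: the helper name ngrams is hypothetical, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter

def ngrams(items, n):
    """Return all contiguous sequences of n items as tuples (hypothetical helper)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Example sentence from the slide; whitespace splitting is a simplification.
tokens = "er sprach vor einem großen Publikum".split()

# Word clusters of size 3 (word 3-grams), including "sprach vor einem".
word_trigrams = ngrams(tokens, 3)

# Character 3-grams of a single word; the first one is the cluster s-p-r.
char_trigrams = ngrams("sprach", 3)

# Counting every n-gram of a text is what "extracting all n-clusters" amounts to.
trigram_counts = Counter(word_trigrams)

for gram in word_trigrams:
    print(" ".join(gram))
```

A 4-gram extracted this way always contains exactly 4 items, in line with the definition above.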

[Slide 3] Search: clusters of 2 words ending in off, in part of the English corpus of the LCC (Leipzig Corpus Collection)

Settings and output shown in the screenshot:
Search term (here: off)
Search term position (here: on right)
Size of cluster (here: clusters of two words)
Frequency condition (here: at least three tokens)
Sort (here: according to frequency of the cluster)
List of bigrams with rank and frequency

[Slide 4] 8.2 Co-occurrence

Co-occurrence: In a general sense, the term co-occurrence refers to the occurrence of two expressions close to each other. In a more specific sense, the term co-occurrence is used when the two expressions occur together more often than would be expected if all words were distributed by chance.

Co-occurrence analysis, the basic idea:
1) Assumption: In a certain corpus, word X occurs 1,000 times, word Y 100 times, word Z 10 times.
2) Probability: The combination XY is ten times as likely as the combination XZ. XY should occur ten times as often as XZ.
3) Observation: Actually, XZ occurs about as often as XY.
4) Conclusion: There is a close linguistic connection between X and Z (close beyond expectation).

Kunze, Claudia and Lothar Lemnitzer. Computerlexikographie. Eine Einführung. Tübingen: Narr [E-Book], 2007, p. 391f.
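The four steps above can be traced with a small worked example in Python. The word frequencies are those given on the slide. The corpus size, the observed pair counts, and the simple chance model (expected pair frequency = f(a) * f(b) / N) are assumptions made only for this illustration.

```python
def expected_cooccurrences(freq_a, freq_b, corpus_size):
    """Expected pair frequency if words were distributed by chance (assumed model)."""
    return freq_a * freq_b / corpus_size

N = 1_000_000                          # assumed corpus size, not given in the talk
freq = {"X": 1000, "Y": 100, "Z": 10}  # frequencies from step 1

expected_xy = expected_cooccurrences(freq["X"], freq["Y"], N)
expected_xz = expected_cooccurrences(freq["X"], freq["Z"], N)

# Step 2: by chance alone, XY is about ten times as likely as XZ.
print("expected XY / expected XZ:", expected_xy / expected_xz)

# Step 3: invented observation in which XZ occurs as often as XY.
observed_xy = 5
observed_xz = 5

# Step 4: XZ exceeds its chance expectation far more strongly than XY does.
ratio_xy = observed_xy / expected_xy
ratio_xz = observed_xz / expected_xz
print("observed/expected for XY:", ratio_xy)
print("observed/expected for XZ:", ratio_xz)
```

The larger observed-to-expected ratio for XZ is what step 4 describes as a connection that is close beyond expectation.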

[Slide 5] Search: co-occurrences for just, in part of the English corpus of the LCC

Settings and output shown in the screenshot:
Search term (here: just)
Definition of search context (here: up to 2 words after the search term)
Frequency condition (here: at least 10 tokens)
Sort (here: according to significance of co-occurrence)
List of co-occurrence partner words with rank, frequency, and significance measure

[Slide 6] 8.3 CCDB & IDS co-occurrence analysis

Co-occurrence analysis at the IDS

Access:
via COSMAS II WWW interface
via COSMAS II client
via CCDB (co-occurrence database)

WWW interface and client: Co-occurrences are computed online (takes some time); several options for fine-tuning the analysis are available.

CCDB: Results of co-occurrence analyses are stored (fast access); no fine-tuning of the analysis; automatic comparison of collocation profiles is available.

Source: Belica, Cyril: Kookkurrenzdatenbank CCDB. Eine korpuslinguistische Denk- und Experimentierplattform für die Erforschung und theoretische Begründung von systemisch-strukturellen Eigenschaften von Kohäsionsrelationen zwischen den Konstituenten des Sprachgebrauchs. 2001-2007 Institut für Deutsche Sprache, Mannheim.
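The search settings listed above for just (a search context of up to 2 words after the search term and a minimum frequency) can be sketched in a few lines of Python. This is a simplified illustration, not the implementation behind the LCC search tool, COSMAS II, or the CCDB: the function name and the sample text are hypothetical, and partners are sorted by raw frequency rather than by a significance measure.

```python
from collections import Counter

def right_context_partners(tokens, node, span=2, min_freq=1):
    """Count words occurring up to `span` positions after `node` (hypothetical helper).

    Returns (word, frequency) pairs with frequency >= min_freq,
    sorted by descending frequency.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[i + 1:i + 1 + span])
    return [(word, f) for word, f in counts.most_common() if f >= min_freq]

# Invented sample text, for illustration only.
sample = "it is just a test and just a joke but just in time".split()

print(right_context_partners(sample, "just", span=2, min_freq=2))
```

On a real corpus, the frequency condition would be raised (the slide uses at least 10 tokens), and the resulting list would be re-sorted by a significance score.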

[Slide 7] Application example II: Co-occurrences of bestehen

Question: co-occurrences for bestehen (in particular governed prepositions).

Screenshot: co-occurrence analysis for bestehen as part of the CCDB (setting: do not ignore function words).

[Slide 8] Application example II: Co-occurrences of bestehen

Elements of the CCDB output shown in the screenshot:
Primary co-occurrence partner of bestehen (here: aus)
Strength of the connection (here: 40683)
Secondary co-occurrence partners of bestehen + aus, here: aus Mitgliedern / Teilen / Ortsteilen bestehen, "consist of members / parts / suburbs"
Typical syntagmatic patterns in which the words co-occur, e.g. besteht aus [ ] [zwei drei] Teilen, "consists of [ ] [two three] parts"

[Slide 9] 8.3 CCDB & IDS co-occurrence analysis

Results (among others):

aus:
  besteht [ ] aus ("consists of [ ]")
  besteht [ ] aus [ ] Mitgliedern ("consists [ ] of [ ] members")
darin:
  besteht [ ] darin, dass ("is [ ] that")
  die Schwierigkeit [ ] besteht [ ] darin, dass ("the difficulty [ ] is [ ] that")
darauf:
  besteht [ ] darauf, dass ("insists [ ] that")
  er bestand [ ] darauf, dass ("he insisted [ ] that")
worin:
  worin [ ] besteht
  worin [ ] besteht der Unterschied zwischen ("what [ ] is the difference between")

Governed prepositions: auf, aus, in.
The prepositions auf and in occur in particular with complement clauses (darauf, dass; darin, dass).
The preposition in often occurs in interrogative sentences (worin).

[Slide 10] 8.4 Searching for collocations

Exploration of collocations and fixed expressions

Article from a German-Mongolian dictionary (preliminary version), with the example 20 Flaschen à 8 Euro, "20 bottles at 8 Euros each".

Task: Find relevant collocations and fixed expressions containing à.

Procedure:
1) Retrieve concordances from a smaller corpus (AntConc with part of the German corpus from the Leipzig Corpus Collection).
2) Carry out a co-occurrence analysis (CCDB, Deutsches Referenzkorpus).
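Step 1 of the procedure, retrieving concordances, can be illustrated with a minimal keyword-in-context (KWIC) sketch in Python. It does not reproduce AntConc; the function name kwic is hypothetical, and the sample sentence is invented around the dictionary example 20 Flaschen à 8 Euro.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit (hypothetical helper)."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented sample sentence, for illustration only.
text = "Wir kauften 20 Flaschen à 8 Euro und ein Menü à la carte"
hits = kwic(text.split(), "à")

# Align the keyword in one column, as concordance tools usually do.
for left, key, right in hits:
    print(f"{left:>25}  {key}  {right}")
```

Reading down the right-hand context of such a listing is how recurring partners such as la become visible.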

[Slide 11] 8.4 Searching for collocations

Concordances for à in a selection of 1 million running words (RW) from the German corpus within the LCC:
Fixed expression à la, "after the fashion of" (5 out of 10 hits)
Fixed expression peu à peu, "bit by bit" (1 out of 10 hits)

[Slide 12] Co-occurrence analysis on the basis of the Deutsches Referenzkorpus (based on 2 billion running words); COSMAS II WWW interface

la is the most significant co-occurrence partner of à (log likelihood ratio: 135300).
peu is the second most significant co-occurrence partner of à (log likelihood ratio: 15974).

Both collocations, à la and peu à peu, are missing in the dictionary.
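The significance values above are log likelihood ratios. A common textbook way to compute such a score is the G2 statistic over a 2x2 contingency table, sketched below in Python. Whether COSMAS II computes its values in exactly this way is an assumption, and the counts in the example are invented; they are not the counts behind 135300 or 15974.

```python
import math

def log_likelihood_ratio(o11, o12, o21, o22):
    """G2 statistic for a 2x2 contingency table (textbook formulation).

    o11: contexts containing both words     o12: first word without the second
    o21: second word without the first      o22: contexts containing neither
    """
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total

# Invented counts: the pair occurs together far more often than chance predicts.
strong = log_likelihood_ratio(50, 50, 50, 850)

# Invented counts that match the chance expectation exactly.
independent = log_likelihood_ratio(10, 90, 90, 810)

print(strong, independent)
```

Higher values indicate that the observed distribution departs more strongly from chance, which is why la ranks above peu in the list above.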

[Slide 13] VICOMTE Kookkurrenzexplorer (co-occurrence explorer)

i) Primary and secondary co-occurrence partners are shown in a diagram

[Slide 14] ii) Co-occurrence partners can be annotated

[Slide 15] iii) Co-occurrence partners can be grouped

[Slide 16] iv) Refinement of description

Perkuhn, Rainer: Systematic Exploration of Collocation Profiles. In: Proceedings of 4th Corpus Linguistics 2007, Birmingham. http://corpus.bham.ac.uk/corplingproceedings07/paper/132_paper.pdf