The Use of Text Corpora in Lexical Research




The Use of Text Corpora in Lexical Research
Stefan Engelberg
Workshop, Universitatea din Bucureşti, November 2008
http://www.ids-mannheim.de/ll/lehre/engelberg/Webseite_CorpLex/CorpLex.html
engelberg@ids-mannheim.de
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Slide 1]

Content
1. Empirical linguistics
2. Text corpora and corpus linguistics
3. Concordances
4. Application I: The German progressive
5. Part-of-speech tagging
6. Frequency analysis
7. Application II: Compounds
8. Co-occurrence analysis
9. Application III: Word senses in lexicography
10. Keyword analysis

Corpus analysis software
AntConc
COSMAS II
Corpus Browser (Leipzig)
DWDS corpora & analysis
KWIC Finder

[Slide 2]

1. Empirical linguistics

1.1 Empiricism

[Slide 3]

Empiricism in philosophy

Empiricism: Empiricism is an epistemological position that emphasizes that nothing can be known unless it is substantiated by information we gain from the senses. All concepts are based on experience. All true statements are based on experience or are logically derived from statements that are based on experience. (E.g., TIME is a concept derived from the nature of observable events.)

Rationalism: Rationalism is an epistemological position that emphasizes that knowledge can be gained from reasoning based (at least partly) on concepts that are given a priori. Basic concepts are innate. (Some) true statements can be made without recourse to experience. (E.g., TIME is an innate concept, prior to experience.)

[Slide 4]

Empiricism in science

Empirical research: Empirical research bases all its findings on direct or indirect observation. It proceeds inductively from data to theory formation. It tends to be data-driven, i.e. based on data that are analyzed with few preconceived assumptions about their structure. (E.g., in linguistics, it is based on extensive corpus-linguistic research or psycholinguistic experiments rather than on single, subjective grammaticality judgements.)

[Slide 5]

The END of a scientific investigation

Theory: A theory is a set of connected statements about a part of reality. The set of statements includes descriptions, explanations and laws. A theory is falsifiable, and it allows predictions to be made about phenomena that have not been observed before.

[Slide 6]

The BEGINNING of a scientific investigation

Wondering: "Wonder [...] and not any expectation of advantage from its discoveries, is the first principle which prompts mankind to the study of Philosophy, of that science which pretends to lay open the concealed connections that unite the various appearances of nature; and they pursue this study for its own sake, as an original pleasure or good in itself, without regarding its tendency to procure them the means of many other pleasures." (Adam Smith, The History of Astronomy, Section III)

[Slide 7]

The WAY of empirical investigations

Wondering: wondering about a particular state of affairs; a "Why" question
Inquiry: examination of current theories about the phenomenon
Hypothesis: explicit formulation of a statement about the phenomenon that can be justified and is presumed to be true
Research design: design of an empirical investigation
Data acquisition: questionnaire, psycholinguistic experiment, corpus study, ...
Data processing: transcription, collection of data in databases, cleaning of corpus data, ...
Data analysis: classification, typology, qualitative analysis, ...
Theory: interpretation of data, formulation of theoretical statements
Feedback into the cycle: falsification; acquisition of further data

[Slide 8]

Two types of empirical investigations

Testing hypotheses (cf. previous slide):
Starting point: a hypothesis derived from a theory, a preparatory study, or a personal conviction.
Result: the confirmation or falsification of the hypothesis.

Exploring hypotheses:
Starting point: the investigation and systematization of data, from which hypotheses about the general nature of the phenomena can be derived.
Result: scientific hypotheses.

[Slide 9]

Data are states of affairs that can be observed and that are used to argue for or against theoretical assumptions. Linguists use different types of data, which are obtained in very different ways:

(I) Introspective grammaticality judgements

As a native speaker of German, I judge the following sentences as indicated:
*Peter hilft seinen Freund (accusative object; intended: 'Peter helps his friend')
Peter hilft seinem Freund (dative object; 'Peter helps his friend')

[Slide 10]

(II) Grammaticality judgements on the basis of operational tests

Verb phrases of a particular semantic type (so-called accomplishments) can be identified by a span-of-time test: they can be combined with an adverbial of the type in x Tagen / Stunden / Minuten ('in x days / hours / minutes'):
(1) Er hat den Wagen in drei Stunden repariert. ('He repaired the car in three hours.')
(2) *Er hat den Mechaniker in drei Stunden gehasst. ('He hated the mechanic in three hours.')

Vendler, Zeno (1957): Verbs and Times. In: The Philosophical Review LXVI, 143-60.

[Slide 11]

(III) Graded grammaticality judgements on the basis of a larger number of native speakers

On an acceptability scale from 1 (entirely ok) to 5 (unintelligible), 180 subjects rated the following sentence at 2.9 on average:
Das echte Überraschen der Kinder beim Anblick der Eltern war niemandem entgangen. (roughly: 'The genuine surprising of the children at the sight of the parents had escaped no one.')

Blume, Kerstin (2004): Nominalisierte Infinitive. Eine empirisch basierte Studie zum Deutschen. Tübingen: Niemeyer. p. 69.

[Slide 12]
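The averaging behind such a graded judgement can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_ratings` and the ten ratings are invented here and are not Blume's data or procedure; the sketch simply checks that every rating lies on the 1-5 scale and then reports the number of subjects, the mean and the standard deviation.

```python
from statistics import mean, stdev

def summarize_ratings(ratings, scale=(1, 5)):
    """Return (n, mean, standard deviation) for a list of acceptability ratings."""
    low, high = scale
    for r in ratings:
        # Reject values that do not lie on the rating scale.
        if not low <= r <= high:
            raise ValueError(f"rating {r} lies outside the scale {low}-{high}")
    return len(ratings), mean(ratings), stdev(ratings)

# Hypothetical ratings from ten subjects (invented for illustration only).
example_ratings = [2, 3, 3, 4, 2, 3, 1, 4, 3, 4]
n, avg, sd = summarize_ratings(example_ratings)
print(f"n={n}, mean={avg:.1f}, sd={sd:.2f}")
```

Reporting the standard deviation alongside the mean is a design choice of this sketch: a mean near the middle of the scale can arise from uniform "so-so" ratings or from subjects who disagree sharply, and the spread distinguishes the two.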

(IV) Results of psycholinguistic experiments

If two words are presented in succession and subjects have to decide as fast as possible whether the second word is a word of German, they solve this task faster in case (2) than in case (1):
(1) Fahrrad ('bicycle') Doktor ('doctor').
(2) Krankenschwester ('nurse') Doktor ('doctor').

Linke, Angelika, Markus Nussbaumer & Paul R. Portmann (1994). Studienbuch Linguistik. 2nd ed. Tübingen: Niemeyer. p. 342.

[Slide 13]

(V) Scans on the basis of imaging techniques

When trilingual subjects were asked to correct grammatical errors, they exhibited brain activity in different areas of the brain (language 1: orange; language 2: blue; language 3: green). The left image shows the activity in Broca's area of a subject who learnt two of the three languages before the age of three (one network for the two languages); the right image shows the activity of a subject who learnt two of the three languages after the age of ten (one network for each language).

Kramer, Katharina (2003): Wie Werde ich ein Sprachgenie? In: Gehirn & Geist 2, 48-50.

[Slide 14]

(VI) Results of electroencephalography (EEGs)

The EEG measurement relative to a particular event (event-related potential), namely the presentation of sentences (1) through (3), shows differences in the area of the N400 component:
(1) The pizza was too hot to eat.
(2) The pizza was too hot to drink.
(3) The pizza was too hot to cry.

Kutas, M. & C. Van Petten (1994): Psycholinguistics electrified: event-related brain potential investigations. In: Gernsbacher, M.A. (Ed.): Handbook of Psycholinguistics. San Diego: Academic Press, 83-143. After: http://www.uni-bielefeld.de/lili/projekte/neuroling/neurolinguistics.html.

[Slide 15]

(VII) Utterances of speakers with neurological impairment

Dialogue between a patient (40 years old, housewife, stroke) and a (female) examiner (P = patient, U = examiner):

P: äh / is vor drei Jahren Unfall / äh / im Auto passiert /
U: ein Unfall? /
P: ja / im / äh / Stuttgart /
U: ahja /
P: und zehn Tage später / äh / Sprache weg / äh / ich / zehn Tage /äh / blaß und / äh / immer Kopfschmerzen / und am / am morgens steht mir / äh / ich / äh / am morgens steht mir / äh / am morgens aufst / aufgestanden / die Kinder geweckt / und mein Mann / und in die Küche / äh / das Brotmesser und / äh / äh / Brot schneiden / und der Brotmesser aus der Hand fallt / ach Gott! / Brotmesser aufgenommen / also der is nicht.../ he?! /und / ich hab ge / äh / gesprochen! / äh / fünf Minuten später / gar nix! / und dann.../
U: Sind Sie ohnmächtig dann geworden?
P: Nein! / mein Mann / äh / is / äh / ich in Bett / ge / gegegangen und / äh / und / äh / der Arzt gerufen / und / äh / aufschreiben / gar nix / weg! / und dann dann / äh / ohnmä / ne nich / ohnmächtig! / das is nicht / äh / so / äh / so gef / äh / ganz / ganz / weit weg / Gedanken // ja und dann.../

Peuser, Günter (1978): Aphasie. Eine Einführung in die Patholinguistik. München: Fink. pp. 408f.

[Slide 16]

(VIII) Utterances of (healthy) speakers from electronic text corpora

Concordance for besteht, created by AntConc, version 3.2.1w, on the basis of part of the German newspaper corpus within the Leipzig Corpus Collection.

[Slide 17]

(IX) Statistical measurements

Statistical computation of the collocation potential of the word bestehen on the basis of the Deutsches Referenzkorpus; excerpt from the CCDB (cooccurrence database).

Belica, Cyril: Kookkurrenzdatenbank CCDB. 2001-2007 Institut für Deutsche Sprache, Mannheim.

[Slide 18]
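What a concordance and a simple co-occurrence count look like can be sketched in Python. The sketch below is an illustration under stated assumptions, not a description of how AntConc or the CCDB work: the functions `tokenize`, `kwic` and `collocates` and the three-sentence sample text are invented here, and the collocate list uses raw window counts rather than the statistical measures a co-occurrence database computes.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text)

def kwic(tokens, keyword, width=4):
    """Yield (left context, hit, right context) for each occurrence of keyword."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield left, tok, right

def collocates(tokens, keyword, window=2):
    """Count the words found within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w.lower() for w in neighbours)
    return counts

# Toy sample invented for illustration; a real study would read corpus files.
sample = ("Das Team besteht aus elf Spielern. "
          "Es besteht kein Zweifel. "
          "Wasser besteht aus Wasserstoff und Sauerstoff.")
tokens = tokenize(sample)

# Keyword-in-context lines, with the hit aligned in the middle column.
for left, hit, right in kwic(tokens, "besteht"):
    print(f"{left:>30} | {hit} | {right}")

# Raw co-occurrence counts in a window of two tokens on either side.
print(collocates(tokens, "besteht").most_common(2))
```

Because the tokenizer discards punctuation, contexts in this sketch run across sentence boundaries. Raw counts also favour frequent function words, which is why collocation analyses normally weigh observed co-occurrence against the frequency expected by chance.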