Corpuslinguistics Annis 2 -Corpus query tool searching in Falko

Similar documents
Topological Field Chunking in German

NoSta-D: A Corpus of German Non-standard Varieties

German Language Resource Packet

Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1

Machine Learning for natural language processing

LEJ Langenscheidt Berlin München Wien Zürich New York

Abstract. 1 Introduction. 2 Subcategorisation Extraction: Tool, Rules and Resource Database

CorA: A web-based annotation tool for historical and other non-standard language data

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov [Folie 1]

The ALeSKo learner corpus: Design annotation quantitative. [short title for running heads: The ALeSKo learner corpus]

Coffee Break German. Lesson 09. Study Notes. Coffee Break German: Lesson 09 - Notes page 1 of 17

Big Walnut High School: German II

Dutch Parallel Corpus

WebLicht: Web-based LRT services for German

Gerlang 1 Spring 2014 Tentativer Kursplan

Statistical Machine Translation

EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language

Pragmatic analysis of hotel websites in terms of interpersonal relationships. Theses of the PhD dissertation by. Kovács Péterné Dudás Andrea

Satzstellung. Satzstellung Theorie. learning target. rules

The University of Toronto. Fall 2009/German 100 Y

1003 Inhaltsverzeichnis

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov [Folie 1]

Processing Dialogue-Based Data in the UIMA Framework. Milan Gnjatović, Manuela Kunze, Dietmar Rösner University of Magdeburg

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

Coffee Break German Lesson 06

bound Pronouns

Curriculum Vitae Dr. Amir Zeldes

Johns Hopkins University German Elements I Intersession 2015 MTWTHFR: 1:00 3:30, ROOM TBA

Ling 201 Syntax 1. Jirka Hana April 10, 2006

The finite verb and the clause: IP

Syntax: Phrases. 1. The phrase

Probability and statistical hypothesis testing. Holger Diessel

The CroCo Translation Archive

PoS-tagging Italian texts with CORISTagger

Summary of Quota allocation as per

Year 7 outline GERMAN following Echo 1 All beginner groups

Exemplar for Internal Assessment Resource German Level 1. Resource title: Planning a School Exchange

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Machine Translation and the Translator

AP GERMAN LANGUAGE AND CULTURE EXAM 2015 SCORING GUIDELINES

An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models

Comprendium Translator System Overview

GCE EXAMINERS' REPORTS

Lernen mit dem Power-Sprachkurs

GERMAN DIE SCHULE UND DIE KULTUR HOME LEARNING YEAR 7

Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013

Text Generation for Abstractive Summarization

Natural Language Database Interface for the Community Based Monitoring System *

Modalverben Theorie. learning target. rules. Aim of this section is to learn how to use modal verbs.

Integration of Content Optimization Software into the Machine Translation Workflow. Ben Gottesman Acrolinx

FÉDÉRATION INTERNATIONALE DE GYMNASTIQUE. 22 nd FIG TRAMPOLINE GYMNASTICS WORLD AGE GROUP COMPETITIONS Sofia (BUL) November 2013.

ATLAS.ti 5 HyperResearch 2.6 MAXqda The Ethnograph 5.08 QSR N 6 QSR NVivo. Media types: rich text. Editing of coded documents supported

AP WORLD LANGUAGE AND CULTURE EXAMS 2012 SCORING GUIDELINES

Extraction and Visualization of Protein-Protein Interactions from PubMed

Introduction. Philipp Koehn. 28 January 2016

FEDERATION INTERNATIONALE DE GYMNASTIQUE

Annotation Guidelines for Dutch-English Word Alignment

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

A Mixed Trigrams Approach for Context Sensitive Spell Checking

0525 GERMAN (FOREIGN LANGUAGE)

Interactive Dynamic Information Extraction

How to make Ontologies self-building from Wiki-Texts

FOR TEACHERS ONLY The University of the State of New York

Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia

Context Grammar and POS Tagging

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

Correlation: ELLIS. English language Learning and Instruction System. and the TOEFL. Test Of English as a Foreign Language

Exemplar for Internal Achievement Standard. German Level 1

Complex Predications in Argument Structure Alternations

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University

Is Part-of-Speech Tagging a Solved Task? An Evaluation of POS Taggers for the German Web as Corpus

Student Booklet. Name.. Form..

Scheme of Work: German - Form /16

German Language Support Package

Course Outline and Syllabus Instructor: address: Phone: Staff Assistant A blue folder Office Hours: Text Materials PF 3640.C64, 1999.

Search Engines Chapter 2 Architecture Felix Naumann

TIGERSearch 2.1 User's Manual

Scheme of Work: German - Form 3

Blended Learning for institutions

Course: German 1 Designated Six Weeks: Weeks 1 and 2. Assessment Vocabulary Instructional Strategies

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

Syntactic Transfer Using a Bilingual Lexicon

German II. Grades 9-12 CURRICULUM

Milk, bread and toothpaste : Adapting Data Mining techniques for the analysis of collocation at varying levels of discourse

Transcription:

Corpuslinguistics Annis 2 -Corpus query tool searching in Falko Marc Reznicek, Anke Lüdeling, Amir Zeldes, Hagen Hirschmann marc.reznicek@staff.hu-berlin.de... ánd many more members of the HU corpus team

Annis2 SFB 632 ANNotation of Information Structure (Dipper et al. 2004; Chiarcos et al. 2008; Zeldes et al. 2009) http://www.sfb632.uni-potsdam.de/d1/annis/ search engine for deeply annotated linguistic corpora token (-annotation) spans trees pointer node token token token POS POS POS edge token POS span span

web interface: query http://korpling.german.hu-berlin.de/falko-suche/search.html query window query checker & hit count selection of corpora (CTRL + click for multiple items) left and right context start query 2

web interface:hits partiture: learner text partiture TH1 more partitures 3

principie 1: variable-value pairs word = "System" variable1 ("word form") value word Sofern das System herrscht pos KOUS ART NN VVFIN lemma sofern d System herrschen 4

principie 1: variable-value pairs word = "System" the / this / that finds System (and nothing else) 5

principie 1: variable-value pairs pos = "NN" variable2 ("word form") value word Sofern das System herrscht pos KOUS ART NN VVFIN lemma sofern d System herrschen 6

principie 1: variable-value pairs pos = "NN" finds Riesen, Frauen, Student, giants, women, students 7

principie 1: variable-value pairs lemma = "System" variable3 ("lemma") value word Sofern das System herrscht pos KOUS ART NN VVFIN lemma sofern d System herrschen 8

principie 1: variable-value pairs lemma = "System" finds System, Systemen,Systems system.nom/dat/gen 9

searching strings Find all word forms "meinen" in FalkoEssayL2V2.3: word = "meinen" What do you find? 10

lemma "base forms" of words Find all forms of the verb "meinen": lemma = "meinen" problem: lemmatization is not always intuitive. example: lemma of sich 11

patterns (regular expressions) Annis allows pattern matching on all annotation layers for patterns use / / instead of " " Find all words including "mein" word = /.*mein.*/ 12

. patterns: Joker. an arbitrary character al. als, alt,..... two arbitrary characters al.. alle, alte, also... three arbitrary characters al alles, altes, alias,...

task What do you get for? word = /g.b./ 14

patterns:? and * + das? das* das+ the last character is optional f, s da, das the last character occurs 0 to times f, s, ss, da, das, dass, dasssssssss the last character occurs 1 to times s, ss, das, dass, dassssssssssss 15

task What happens, if you combine those operators? word = /Frau.?/ woman word = /Frau.*/ word = /Frau.+/ 16

task Try to find all word forms associated with base forms ending in -lang 17

task Try to find all word forms associated with base forms ending in -lang lemma = /.*lang/ hits e.g.: bislang lebenslang jahrelang 18

task Try to find all word forms associated with base forms starting in -lang lemma = /lang.+/ hits e.g.: lange langsam langweilig 19

alternatives: a or b = (a b) parentheses and ("or") allow for simultaneous search of different words: different forms: strings: substrings word = /(Mann Frau Kind)/ word = /(Mann Mannes)/ word=/bes(ser t).?/ 20

task Find all forms of the verb "meinen" in present tense below and only those. mein e mein st mein t mein en mein t mein en 21

solution word =/mein(s?t en?)/ or word =/mein(e st t en)/ Often there are different ways to get the same results 22

problem? word =/mein(s?t en?)/ or word =/mein(e st t en)/ 23

search for part-of-speech (POS) different types of POS-tag sets most German corpora use STTS ADJA ADV ART NN VVFIN attributives Adjektiv Adverb Artikel normales Nomen finites Verb http://www.ims.uni-stuttgart.de/projekte/corplex/tagsets/stts-table.html 24

Stuttgart-Tübingen-Tagset (STTS) ADJektive Noun Pronoun Verb ParTiKel KOnjunction ADJA NN PDS VVFIN PTKZU KOUI ADJD NE PDAT VVIMP PTKNEG KOUS PIS VVINF PTKVZ KON PIAT VVIZU PTKANT PIDAT VVPP PTKA PPER PPOSS PPOSAT PRELS PRELAT PRF PWS VAFIN VAIMP VAINF VAPP VMFIN VMINF VMPP PWAT PWAV

Stuttgart-Tübingen-Tagset (STTS) VERB full verb auxiliar verb modal verb finite VVFIN VAFIN VMFIN imperativ VVIMP VAIMP infinite VVINF VAINF VMINF Infinitive with zu VVIZU participle 2 VVPP VAPP VMPP

task Find all possessive pronouns pos =/PPOS(S AT)/ 27

principle II: relations single variable-value pairs are connected by & VV-pairs may not stay unrelated You refer to VV-pairs via # plus index. variable 1 = value 1 & variable 2 = value 2 & #1 relation #2 expression1: #1 expression 2: #2 28

principle II: relations variable 1 = value 1 & variable 2 = value 2 & #1 relation #2 expression 1: #1 expression 2: #2

30 search for order: e.g. noun following "zu" I am at home word = "zu" & pos = "NN" & #1.#2 Ich bin zu Hause PPER VAFIN APPR NN Satz Subj Adv

task: token order Find two successive adjectives attention: there are two types of adjectives ADJA & ADJD pos = /ADJ./ & pos = /ADJ./ & #1.#2 31

question How can I make sure that both adjectives belong to the same constituent? One adjective modifies the other? pos = /ADJ./ & pos = /ADJ./ & #1.#2 32

use target hypotheses differences between learner text (tok, ctok) and target hypotheses (ZH1,ZH2) are marked in (ZH1Diff, ZH2Diff). ZH1lemma weil sie ein Aspekt d Gesellschaft entdecken, ZH1Diff MOVS CHA CHA MOVT ZH1pos KOUS PPER ART NN ART NN VVPP $, ZH1 weil sie einen Aspekt der Gesellschaft entdeckt, tok weil sie entdeckt eine Aspekte der Gesellschaft, ZH1lemma wie d ander Frau ZH1Diff CHA ZH1pos KOKOM ART ADJA NN ZH1 wie die anderen Frauen tok wie die andere Frauen 33

edit tags ZHDiff INS DEL CHA MERGE SPLIT MOVS MOVT operation on target hypothesis token inserted token deleted token changed multiple tokens merged token split into multiple tokens token moved from here token moved here 34

task Find all reflexive pronouns missing in the learner text (ctok. ctokpos,ctoklemma) target hypotheses layers for TH1 are: ZH1, ZH1pos,ZH1lemma Differences are annotated on ZH1Diff, ZH1posDiff, ZH1lemmaDiff. ZH1pos="PRF" & ZH1Diff="INS"& #1_=_#2 35

task...all definite articles missing in the learner text. solution: ZH1lemma="d"& ZH1Diff="INS"& #1_=_#2 36

filter metadata text- & learner based metadata 37

filter metadata metadata: variables and values for text lerner SPK0: variable value 38

filter metadata metadata queries are formulated as meta::variable = "value" Find all word forms of lemma "Mann" written by females: sex. word="mann" & meta::sex="f" 39

filter metadata filter via L1 meta::reg=/l1:country code/ Find all form of the adjective "deutsch" in english texts: code= eng lemma="deutsch"& meta::reg=/l1:eng/ 40

filter metadata For a more specific selection of language acquisition orders, you need a.* between each VV-pair. meta::reg=/variable1:value1.*variable2:value2/ Find all word forms of "deutsch" in texts written by danish learners with L2 English before German. lemma="deutsch"& meta::reg=/l1:dan.*l2:eng.*l2:deu / 41

language codes in Falko (selection) niederländisch nor norwegisch pol polnisch rus russisch spa spanisch swe schwedisch tur türkisch ukr ukrainisch uzb usbekisch xho xhosa yid jiddisch zho zulu 42 afr afrikaans dan dänisch deu deutsch ell neugriechisch eng englisch fin finnisch fra französisch heb hebräisch hun ungarisch isl isländisch ita italienisch jpn japanisch lat lateinisch

example compare missing articles for spanish (spa) and danish (dan) speakers Important: how many articles are there in total? How many possible articles are missing? total amount of articles ZH1pos= "ART"& meta::reg=/l1:dan/ ZH1pos= "ART"& meta::reg=/l1:ita/ 43

Surpressing the value in a query includes all items annotated for a variable word counting tokens amount of tokens lemma amount of lemmas This way you can find the amount of tokens for a special L1 group word & meta::reg=/l1:ita/ 44

counting tokens How many tokens of japanese speakers are included in Falko? word & meta::reg=/l1:jpn/ 45

counting texts Each text starts with TXTstructure = "start " und ends mit TXTstructure = "end". How many texts of italian learners are included in Falko? TXTstructure="start" & meta::reg=/l1:ita/ 46

syntax (authentic) study question: Is acquisition of relatives independent of the grammatical function of the relative pronoun? hypotheses: 1. all syntactical functions for relatives are acquired simultaneously 2. Initially only prototypical functions for relatives are acquired. 47

most simple structure (subject) relative pronoun in subject position: POS=/V.*/ & POS=/PREL.*/ & #1 ->dep[deprel="subj"] #2 relative pronoun subject accusative object (=OBJA) or dative object (=OBJD) ZH1pos = /V.*/ & ZH1pos=/PREL.*/ & ZH1 & ZH1 & #1_=_#3 & #2_=_#4 & #3 ->dep[deprel="obja"] #4 dependencies relate only TH-layers directly 48

character sets[ ] With [ ] you can describe sets of allowed characters: [aeiou] simple vocals [A-Z] all capital letters between A and Z [0-9]+ - all hnumbers Beispiele: What kind of words do you find? lemma = /[A-Z][a-z]+-[A-Z][a-z]+/ 49

character sets[ ]:Umlaute [A-Za-zÄÖÜß] Umlaute special characters have to be mentioned expressively [A-Za-zÄÖÜäöüß0-9-] all charcters that can occur in a word lemma = /[A-ZÄÖÜ][a-zäöüß]+-[A-ZÄÖÜ][a-zäöüß]+/ 50

excluded character sets[^ ] [^ ] defines a set of characters which aren't allow to occur at the given position [^aeiouäöüaeiouäöü] no vocals [^äöüäöü] no Umlaute example: Find all word without "ß"! word=/[^ß]+/ 51

Search on multiple layers TOKEN TOKEN ANNOTATION SPAN 52

Search on multiple layers word Ich bin zu Hause pos PPER VAFIN APPR NN mode infostat accessible declarativ new 53

Search on multiple layers word = "zu" & pos = "APPR" & #1_=_#2 Ich bin zu Hause PPER VAFIN APPR NN declarativ accessible new expression 1: #1 54

Search on multiple layers word = "zu" & pos = "APPR" & #1_=_#2 Ich bin zu Hause PPER VAFIN APPR NN declarativ accessible new expression 2: #2 55

Search on multiple layers word = "zu" & pos = "APPR" & #1_=_#2 Ich bin zu Hause PPER VAFIN APPR NN declarativ accessible new Search for a token with the value "zu" for the variable "word", with the value "APPR" for the variable "pos". Both annotations shall refer to the same tokens. 56

Search on multiple layers word = "zu" & pos = "APPR" & istat ="new" & #1_=_#2 & #3_i_#1 Ich accessible bin zu Hause PPER VAFIN APPR NN declarativ #1 #3 new The token shall also be included in the span "new" on the infostatus layer. 57

summary - operators. arbitrary character * an arbitrary amount of the last character before + at least one instance of the character before? last character before is optional \ interpret next character as literary! not [abc] one element of the set [^abc] element except the ones in the set (a b) a or b a{2,3} 2 to 3 times "a" 58

summary relations operators on t oken relationa #1.#2 #1 directly followed by #2. #1.*#2 #1 indirectly followed by #2. #1_=_#2 #1 and #2 refer to the same token/span #1_i_#2 #1 is included in #2. 59

summary ANNIS ANNIS allows: (simultaneous) query of different corpora quantify results export results filter metadata 60

attention! corpora are always just samples of a language variety different corpora (may) give different results (sometimes) corpora include errors corpora can still help support or reject hypotheses 61

Thanks! Danke! 62

Literatur Lüdeling, Anke; Doolittle, Seanna; Hirschmann, Hagen; Schmidt, Karin; Walter, Maik (2008): Das Lernerkorpus Falko. In: Deutsch als Fremdsprache 45 (2), S. 67 73. Reznicek, Marc; Walter, Maik; Schmidt, Karin; Lüdeling, Anke; Hirschmann, Hagen; Krummes, Cedric; Andreas, Thorsten (2010): Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 1.0. Berlin: Institut für deutsche Sprache und Linguistik, Humboldt- Universität zu Berlin. URL: http://www.linguistik.huberlin.de/institut/professuren/korpuslinguistik/forschung/falko [Stand: 12. Oktober 2010]. Zeldes, Amir; Ritz, Julia; Lüdeling, Anke; Chiarcos, Christian (2009): ANNIS. A Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus Linguistics 2009, Liverpool, July 20-23, 2009. 63