Example-Based Treebank Querying. Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde

Similar documents
LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

A chart generator for the Dutch Alpino grammar

Shallow Parsing with Apache UIMA

SAND: Relation between the Database and Printed Maps

The University of Amsterdam s Question Answering System at QA@CLEF 2007

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

Interactive Dynamic Information Extraction

Text Generation for Abstractive Summarization

Example-based Translation without Parallel Corpora: First experiments on a prototype

Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU

CLARIN project DiscAn :

Development of a Dependency Treebank for Russian and its Possible Applications in NLP

Annotation Guidelines for Dutch-English Word Alignment

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1

- Use of vocabulary and grammar in small conversation.

How To Write A Sentence In Germany

From D-Coi to SoNaR: A reference corpus for Dutch

Clustering Connectionist and Statistical Language Processing

Course description Course title: Dutch Language I: Introduction Course code: EN-IN-DLID Domein: Bewegen & Educatie > Education Objectives

Tools and resources for Tree Adjoining Grammars

Pronouns: A case of production-before-comprehension

Natural Language to Relational Query by Using Parsing Compiler

Cultural Trends and language change

Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services

INPUTLOG 6.0 a research tool for logging and analyzing writing process data. Linguistic analysis. Linguistic analysis

The Specific Text Analysis Tasks at the Beginning of MDA Life Cycle

Dutch-Flemish Research Programme for Dutch Language and Speech Technology. stevin programme. project results

DEPENDENCY PARSING JOAKIM NIVRE

Ling 201 Syntax 1. Jirka Hana April 10, 2006

Language Interface for an XML. Constructing a Generic Natural. Database. Rohit Paravastu

Travelling features: Multiple sources, multiple destinations

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

A collaborative platform for knowledge management

Hybrid Strategies. for better products and shorter time-to-market

The PALAVRAS parser and its Linguateca applications - a mutually productive relationship

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

Cambridge International Examinations Cambridge International General Certificate of Secondary Education

Automatic Text Analysis Using Drupal

Statistical Machine Translation

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Degrees of adverbialization

Natural Language Database Interface for the Community Based Monitoring System *

TAAL THUIS DUTCH FOR BEGINNERS.

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems

Outline of today s lecture

Chapter 8. Final Results on Dutch Senseval-2 Test Data

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

Monitoring system for OpenPBS environment

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Query optimization. DBMS Architecture. Query optimizer. Query optimizer.

Technologies for a CERIF XML based CRIS

Managing large sound databases using Mpeg7

Why and How to Differentiate Complement Raising from Subject Raising in Dutch

Learning Translation Rules from Bilingual English Filipino Corpus

Applying quantitative methods to dialect Dutch verb clusters

Artificial Intelligence Exam DT2001 / DT2006 Ordinarie tentamen

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Timeline (1) Text Mining Master TKI. Timeline (2) Timeline (3) Overview. What is Text Mining?

Effective Self-Training for Parsing

Transition-Based Dependency Parsing with Long Distance Collocations

Natural answer presentation. through revision. of syntactic patterns

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

Complement Raising, Extraction and Adpostion Stranding in Dutch

Language Model of Parsing and Decoding

How To Distinguish Between Extract From Extraposition From Extract

Search and Information Retrieval

GMP-Z Annex 15: Kwalificatie en validatie

Syntax: Phrases. 1. The phrase

Evalita 09 Parsing Task: constituency parsers and the Penn format for Italian

Transcription:

Example-Based Treebank Querying Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde LREC 2012, Istanbul May 25, 2012

NEDERBOOMS Exploitation of Dutch treebanks for research in linguistics September 2010 February 2012 Frank Van Eynde (CCL) and Hans Smessaert (NGTG) Liesbeth Augustinus and Vincent Vandeghinste (CCL) CLARIN Project Goals User-friendly tools Access to large data files Fast and accurate

NEDERBOOMS How can we combine the data-oriented approach of treebank mining with the knowledge-oriented method of theoretical and descriptive linguistics?

OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research

LASSY Written texts Wikipedia, books, newspapers, reports, websites, law texts... ALPINO parser (van Noord) Automatic syntactic annotation, XML-trees LASSY small 1 million words, 65200 sentences Manually corrected LASSY large 1.5 billion words Not corrected

ALPINO Dit is een zin. >> ALPINO parser >> This is a sentence.

QUERYING LASSY Existing search tools: dtsearch (Kloosterman 2007) Dact (de Kok 2010) query language: XPath = standard query language for xml trees

QUERYING LASSY Some examples: look for all NP nodes in which the head noun is modified by the adjective politiek, e.g. politieke discussies //node[@cat="np" and node[@rel="mod" and @pos="adj" and @root="politiek"] and node[@rel="hd" and @pos="noun"] look for verb clusters in which a separable verb particle occurs between two verb forms, e.g. Hij zegt dat ze spoedig zullen kennis maken //node[node[@rel="hd" and @pos="verb" and @begin <../node[@rel="vc"]/node[@rel="svp" and @pos="part"]/@begin] and node[@rel="vc" and node[@rel="svp" and @begin <../node[@rel="hd" and @pos="verb"]/@begin]]]

QUERYING LASSY XPath Not user-friendly Knowledge of Alpino grammar necessary = problematic for non-technical linguists Verify theory through data with corpus or treebank examples Time consuming, requires some effort How to make interaction between computational linguistics and theoretical linguistics possible?

OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research

GrETEL Greedy Extraction of Trees for Empirical Linguistics Search tool based on example sentences Input = natural language No explicit knowledge of formal query language nor Alpino grammar required Bridge gap between descriptive and computational linguistics Available online http://nederbooms.ccl.kuleuven.be (optimised for Mozilla Firefox)

GrETEL architecture

GrETEL - online

GrETEL - online

GrETEL - input Green versus red word order in Dutch green: past participle auxiliary De NAVO stelt dat ze er alles aan gedaan heeft red: auxiliary past participle De NAVO stelt dat ze er alles aan heeft gedaan The NATO claim that they have done everything in their power (deredactie.be)

GrETEL - input >> parsed with Alpino

GrETEL - annotation >> info added to Alpino parse

GrETEL query info input example Alpino parse query tree treebanks XPath query

GrETEL query tree

GrETEL query tree >> subtree extraction

GrETEL query tree query tree >> XPath generator >> XPath expression //node[@cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pos="verb"]] and node[@rel="hd" and @pos="verb" and @root="heb"]]

GrETEL treebanks LASSY Small in PostgreSQL database >> no local installation required

GrETEL results >> (adapted) query >> quantitative information

GrETEL results >> tree viewer >> list of results

GrETEL results Zodra de twee mannen hun financiële eisen bekend hadden gemaakt, werden ze in de boeien geslagen. Two men were captured as soon as they had announced their financial demands. (dpc-ind-001652-nl-sen.p.22.s.2) Een van de lessen die De Pouzilhac in zijn carrière in de reclame geleerd heeft, is dat research levensbelangrijk is. The importance of research is one of the lessons that De Pouzilhac has learned during his marketing career. (dpc-ind-001649-nl-sen.p.10.s.1) Both red and green word order in search results What if one is merely interested in green word order?

GrETEL word order What if one is merely interested in green word order? >> select the ordering filter XPath: //node[@cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pos="verb" and @begin <../../node[@rel="hd" and @pos="verb" and @root="heb"]/@begin]] and node[@rel="hd" and @pos="verb" and @root="heb"]]

OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research

CONCLUSION Nederbooms: Treebank mining for empirical linguistics Search tool for descriptive linguistics: GrETEL LASSY Treebank User-friendly: input = natural language XPath generator > knowledge of formal query language not required ordering filter Output = sample of similar sentences

FUTURE RESEARCH Speed up search query LASSY large (1.5 billion words) Varro (Martens 2010) Include CGN Automatically look for closely-related examples (more or less specific) CLAM webservice

Thanks for your attention! Questions? liesbeth@ccl.kuleuven.be http://nederbooms.ccl.kuleuven.be