Example-Based Treebank Querying. Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde
Transcription
1 Example-Based Treebank Querying. Liesbeth Augustinus, Vincent Vandeghinste, Frank Van Eynde. LREC 2012, Istanbul, May 25, 2012
2 NEDERBOOMS Exploitation of Dutch treebanks for research in linguistics. September 2010 to February 2012. Frank Van Eynde (CCL) and Hans Smessaert (NGTG); Liesbeth Augustinus and Vincent Vandeghinste (CCL). CLARIN project. Goals: user-friendly tools; access to large data files; fast and accurate
3 NEDERBOOMS How can we combine the data-oriented approach of treebank mining with the knowledge-oriented method of theoretical and descriptive linguistics?
4 OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research
5 LASSY Written texts Wikipedia, books, newspapers, reports, websites, law texts... ALPINO parser (van Noord) Automatic syntactic annotation, XML-trees LASSY small 1 million words, sentences Manually corrected LASSY large 1.5 billion words Not corrected
6 ALPINO Dit is een zin. >> ALPINO parser >> This is a sentence.
7 QUERYING LASSY Existing search tools: dtsearch (Kloosterman 2007), Dact (de Kok 2010). Query language: XPath = standard query language for XML trees
8 QUERYING LASSY Some examples: (1) look for all NP nodes in which the head noun is modified by the adjective politiek, e.g. politieke discussies; (2) look for verb clusters in which a separable verb particle occurs between two verb forms, e.g. Hij zegt dat ze spoedig zullen kennis maken. XPath for (2), only partly preserved: //node[node[@rel="hd" <../node[@rel="vc"]/node[@rel="svp" and node[@rel="vc" and node[@rel="svp" <../node[@rel="hd"
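The first example query can be illustrated with a minimal sketch. It assumes an Alpino-style XML tree in which every constituent is a `node` element carrying attributes such as `cat`, `rel`, `pt`, `root` and `word`; the toy tree, the function name `find_nps_modified_by` and the XPath in the comment are illustrative assumptions, not the query shown on the slide. The filtering is done in plain Python because the standard-library `xml.etree.ElementTree` only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Toy Alpino-style parse of "politieke discussies" (hand-written for illustration).
SAMPLE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="np" rel="--">
      <node rel="mod" pt="adj" root="politiek" word="politieke" begin="0" end="1"/>
      <node rel="hd" pt="n" root="discussie" word="discussies" begin="1" end="2"/>
    </node>
  </node>
</alpino_ds>
"""

def find_nps_modified_by(tree_root, adjective_root):
    """Return NP nodes that have a noun head and a modifier with the given adjective root.

    Roughly what a full XPath engine would do for a query of this (assumed) shape:
    //node[@cat="np" and node[@rel="hd" and @pt="n"]
           and node[@rel="mod" and @pt="adj" and @root="politiek"]]
    """
    matches = []
    for np in tree_root.iter("node"):
        if np.get("cat") != "np":
            continue
        children = list(np)
        has_head_noun = any(
            c.get("rel") == "hd" and c.get("pt") == "n" for c in children
        )
        has_adjective = any(
            c.get("rel") == "mod"
            and c.get("pt") == "adj"
            and c.get("root") == adjective_root
            for c in children
        )
        if has_head_noun and has_adjective:
            matches.append(np)
    return matches

root = ET.fromstring(SAMPLE)
hits = find_nps_modified_by(root, "politiek")
```

Even for this simple pattern the user has to know the attribute names and relation labels of the Alpino annotation, which is the usability problem the following slides address.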
9 QUERYING LASSY XPath Not user-friendly Knowledge of Alpino grammar necessary = problematic for non-technical linguists Verify theory through data with corpus or treebank examples Time consuming, requires some effort How to make interaction between computational linguistics and theoretical linguistics possible?
10 OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research
11 GrETEL Greedy Extraction of Trees for Empirical Linguistics Search tool based on example sentences Input = natural language No explicit knowledge of formal query language nor Alpino grammar required Bridge gap between descriptive and computational linguistics Available online (optimised for Mozilla Firefox)
12 GrETEL architecture
13 GrETEL - online
14 GrETEL - online
15 GrETEL - input Green versus red word order in Dutch. Green: past participle before auxiliary, e.g. De NAVO stelt dat ze er alles aan gedaan heeft. Red: auxiliary before past participle, e.g. De NAVO stelt dat ze er alles aan heeft gedaan. "NATO claims that it has done everything in its power" (deredactie.be)
16 GrETEL - input >> parsed with Alpino
17 GrETEL - annotation >> info added to Alpino parse
18 GrETEL query info input example Alpino parse query tree treebanks XPath query
19 GrETEL query tree
20 GrETEL query tree >> subtree extraction
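Subtree extraction can be sketched as pruning the parse down to the branches that reach the words the user marked as relevant. This is a simplified illustration under stated assumptions, not the GrETEL implementation: the function name `extract_subtree`, the toy parse and the choice of retained attributes (`cat`, `rel`, `pt`) are hypothetical.

```python
import xml.etree.ElementTree as ET

def extract_subtree(node, selected_words, keep=("cat", "rel", "pt")):
    """Return a pruned copy of `node` holding only branches that reach a selected word.

    Lexical attributes such as `word` are dropped from the copy so that the
    resulting query tree describes a construction rather than one sentence.
    """
    kept_children = [
        pruned
        for child in node
        if (pruned := extract_subtree(child, selected_words, keep)) is not None
    ]
    is_selected_leaf = len(node) == 0 and node.get("word") in selected_words
    if not kept_children and not is_selected_leaf:
        return None
    copy = ET.Element("node", {k: v for k, v in node.attrib.items() if k in keep})
    copy.extend(kept_children)
    return copy

# Toy parse of the subordinate clause "ze ... alles ... gedaan heeft" (hand-written).
parse = ET.fromstring(
    '<node cat="ssub" rel="body">'
    '<node rel="su" pt="vnw" word="ze"/>'
    '<node rel="hd" pt="ww" word="heeft"/>'
    '<node rel="vc" cat="ppart">'
    '<node rel="obj1" pt="vnw" word="alles"/>'
    '<node rel="hd" pt="ww" word="gedaan"/>'
    '</node>'
    '</node>'
)

query_tree = extract_subtree(parse, {"heeft", "gedaan"})
```

Here the subject and object branches are discarded, leaving the clause node, the auxiliary, the participial phrase and the participle.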
21 GrETEL query tree: query tree >> XPath generator >> XPath expression
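The step from query tree to XPath expression can be sketched as a recursive traversal that turns each node's attributes and children into conditions joined by `and`. This is a minimal sketch, not the generator used in GrETEL; the function name `to_xpath`, the retained attributes and the example query tree are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def to_xpath(node, keep=("cat", "rel", "pt")):
    """Build an XPath step for a query-tree node, nesting one predicate per child."""
    conditions = [
        f'@{name}="{node.get(name)}"' for name in keep if node.get(name) is not None
    ]
    conditions += [to_xpath(child, keep) for child in node]
    if not conditions:
        return "node"
    return "node[" + " and ".join(conditions) + "]"

# Hypothetical query tree: a subordinate clause with a finite verb head
# and a participial verbal complement.
query_tree = ET.fromstring(
    '<node cat="ssub">'
    '<node rel="hd" pt="ww"/>'
    '<node rel="vc" cat="ppart"><node rel="hd" pt="ww"/></node>'
    '</node>'
)

xpath = "//" + to_xpath(query_tree)
print(xpath)
```

Because sibling conditions are joined with `and` and carry no positional constraints, an expression generated this way matches the construction regardless of the order in which the children appear.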
22 GrETEL treebanks LASSY Small in PostgreSQL database >> no local installation required
23 GrETEL results >> (adapted) query >> quantitative information
24 GrETEL results >> tree viewer >> list of results
25 GrETEL results Zodra de twee mannen hun financiële eisen bekend hadden gemaakt, werden ze in de boeien geslagen. Two men were captured as soon as they had announced their financial demands. (dpc-ind nl-sen.p.22.s.2) Een van de lessen die De Pouzilhac in zijn carrière in de reclame geleerd heeft, is dat research levensbelangrijk is. The importance of research is one of the lessons that De Pouzilhac has learned during his marketing career. (dpc-ind nl-sen.p.10.s.1) Both red and green word order in search results What if one is merely interested in green word order?
26 GrETEL word order What if one is merely interested in green word order? >> select the ordering filter
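One way to picture an ordering filter is as a comparison of word positions inside each matching clause. The sketch below assumes Alpino-style `begin` attributes giving each word's position in the sentence; the helper names `make_clause` and `cluster_order` are hypothetical, and the actual filter in GrETEL is expressed in XPath rather than Python.

```python
import xml.etree.ElementTree as ET

def make_clause(aux_begin, participle_begin):
    """Build a toy clause with an auxiliary head and a participle inside a verbal complement."""
    clause = ET.Element("node", cat="ssub")
    ET.SubElement(clause, "node", rel="hd", pt="ww", word="heeft", begin=str(aux_begin))
    vc = ET.SubElement(clause, "node", rel="vc", cat="ppart")
    ET.SubElement(
        vc, "node", rel="hd", pt="ww", word="gedaan", begin=str(participle_begin)
    )
    return clause

def cluster_order(clause):
    """Classify a clause as 'green' (participle first) or 'red' (auxiliary first)."""
    aux = next(c for c in clause if c.get("rel") == "hd")
    vc = next(c for c in clause if c.get("rel") == "vc")
    participle = next(c for c in vc if c.get("rel") == "hd")
    if int(participle.get("begin")) < int(aux.get("begin")):
        return "green"
    return "red"

# "... dat ze er alles aan gedaan heeft": gedaan at position 8, heeft at 9.
green_clause = make_clause(aux_begin=9, participle_begin=8)
# "... dat ze er alles aan heeft gedaan": heeft at position 8, gedaan at 9.
red_clause = make_clause(aux_begin=8, participle_begin=9)

orders = [cluster_order(green_clause), cluster_order(red_clause)]
```

Keeping only the clauses classified as green gives the restricted result set the slide asks for.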
27 OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research
28 CONCLUSION Nederbooms: Treebank mining for empirical linguistics Search tool for descriptive linguistics: GrETEL LASSY Treebank User-friendly: input = natural language XPath generator > knowledge of formal query language not required ordering filter Output = sample of similar sentences
29 FUTURE RESEARCH Speed up search query LASSY large (1.5 billion words) Varro (Martens 2010) Include CGN Automatically look for closely-related examples (more or less specific) CLAM webservice
30 Thanks for your attention! Questions?
