Example-Based Treebank Querying
Liesbeth Augustinus, Vincent Vandeghinste, Frank Van Eynde
LREC 2012, Istanbul, May 25, 2012
NEDERBOOMS
Exploitation of Dutch treebanks for research in linguistics
September 2010 to February 2012
Frank Van Eynde (CCL) and Hans Smessaert (NGTG)
Liesbeth Augustinus and Vincent Vandeghinste (CCL)
CLARIN project
Goals:
  User-friendly tools
  Access to large data files
  Fast and accurate
NEDERBOOMS How can we combine the data-oriented approach of treebank mining with the knowledge-oriented method of theoretical and descriptive linguistics?
OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research
LASSY
Written texts: Wikipedia, books, newspapers, reports, websites, law texts...
ALPINO parser (van Noord): automatic syntactic annotation, XML trees
LASSY Small: 1 million words, 65,200 sentences, manually corrected
LASSY Large: 1.5 billion words, not corrected
ALPINO
Dit is een zin. 'This is a sentence.' >> ALPINO parser >> syntax tree (XML)
QUERYING LASSY
Existing search tools:
  dtsearch (Kloosterman 2007)
  Dact (de Kok 2010)
Query language: XPath, the standard query language for XML trees
QUERYING LASSY
Some examples:
Look for all NP nodes in which the head noun is modified by the adjective politiek, e.g. politieke discussies
  //node[@cat="np" and node[@rel="mod" and @pos="adj" and @root="politiek"] and node[@rel="hd" and @pos="noun"]]
Look for verb clusters in which a separable verb particle occurs between two verb forms, e.g. Hij zegt dat ze spoedig zullen kennis maken
  //node[node[@rel="hd" and @pos="verb" and @begin < ../node[@rel="vc"]/node[@rel="svp" and @pos="part"]/@begin] and node[@rel="vc" and node[@rel="svp" and @begin < ../node[@rel="hd" and @pos="verb"]/@begin]]]
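To make the first query concrete, here is a minimal Python sketch of the conditions it checks. The toy tree and its attribute values are assumptions made for illustration, not actual LASSY data. The conditions are written out as plain Python because the standard library's ElementTree supports only a subset of XPath, so the query itself is not passed to it.

```python
import xml.etree.ElementTree as ET

# Toy Alpino-style tree for "politieke discussies"; the attribute values are
# illustrative assumptions, not actual LASSY data.
SAMPLE = """
<alpino_ds>
  <node cat="top" rel="top" begin="0" end="2">
    <node cat="np" rel="--" begin="0" end="2">
      <node rel="mod" pos="adj" root="politiek" word="politieke" begin="0" end="1"/>
      <node rel="hd" pos="noun" root="discussie" word="discussies" begin="1" end="2"/>
    </node>
  </node>
</alpino_ds>
"""


def has_child(node, **attrs):
    """True if some direct child <node> carries all the given attribute values."""
    return any(
        all(child.get(key) == value for key, value in attrs.items())
        for child in node.findall("node")
    )


def find_political_nps(root):
    """Mirror the first query: NPs whose noun head is modified by 'politiek'."""
    return [
        n for n in root.iter("node")
        if n.get("cat") == "np"
        and has_child(n, rel="mod", pos="adj", root="politiek")
        and has_child(n, rel="hd", pos="noun")
    ]


tree = ET.fromstring(SAMPLE)
matches = find_political_nps(tree)
print([[c.get("word") for c in m.findall("node")] for m in matches])
```

On the toy tree this prints the single matching NP, [['politieke', 'discussies']]. Even this small case requires knowing the attribute names rel, pos, cat and root.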
QUERYING LASSY
XPath:
  Not user-friendly
  Knowledge of Alpino grammar necessary
  = problematic for non-technical linguists
Verifying theory through data with corpus or treebank examples:
  Time-consuming, requires some effort
How to make interaction between computational linguistics and theoretical linguistics possible?
GrETEL
Greedy Extraction of Trees for Empirical Linguistics
Search tool based on example sentences
Input = natural language
No explicit knowledge of a formal query language or the Alpino grammar required
Bridges the gap between descriptive and computational linguistics
Available online: http://nederbooms.ccl.kuleuven.be (optimised for Mozilla Firefox)
GrETEL architecture
GrETEL - online
GrETEL - input
Green versus red word order in Dutch
  green: past participle + auxiliary
    De NAVO stelt dat ze er alles aan gedaan heeft
  red: auxiliary + past participle
    De NAVO stelt dat ze er alles aan heeft gedaan
  'NATO claims that it has done everything in its power' (deredactie.be)
GrETEL - input >> parsed with Alpino
GrETEL - annotation >> info added to Alpino parse
GrETEL query info: input example, Alpino parse, query tree, treebanks, XPath query
GrETEL query tree
GrETEL query tree >> subtree extraction
GrETEL query tree
query tree >> XPath generator >> XPath expression
  //node[@cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pos="verb"]] and node[@rel="hd" and @pos="verb" and @root="heb"]]
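The step from query tree to XPath expression can be sketched as a recursive walk over the tree. This is a minimal illustration, not GrETEL's actual generator. The QUERY_TREE literal, the ATTR_ORDER tuple and the function names are assumptions introduced for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical query tree: the part of the Alpino parse kept after subtree
# extraction, holding only the attributes the query should match on.
QUERY_TREE = """
<node cat="ssub">
  <node rel="vc" cat="ppart">
    <node rel="hd" pos="verb"/>
  </node>
  <node rel="hd" pos="verb" root="heb"/>
</node>
"""

# Order in which attributes appear in each predicate (an assumption of this sketch).
ATTR_ORDER = ("rel", "cat", "pos", "root")


def predicate(node):
    """Build the condition list for one query-tree node, recursing into children."""
    conditions = [
        '@{}="{}"'.format(attr, node.get(attr))
        for attr in ATTR_ORDER
        if node.get(attr) is not None
    ]
    conditions += ["node[{}]".format(predicate(child)) for child in node.findall("node")]
    return " and ".join(conditions)


def to_xpath(query_root):
    """Wrap the root predicate so the pattern may match anywhere in a tree."""
    return "//node[{}]".format(predicate(query_root))


xpath = to_xpath(ET.fromstring(QUERY_TREE))
print(xpath)
```

For this query tree the sketch prints the same expression as shown on the slide. Each nested node becomes a nested node[...] predicate, so the query constrains dominance relations but not word order.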
GrETEL treebanks LASSY Small in PostgreSQL database >> no local installation required
GrETEL results >> (adapted) query >> quantitative information
GrETEL results >> tree viewer >> list of results
GrETEL results
Zodra de twee mannen hun financiële eisen bekend hadden gemaakt, werden ze in de boeien geslagen.
  'Two men were captured as soon as they had announced their financial demands.' (dpc-ind-001652-nl-sen.p.22.s.2)
Een van de lessen die De Pouzilhac in zijn carrière in de reclame geleerd heeft, is dat research levensbelangrijk is.
  'The importance of research is one of the lessons that De Pouzilhac has learned during his marketing career.' (dpc-ind-001649-nl-sen.p.10.s.1)
Both red and green word order in search results
What if one is merely interested in green word order?
GrETEL word order
What if one is merely interested in green word order?
>> select the ordering filter
XPath:
  //node[@cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pos="verb" and @begin < ../../node[@rel="hd" and @pos="verb" and @root="heb"]/@begin]] and node[@rel="hd" and @pos="verb" and @root="heb"]]
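The comparison that the ordering filter adds can be illustrated with a small Python sketch. It classifies a subordinate clause as green or red by comparing the begin positions of the participle and the auxiliary. The ssub helper and the token positions are assumptions for the example, with positions counted from 0 in the two NAVO sentences.

```python
import xml.etree.ElementTree as ET


def ssub(participle_begin, auxiliary_begin):
    """Build a toy ssub node holding only the attributes the filter inspects."""
    return ET.fromstring(
        '<node cat="ssub">'
        '<node rel="vc" cat="ppart">'
        '<node rel="hd" pos="verb" begin="{}"/>'
        '</node>'
        '<node rel="hd" pos="verb" root="heb" begin="{}"/>'
        '</node>'.format(participle_begin, auxiliary_begin)
    )


def word_order(clause):
    """Return 'green' if the participle precedes the auxiliary, else 'red'."""
    participle = clause.find("node[@rel='vc']/node[@rel='hd']")
    auxiliary = clause.find("node[@rel='hd']")  # direct child only: the auxiliary
    # begin is stored as a string, so convert before comparing numerically.
    if int(participle.get("begin")) < int(auxiliary.get("begin")):
        return "green"
    return "red"


# "... aan gedaan heeft": gedaan at position 8, heeft at position 9
print(word_order(ssub(8, 9)))
# "... aan heeft gedaan": heeft at position 8, gedaan at position 9
print(word_order(ssub(9, 8)))
```

The first call reports green and the second red, which is the same distinction the added @begin comparison draws in the XPath expression.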
CONCLUSION
Nederbooms: treebank mining for empirical linguistics
Search tool for descriptive linguistics: GrETEL
LASSY treebank
User-friendly: input = natural language
XPath generator > knowledge of formal query language not required
Ordering filter
Output = sample of similar sentences
FUTURE RESEARCH
Speed up search query
LASSY Large (1.5 billion words)
Varro (Martens 2010)
Include CGN
Automatically look for closely related examples (more or less specific)
CLAM webservice
Thanks for your attention! Questions?
liesbeth@ccl.kuleuven.be
http://nederbooms.ccl.kuleuven.be