Example-Based Treebank Querying. Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde

Size: px

Start display at page:

Download "Example-Based Treebank Querying. Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde"

Silas Weaver
10 years ago
Views:

1 Example-Based Treebank Querying Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde LREC 2012, Istanbul May 25, 2012

2 NEDERBOOMS Exploitation of Dutch treebanks for research in linguistics September 2010 February 2012 Frank Van Eynde (CCL) and Hans Smessaert (NGTG) Liesbeth Augustinus and Vincent Vandeghinste (CCL) CLARIN Project Goals User-friendly tools Access to large data files Fast and accurate

3 NEDERBOOMS How can we combine the data-oriented approach of treebank mining with the knowledge-oriented method of theoretical and descriptive linguistics?

4 OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research

5 LASSY Written texts Wikipedia, books, newspapers, reports, websites, law texts... ALPINO parser (van Noord) Automatic syntactic annotation, XML-trees LASSY small 1 million words, sentences Manually corrected LASSY large 1.5 billion words Not corrected

6 ALPINO Dit is een zin. >> ALPINO parser >> This is a sentence.

7 QUERYING LASSY Existing search tools: dtsearch (Kloosterman 2007) Dact (de Kok 2010) query language: XPath = standard query language for xml trees

8 QUERYING LASSY Some examples: look for all NP nodes in which the head noun is modified by the adjective politiek, e.g. politieke discussies and and look for verb clusters in which a separable verb particle occurs between two verb forms, e.g. Hij zegt dat ze spoedig zullen kennis maken //node[node[@rel="hd" <../node[@rel="vc"]/node[@rel="svp" and node[@rel="vc" and node[@rel="svp" <../node[@rel="hd"

9 QUERYING LASSY XPath Not user-friendly Knowledge of Alpino grammar necessary = problematic for non-technical linguists Verify theory through data with corpus or treebank examples Time consuming, requires some effort How to make interaction between computational linguistics and theoretical linguistics possible?

10 OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research

GrETEL Greedy Extraction of Trees for Empirical Linguistics Search tool based on example sentences Input = natural language No explicit knowledge of formal query language

11 GrETEL Greedy Extraction of Trees for Empirical Linguistics Search tool based on example sentences Input = natural language No explicit knowledge of formal query language nor Alpino grammar required Bridge gap between descriptive and computational linguistics Available online (optimised for Mozilla Firefox)

12 GrETEL architecture

13 GrETEL - online

14 GrETEL - online

15 GrETEL - input Green versus red word order in Dutch green: past participle auxiliary De NAVO stelt dat ze er alles aan gedaan heeft red: auxiliary past participle De NAVO stelt dat ze er alles aan heeft gedaan The NATO claim that they have done everything in their power (deredactie.be)

16 GrETEL - input >> parsed with Alpino

17 GrETEL - annotation >> info added to Alpino parse

18 GrETEL query info input example Alpino parse query tree treebanks XPath query

19 GrETEL query tree

20 GrETEL query tree >> subtree extraction

21 GrETEL query tree query tree >> XPath generator >> XPath expression and and and

22 GrETEL treebanks LASSY Small in PostgreSQL database >> no local installation required

23 GrETEL results >> (adapted) query >> quantitative information

24 GrETEL results >> tree viewer >> list of results

25 GrETEL results Zodra de twee mannen hun financiële eisen bekend hadden gemaakt, werden ze in de boeien geslagen. Two men were captured as soon as they had announced their financial demands. (dpc-ind nl-sen.p.22.s.2) Een van de lessen die De Pouzilhac in zijn carrière in de reclame geleerd heeft, is dat research levensbelangrijk is. The importance of research is one of the lessons that De Pouzilhac has learned during his marketing career. (dpc-ind nl-sen.p.10.s.1) Both red and green word order in search results What if one is merely interested in green word order?

26 GrETEL word order What if one is merely interested in green word order? >> select the ordering filter XPath: and and and and

27 OUTLINE The Nederbooms project Querying LASSY Existing tool & querying issues GrETEL Conclusion and future research

28 CONCLUSION Nederbooms: Treebank mining for empirical linguistics Search tool for descriptive linguistics: GrETEL LASSY Treebank User-friendly: input = natural language XPath generator > knowledge of formal query language not required ordering filter Output = sample of similar sentences

29 FUTURE RESEARCH Speed up search query LASSY large (1.5 billion words) Varro (Martens 2010) Include CGN Automatically look for closely-related examples (more or less specific) CLAM webservice

30 Thanks for your attention! Questions?

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH Gertjan van Noord Deliverable 3-4: Report Annotation of Lassy Small 1 1 Background Lassy Small is the Lassy corpus in which the syntactic annotations