Treebank Search with Tree Automata
MonaSearch: Querying Linguistic Treebanks with Monadic Second-Order Logic
Authors: H. Maryns, S. Kepser
Speaker: Stephanie Ehrbächer
July 31st
Content
- Treebanks
- Query language: Monadic Second-Order Logic
- Query tool: MonaSearch
Part 1: Motivation and Treebanks
Treebanks
Recall part of our topic: corpus search.
- A corpus (or text corpus), in linguistics, is a large structured set of texts, usually stored and processed electronically.
What is a treebank?
- a text corpus
- each sentence is parsed, i.e. annotated with its syntactic structure
- the syntactic structure is represented as a tree
- treebank = parsed corpus
Treebanks can be created
- completely manually: each sentence is annotated with its syntactic structure by linguists
- semi-automatically: the syntactic structure is assigned by a parser, checked by a linguist and corrected if necessary
Two main groups can be distinguished:
- treebanks that annotate phrase structure (e.g. the Penn Treebank)
- treebanks that annotate dependency structure (e.g. the Prague Dependency Treebank)
Examples for Treebanks
- TIGER treebank (http://www.ims.uni-stuttgart.de/projekte/tiger)
- NEGRA (http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus)
- TüBa-D/S, the Tuebingen Treebank of Spoken German (http://www.sfs.uni-tuebingen.de/en_tuebads.shtml)
- TüBa-D/Z, the Tuebingen Treebank of Written German (http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml)
- Penn Treebank (http://www.cis.upenn.edu/~treebank)
- English Dependency Treebank (http://www.cis.upenn.edu/~creswell/dependency/)
- British Component of the International Corpus of English (ICE-GB; http://www.ucl.ac.uk/english-usage/projects/ice-gb)
Purpose of Treebanks
Treebanks can be used
- to investigate linguistic theories
- to study syntactic phenomena
- for training or testing parsers
Purpose of Treebanks
Example problem: we want to investigate how often the order verb/subject occurs in German or English sentences (yes/no questions).
- What do we do? We search a treebank.
- But what do we have to keep in mind?
Treebanks
- Treebanks can be large: treebanks with several tens of thousands of trees are no exception.
- Searching them manually is not feasible, so a query tool is necessary.
- The query tool needs high expressive power, so that queries can be precise and answer sets stay small.
Overview
- MonaSearch: a query tool for linguistic treebanks
- Query language: monadic second-order logic (MSO)
- Queries are compiled into tree automata (TA): MSO query -> TA -> treebank
- Each tree of the treebank is checked for whether the TA of the query accepts it
Part 2: Query Language: Monadic Second-Order Logic (MSO)
Monadic Second Order Logic MSO
- Used here as the query language
- Decidable over trees
- Extension of first-order predicate logic by set variables
- Set variables can be quantified over; they represent (finite) sets of nodes
Example with a set variable P:
- $\exists P\,(\forall x\,(x \notin P))$: There is an empty set.
- $\forall P\,(\exists x\,(x \in P))$: There is no empty set.
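To make quantification over set variables concrete, the two example sentences can be evaluated by brute force over a small finite universe. This is only an illustrative sketch: the three-element universe and the helper name `subsets` are assumptions made for the example, and this is not how Mona evaluates formulas.

```python
from itertools import combinations

def subsets(universe):
    """Yield every subset of a finite universe (the range of a set variable)."""
    items = list(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

universe = {0, 1, 2}  # hypothetical domain of three nodes

# "There is an empty set": exists P. forall x. x not in P
exists_empty = any(all(x not in P for x in universe) for P in subsets(universe))

# "There is no empty set": forall P. exists x. x in P
no_empty = all(any(x in P for x in universe) for P in subsets(universe))

print(exists_empty, no_empty)  # True False
```

The first sentence holds because the empty set is among the candidate values of P; the second fails for the same reason.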
MSO Examples
- For all predicates it holds that the predicate holds for Socrates or the predicate does not hold for Socrates:
  $\forall P\,((\mathit{Socrates} \in P) \vee (\mathit{Socrates} \notin P))$
- Peano's axiom of induction for natural numbers: for all predicates P of arity 1, if P holds for 0 and if, whenever P holds for an element x, it also holds for the successor x' of x, then P holds for all natural numbers:
  $\forall P\,\big(((0 \in P) \wedge \forall x\,((x \in P) \rightarrow (x' \in P))) \rightarrow \forall x\,(x \in P)\big)$
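Monadic Second Order Logic MSO
- Monadic: remember that the arity of the quantified predicates is 1.
- Counterexample: a formula that quantifies over a ternary predicate P, with atoms of the form P(x, y, z), is second-order but not monadic.
Why MSO and not simply FOL?
- Necessary expressive power
- Example: the ability to express the transitive closure of any binary relation that is definable in this language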
Monadic Second Order Logic MSO
Queries for sentences ending in a sequence of prepositional phrases:
- embedded: ...[pp ...[pp ...[pp ...[pp ...]]]]
  The dog buried the bone [pp behind the tree [pp in the garden [pp in front of the house [pp at the end of the street ]]]].
- independent: ...[pp ...][pp ...][pp ...][pp ...]
  The dog buried the bone [pp with his paws ] [pp under a stone ] [pp behind the tree ] [pp in the afternoon ].
Consider extensions to PP sequences of arbitrary length. Can both queries still be formulated in FOL? Try it!
Which query cannot be expressed in FOL? The extension of the embedded case cannot be expressed in FOL. Why?
Monadic Second Order Logic MSO
independent: ...[pp ...][pp ...][pp ...][pp ...]
The dog buried the bone [pp with his paws ] [pp under a stone ] [pp behind the tree ] [pp in the afternoon ] ... extension ...
[Tree diagram: S -> NP VP; NP -> D N (The dog); VP -> V NP PP PP ... PP (buried; the bone; with his paws; under a stone; ...), all PPs being sisters]
FOL: $\forall x: P(x)$
Monadic Second Order Logic MSO
embedded: ...[pp ...[pp ...[pp ...[pp ...]]]]
The dog buried the bone [pp behind the tree [pp in the garden [pp in front of the house [pp at the end of the street ... extension ... ]]]].
[Tree diagram: S -> NP VP; NP -> D N (The dog); VP -> V NP (buried); NP -> D N PP (the bone); each PP (behind the tree, in the garden, ...) contains the next PP]
- Transitive closure cannot be expressed in FOL!
- But it can be expressed in MSO! Remember Peano's axiom:
  $\forall P\,\big(((0 \in P) \wedge \forall x\,((x \in P) \rightarrow (x' \in P))) \rightarrow \forall x\,(x \in P)\big)$
- The transitive closure of dom(pp,np) yields a new relation; the highest PP and the most deeply embedded NP stand in this relation.
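The induction-style trick can be illustrated on a toy tree: x properly dominates y exactly if y belongs to every node set that contains the children of x and is closed under the child relation. The sketch below checks this set-based definition by brute force and compares it with an ordinary graph search. The tree, the node names and all function names are hypothetical and serve illustration only.

```python
from itertools import combinations

# Hypothetical toy tree: node -> list of children (a chain of embedded PPs)
children = {
    "VP": ["V", "NP1"],
    "V": [],
    "NP1": ["PP1"],
    "PP1": ["NP2"],
    "NP2": ["PP2"],
    "PP2": [],
}
nodes = list(children)

def all_subsets(items):
    """Enumerate all subsets: the range of a monadic second-order variable."""
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

def closed_under_child(X):
    return all(c in X for n in X for c in children[n])

def dominates_mso(x, y):
    # forall X: (children(x) subset of X and X closed under child) -> y in X
    return all(
        y in X
        for X in all_subsets(nodes)
        if set(children[x]) <= X and closed_under_child(X)
    )

def dominates_search(x, y):
    """Reference implementation: walk down from x looking for y."""
    stack = list(children[x])
    while stack:
        n = stack.pop()
        if n == y:
            return True
        stack.extend(children[n])
    return False

print(dominates_mso("NP1", "PP2"))  # True
```

The universal quantifier over sets does the work that a fixed-length first-order formula cannot do: it covers chains of any depth.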
[S [NP [D The] [N dog]] [VP [V buried] [NP [D the] [N bone] [PP [P behind] [NP [D the] [N tree] [PP [P in] [NP [D the] [N garden] [PP ... NP ...]]]]]]]]
WS2S
- Logic WS2S: the weak monadic second-order theory of 2 successors
- It generalizes WS1S: two successors (left and right) instead of one (+1)
- WS1S: an interpretation corresponds to strings; a first-order variable is interpreted as a natural number
- WS2S: an interpretation corresponds to finite labeled trees; a first-order variable is interpreted as a position in the infinite binary tree
- Accordingly, Mona (and thus MonaSearch) can run either in linear mode or in tree mode
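WS2S example
Prefix ordering (here restricted to k = 2 successors):
$x \le y \;\equiv\; \forall X.\,\Big(\big(y \in X \wedge \forall z.\,\big(\big(\bigvee_{i=1}^{k=2} z i \in X\big) \Rightarrow z \in X\big)\big) \Rightarrow x \in X\Big)$
- $y \in X$: the set X contains y
- $\forall z.\,(\ldots \Rightarrow z \in X)$: X is closed under predecessor
- In words: every set that contains y and is closed under predecessor also contains x.
Note: since the prefix ordering can be expressed by a WSkS formula, it does not have to be a primitive of the logic.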
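The definition can be read operationally: x is a prefix of y exactly if x lies in the smallest set that contains y and is closed under predecessor. A minimal sketch, assuming positions in the binary tree are encoded as strings over "0" and "1" with the empty string as root (this encoding and the function names are choices made for the example):

```python
def predecessor_closure(y):
    """Smallest set containing position y and closed under predecessor."""
    closure = {y}
    while y:          # the root is the empty string and has no predecessor
        y = y[:-1]    # drop the last successor step
        closure.add(y)
    return closure

def prefix_leq(x, y):
    """x <= y in the prefix ordering on tree positions."""
    return x in predecessor_closure(y)

print(prefix_leq("0", "01"), prefix_leq("1", "01"))  # True False
```

Membership in the smallest closed set is equivalent to membership in every closed set, which is what the universally quantified formula states.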
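Our Example in MSO
Example problem: we want to investigate how often the order verb/subject occurs in German or English sentences (yes/no questions).
$\exists x,y,z\,(\mathrm{cat}(x)=\mathrm{SIMPX} \wedge \mathrm{cat}(y)=\mathrm{VXFIN} \wedge \mathrm{fct}(z)=\mathrm{ON} \wedge x \triangleleft^{+} y \wedge x \triangleleft^{+} z \wedge y < z)$
($\triangleleft^{+}$: dominance, $<$: precedence)
There exists a node with category SIMPX, a node with category VXFIN and a node with function ON...
ON codes the grammatical function subject in TüBa-D/Z.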
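To see what the query asks for, it can be evaluated naively by trying all node triples of a tree. The sketch below does this with preorder numbering. It is not the MonaSearch algorithm, the `Node` class and function names are invented for the example, and the two trees are simplified assumptions loosely modelled on TüBa-D/Z labels.

```python
from dataclasses import dataclass, field
from itertools import product

@dataclass(eq=False)  # eq=False keeps identity hashing, so nodes can be dict keys
class Node:
    cat: str
    fct: str = ""
    children: list = field(default_factory=list)

def spans(root):
    """Map each node to (preorder number, last preorder number in its subtree)."""
    result, counter = {}, 0
    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        for child in node.children:
            visit(child)
        result[node] = (pre, counter - 1)
    visit(root)
    return result

def verb_before_subject(root):
    s = spans(root)
    def dominates(x, y):   # proper dominance
        return s[x][0] < s[y][0] <= s[x][1]
    def precedes(x, y):    # x's subtree ends before y's subtree starts
        return s[x][1] < s[y][0]
    return any(
        x.cat == "SIMPX" and y.cat == "VXFIN" and z.fct == "ON"
        and dominates(x, y) and dominates(x, z) and precedes(y, z)
        for x, y, z in product(s, repeat=3)
    )

# Simplified question: "Oder ist Bremerhaven nicht günstiger?"
question = Node("SIMPX", children=[
    Node("KOORD", children=[Node("KON")]),
    Node("LK", children=[Node("VXFIN", children=[Node("VAFIN")])]),
    Node("MF", children=[Node("NX", fct="ON", children=[Node("NE")])]),
])
# Hypothetical declarative counterpart with the subject first
declarative = Node("SIMPX", children=[
    Node("VF", children=[Node("NX", fct="ON", children=[Node("NE")])]),
    Node("LK", children=[Node("VXFIN", children=[Node("VAFIN")])]),
])
print(verb_before_subject(question), verb_before_subject(declarative))  # True False
```

Trying all triples is cubic in the tree size; compiling the query into a tree automaton avoids this enumeration.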
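MSO in the Querying process
Strong connection between MSO and tree automata:
- For an MSO formula there is a bottom-up tree automaton that accepts exactly the set of trees satisfying the formula.
- An algorithm constructs the automaton from the formula: MSO formula -> algorithm -> automaton.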
MSO in the Querying process
General evaluation strategy of MonaSearch:
- Step 1: Convert the user query into a TA
- Step 2: Run the TA on each tree in the treebank
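Step 2 can be sketched with a hand-written bottom-up automaton. The example assumes binary trees given as tuples `(label, left, right)` with `None` for a missing child, and a two-state automaton whose state 1 means "this subtree contains a PP". Both the tree representation and the automaton are hypothetical; Mona's compiled automata are represented differently.

```python
def run(tree, delta, initial):
    """Evaluate a bottom-up tree automaton; returns the state at the root."""
    if tree is None:
        return initial
    label, left, right = tree
    return delta(label, run(left, delta, initial), run(right, delta, initial))

def delta_contains_pp(label, q_left, q_right):
    """State 1 if the subtree contains a node labeled PP, else state 0."""
    return 1 if label == "PP" or q_left == 1 or q_right == 1 else 0

def search(treebank, delta, initial, accepting):
    """Return the indices of all trees accepted by the automaton."""
    return [i for i, tree in enumerate(treebank)
            if run(tree, delta, initial) in accepting]

treebank = [
    ("S", ("NP", None, None), ("VP", ("PP", None, None), None)),
    ("S", ("NP", None, None), None),
]
print(search(treebank, delta_contains_pp, 0, {1}))  # [0]
```

Each tree is visited once, so the cost of a query is linear in the size of the treebank once the automaton has been built.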
Part 3: Query Tool: Mona and MonaSearch
What is MonaSearch?
MonaSearch:
- query tool for linguistic treebanks
- query language: MSO
- uses Mona
What is MonaSearch?
Mona:
- tree automata toolkit that compiles MSO formulas into tree automata
- developed for hardware verification, but also applicable to querying treebanks
- implements pure monadic second-order logic of two successors (or of one successor), without extensions:
  - only binary trees
  - no node labels
Strategy to employ MONA to query treebanks
- Treebank side: trees -> transformation -> transformed trees
- Query side: MSO formula -> formula transformation -> Mona compiler -> special variant of TA
- The compiler writes a representation of the compiled automaton into a file
- A library function checks each transformed tree; the output states whether the tree satisfies the formula
How to do this transformation?
MonaSearch Precompilation of Treebanks
Problems of arbitrary trees: disconnected subparts
[Figure: a tree with root over a, b, c and a disconnected node d]
Solution (simplified structures): integrate disconnected subparts by introducing a new virtual root (super root) and connecting the disconnected subparts to it.
[Figure: super root over root and d; root over a, b, c]
MonaSearch Precompilation of Treebanks
Example tree of the TIGER corpus
[Figure: TIGER tree for "Damit sei jedoch nicht zu rechnen :" under the virtual root VROOT; labels shown: S, VP, VZ, OC, OP, MO, NG, PM; POS tags: PROAV VAFIN ADV PTKNEG PTKZU VVINF $.; morphology of "sei": 3.Sg.Pres.Konj.; lemmas: damit sein jedoch nicht zu rechnen :]
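MonaSearch Precompilation of Treebanks
Problems of arbitrary trees: crossing edges
[Figure: nodes x and y whose edges to the terminals a and b cross]
Solution (simplified structures): ignore crossing edges; take into account only the order of children as seen by their parents.
[Figure: the same nodes x and y with the terminals reordered as b, a so that no edges cross]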
MonaSearch Precompilation of Treebanks
[Figure: example tree with crossing edges]
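MonaSearch Precompilation of Treebanks
Problems of arbitrary trees: secondary relations
[Figure: a tree with nodes x, y, z over the terminals a, b and an additional secondary edge]
Solution (simplified structures): ignore secondary relations.
[Figure: the same tree without the secondary edge]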
MonaSearch Precompilation of Treebanks
Tasks of precompilation:
1. Trees with arbitrary branching have to be transformed into binary trees.
2. Linguistic labels have to be taken care of.
How can this be done?
- Use the First Child Next Sibling encoding.
- Move each edge label down to the node below the edge.
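MonaSearch Precompilation of Treebanks
Example tree of TüBa-D/Z
[Figure: tree for "Oder ist Bremerhaven nicht günstiger?" ("Or is Bremerhaven not more favourable?"); node labels: SIMPX, KOORD, LK, MF, VXFIN, NX, ADVX, ADJX; edge labels: ON, PRED, MOD; POS tags: KON VAFIN NE PTKNEG ADJD $.]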
First Child Next Sibling Encoding
Each node x of the original tree corresponds to a node x' of the binary tree:
- If x has any children, call its leftmost child y; then y' becomes the left child of x'.
- If x has any right siblings, call the leftmost one z; then z' becomes the right child of x'.
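The two rules translate almost literally into a recursive function. This is a minimal sketch that assumes original trees are given as `(label, [children])` pairs and binary trees as `(label, left, right)` tuples; both representations and the function name `fcns` are choices made for the example.

```python
def fcns(node, right_siblings=()):
    """First Child Next Sibling encoding of an arbitrarily branching tree.

    node: (label, [children]); right_siblings: the siblings to the right of node.
    Returns (label, left, right) with None for a missing child.
    """
    label, children = node
    # Rule 1: the leftmost child becomes the left child.
    left = fcns(children[0], children[1:]) if children else None
    # Rule 2: the leftmost right sibling becomes the right child.
    right = fcns(right_siblings[0], right_siblings[1:]) if right_siblings else None
    return (label, left, right)

tree = ("x", [("y", []), ("z", []), ("w", [])])
print(fcns(tree))
# ('x', ('y', None, ('z', None, ('w', None, None))), None)
```

Edge labels that have been moved down can simply be made part of the node label (for instance `"NX:ON"`), so the encoding itself does not need to know about them.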
Moving down of Edge Labels
[Figure, before: tree for "Oder ist Bremerhaven nicht günstiger?" with the edge labels ON, PRED and MOD on the edges]
[Figure, after: the same tree with each edge label moved to the node below its edge: NX carries ON, ADVX carries MOD, ADJX carries PRED]
Disconnected Subtrees
[Figure: the tree for "Oder ist Bremerhaven nicht günstiger?" with the punctuation token "?" (tag $.) not connected to SIMPX]
Virtual Root
[Figure: a virtual root is added above SIMPX and the punctuation node, so that the structure is a single connected tree]
First Child Next Sibling Encoding
[Figure sequence: the example tree under the virtual root is converted step by step into its binary coding; in each step the first child of a node becomes its left child and the next sibling becomes its right child]
Steps of Querying in Detail
1. Get an MSO query $\varphi$ on the original treebank from the user.
2. Translate the query into a MONA formula $\varphi'$ on binary trees (translation $\varphi \to \varphi'$).
3. Compile the MONA formula into a MONA tree automaton.
4. For each tree of the precompiled treebank:
   a. prepare the tree for the query;
   b. run the automaton on the translated tree, noting whether it is accepted or not.
5. Present the results to the user.
Querying
- Most constructs (Boolean connectives, quantification, some atomic relations) have a Mona counterpart that can be taken over directly.
- Problem: relations that express something about the shape of the tree, e.g. dominance and precedence.
- Solution: auxiliary predicates on the binary tree, defined in the MONA language:
  - dom(x,y): x dominates y in the binary tree
  - right_branch(x,y): y lies on the branch of right children starting at x
Querying: Translation of the parent, precedence and dominance relations
Relation     Formula                   Translation
parenthood   $x \triangleleft y$       ex1 z: x.0 = z & right_branch(z,y)
precedence   $x < y$                   ex1 z: (x = z | dom(z.0,x)) & dom(z.1,y)
dominance    $x \triangleleft^{+} y$   ex1 z: x.0 = z & dom(z,y)
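The translations can be tried out on a small First Child Next Sibling encoding. The sketch assumes that nodes of the binary tree are addressed by position strings over "0" (first child) and "1" (next sibling), and that dom and right_branch are reflexive, which the translations above appear to require. The tree, the representation and the Python function names are assumptions for illustration, not MonaSearch code.

```python
def fcns_positions(tree):
    """Return {binary position: label} for the FCNS encoding of (label, [children])."""
    table = {}
    def place(node, siblings, pos):
        label, children = node
        table[pos] = label
        if children:
            place(children[0], children[1:], pos + "0")
        if siblings:
            place(siblings[0], siblings[1:], pos + "1")
    place(tree, [], "")
    return table

def dom(x, y):            # reflexive dominance in the binary tree
    return y.startswith(x)

def right_branch(x, y):   # y is reached from x by right-child steps only
    return y.startswith(x) and set(y[len(x):]) <= {"1"}

def parent(x, y, nodes):      # ex1 z: x.0 = z & right_branch(z,y)
    z = x + "0"
    return z in nodes and right_branch(z, y)

def dominance(x, y, nodes):   # ex1 z: x.0 = z & dom(z,y)
    z = x + "0"
    return z in nodes and dom(z, y)

def precedence(x, y, nodes):  # ex1 z: (x = z | dom(z.0,x)) & dom(z.1,y)
    return any((x == z or dom(z + "0", x)) and dom(z + "1", y) for z in nodes)

tree = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", []), ("PP", [])])])
nodes = fcns_positions(tree)
# Positions: S "", NP "0", D "00", N "001", VP "01", V "010", PP "0101"
print(parent("", "01", nodes), dominance("0", "01", nodes), precedence("001", "010", nodes))
# True False True
```

In the encoding, the children of x are exactly the nodes on the right branch starting at the left child of x, which is why parenthood is expressed through right_branch.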
Querying: possible result of our example
Example problem: we want to investigate how often the order verb/subject occurs in German or English sentences (yes/no questions).
$\exists x,y,z\,(\mathrm{cat}(x)=\mathrm{SIMPX} \wedge \mathrm{cat}(y)=\mathrm{VXFIN} \wedge \mathrm{fct}(z)=\mathrm{ON} \wedge x \triangleleft^{+} y \wedge x \triangleleft^{+} z \wedge y < z)$
A possible result is the example tree of TüBa-D/Z:
[Figure: tree for "Oder ist Bremerhaven nicht günstiger?"; node labels: SIMPX, KOORD, LK, MF, VXFIN, NX, ADVX, ADJX; edge labels: ON, PRED, MOD; POS tags: KON VAFIN NE PTKNEG ADJD $.]
Guided Tree Automata
Question: Are normal bottom-up tree automata sufficient for deciding validity and generating counterexamples in WS2S?
Answer: Theoretically yes. But:
- Transition tables have an additional dimension compared to string automata, which adds an extra level of complexity.
- Problem: state space explosion.
- Mona's solution: a special kind of tree automata, called Guided Tree Automata.
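Guided Tree Automata
A guide $G = (D, \mu, d_0)$ is a top-down deterministic TA whose states are used to designate the state space names of bottom-up TAs:
- $D$: finite set of state space names
- $\mu : D \to D \times D$: guide function
- $d_0 \in D$: initial state space name
A Guided Tree Automaton (GTA) $M_G$ with guide $G$ is a set of bottom-up tree automata:
$M_G = (\{Q_d\}_{d \in D},\ \Sigma,\ \{\delta_d\}_{d \in D},\ \{q_d\}_{d \in D},\ F)$
The guide function is used in the transition functions: for $\mu(d) = (d_1, d_2)$, $\delta_d$ maps states from $Q_{d_1}$ and $Q_{d_2}$ (together with a letter of $\Sigma$) to a state in $Q_d$.
How to use this guide?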
Guided Tree Automata
How to use this guide? Given a tree t, a GTA processes t in two steps:
1. A state space is assigned to every node of t: the tree is labeled top-down with state space names according to the guide function.
2. Each subtree of the resulting tree is assigned a state, bottom-up.
- A GTA can be seen as an ordinary tree automaton whose state space has been factorized according to the guide.
- A GTA with only one state space is an ordinary tree automaton.
Guided Tree Automata
The guide is defined in the header with the guide construct.
Example:
    guide a -> (b,c), b -> (d,e), c -> (c,c), d -> (d,d), e -> (e,f), f -> (f,f);
- initial state space: a
- a, b, c, d, e: state spaces
- (Boolean state space, reserved for Boolean variables)
Variables can also be restricted to state spaces by means of universes.
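The top-down labeling step can be illustrated with the guide from the example. The sketch assumes the guide is stored as a dictionary from a state space name to the pair of names for the left and right successor, and that tree positions are strings over "0" and "1"; the function name `state_space` is made up for the example.

```python
# Guide from the example: state space name -> (left successor, right successor)
guide = {
    "a": ("b", "c"),
    "b": ("d", "e"),
    "c": ("c", "c"),
    "d": ("d", "d"),
    "e": ("e", "f"),
    "f": ("f", "f"),
}

def state_space(position, guide, initial):
    """State space name assigned to a tree position by following the guide top-down."""
    d = initial
    for step in position:          # "0" = left successor, "1" = right successor
        d = guide[d][int(step)]
    return d

for pos in ["", "0", "1", "01", "011"]:
    print(repr(pos), state_space(pos, guide, "a"))
# '' a, '0' b, '1' c, '01' e, '011' f
```

Because the guide is deterministic, every position gets exactly one state space, so the bottom-up run only ever needs the transition table of that state space.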
Example: Exponential Savings with Guides
Without a guide:
    ws2s;
    var2 A,B;
    # A consists of five positions ordered by <
    ex1 p1,p2,p3,p4,p5: p1<p2 & p2<p3 & p3<p4 & p4<p5 & A = {p1,p2,p3,p4,p5};
    # B consists of seven positions ordered by <
    ex1 p1,p2,p3,p4,p5,p6,p7: p1<p2 & p2<p3 & p3<p4 & p4<p5 & p5<p6 & p6<p7 & B = {p1,p2,p3,p4,p5,p6,p7};
With a guide and separate universes for A and B:
    ws2s;
    guide d0 -> (a,b), a -> (a,a), b -> (b,b);
    universe ua:0, ub:1;
    var2 [ua] A;
    var2 [ub] B;
    ex1 [ua] p1,p2,p3,p4,p5: p1<p2 & p2<p3 & p3<p4 & p4<p5 & A = {p1,p2,p3,p4,p5};
    ex1 [ub] p1,p2,p3,p4,p5,p6,p7: p1<p2 & p2<p3 & p3<p4 & p4<p5 & p5<p6 & p6<p7 & B = {p1,p2,p3,p4,p5,p6,p7};
Performance: Comparison of TIGERSearch, fsq and MonaSearch
Queries:
1. An NP dominating an S dominating a PP
2. An NP dominating an S dominating a PP and an NP, which do not dominate each other
3. Sentences where the verb precedes the subject
4. An NP not dominating an S which dominates a PP
5. A PP dominating an NP which is part of a chain of embedded PPs
Time in seconds; each cell gives TIGER treebank / TüBa-D/Z (red / green on the slide); "-" marks a cell without a value.
Query   TIGERSearch   fsq         MonaSearch
1       5.5 / 5.5     23 / 13.5   15 / 10
2       9 / 5.5       23 / 13.5   15 / 10
3       15 / 16       23 / 13.5   15 / 10
4       -             23 / 13.5   15 / 10
5       -             -           15 / 10
Summary
We considered:
- Treebanks
- Query language: MSO
- Query tool: MonaSearch
Conclusions
MonaSearch offers:
- very high expressive power through MSO
- a high-performance query engine
- the fastest query system for advanced queries
THANK YOU FOR YOUR ATTENTION
QUESTIONS?