Semantic Tree Kernels to classify Predicate Argument Structures

Semantic Tree Kernels to classify Predicate Argument Structures. Alessandro Moschitti 1, Bonaventura Coppola 2, Daniele Pighin 3 and Roberto Basili 4.

Abstract. Recent work on Semantic Role Labeling (SRL) has shown that syntactic information is critical to detect and extract predicate argument structures. As syntax is expressed by means of structured data, i.e. parse trees, its encoding in learning algorithms is rather complex. In this paper, we apply tree kernels to encode the whole predicate argument structure in Support Vector Machines (SVMs). We extract from the sentence syntactic parse the subtrees that span the potential argument structures of the target predicate and classify them as incorrect or correct structures by means of tree-kernel-based SVMs. Experiments on the PropBank collection show that the classification accuracy of correct/incorrect structures is remarkably high and helps to improve the accuracy of the SRL task. This is evidence that tree kernels provide a powerful mechanism to learn the complex relation between syntax and semantics.

1 INTRODUCTION

The design of features for natural language processing tasks is, in general, a critical problem. The inherent complexity of linguistic phenomena, often characterized by structured data, makes it difficult to find effective attribute-value representations for the target learning models. In many cases, traditional feature selection techniques [8] are not very useful since the critical problem relates to feature generation rather than selection. For example, the design of features for a natural language syntactic parse-tree re-ranking problem [2] cannot be carried out without deep knowledge about automatic syntactic parsing. The modeling of syntax/semantics-based features should take into account linguistic aspects to detect the interesting context, e.g. the ancestor nodes or the semantic dependencies [15]. A viable alternative has been proposed in [3], where convolution kernels were used to implicitly define a tree substructure space.
The selection of the relevant structural features was left to the voted perceptron learning algorithm. Such successful experimentation shows that tree kernels are very promising for automatic feature engineering, especially when the available knowledge about the phenomenon is limited. In a similar way, automatic learning tasks that rely on syntactic information may take advantage of a tree kernel approach. One of such tasks is Semantic Role Labeling (SRL), as defined e.g. in [1] over the PropBank corpus [7]. Most literature work models SRL as the classification of the nodes of the sentence parse tree containing the target predicate. Indeed, a node can uniquely determine the set of words that compose an argument (the boundaries) and provide, along with the local tree structure, information useful to the classification of the role. Accordingly, most SRL systems split the labeling process into two different steps: Boundary Detection (i.e. determine the text boundaries of predicate arguments) and Role Classification (i.e. labeling such arguments with a semantic role, e.g. Arg0 or Arg1). Both the above steps require the design and extraction of features from the parse tree. Capturing the complex interconnected relationships among a predicate and its arguments is a hard task. To decrease such complexity, we can design features considering a predicate with only one argument at a time, but this limits our ability to capture the semantics of the whole predicate structure. An alternative approach to engineer syntactic features is the use of tree kernels, as the substructures that they generate potentially correspond to relevant syntactic clues. In this paper, we use tree kernels to model classifiers that decide if a predicate argument structure is correct or not.

1 University of Rome Tor Vergata, moschitti@info.uniroma2.it
2 ITC-Irst and University of Trento, coppolab@itc.it
3 University of Rome Tor Vergata, daniele.pighin@gmail.com
4 University of Rome Tor Vergata, basili@info.uniroma2.it
We apply a traditional boundary classifier (TBC) [11] to label all the parse tree nodes that are potential arguments, then we classify the syntactic subtrees which span the predicate-argument dependencies, i.e. the Predicate Argument Spanning Trees (PASTs). Since the design of effective features to encode such information is not simple, tree kernels are a very useful method. To validate our approach we experimented tree kernels with Support Vector Machines for the classification of PASTs. The results show that this classification problem can be learned with high accuracy (about 88% F1-measure 5) and the impact on the overall SRL labeling accuracy is also relevant. The paper is organized as follows: Section 2 introduces Semantic Role Labeling based on SVMs and tree kernel spaces; Section 3 formally defines PASTs and the algorithm to classify them; Section 4 shows the comparative results between our approach and the traditional one; Section 5 presents the related work; and finally, Section 6 summarizes the conclusions.

2 SEMANTIC ROLE LABELING

In the last years, several machine learning approaches have been developed for automatic role labeling, e.g. [5, 11]. Their common characteristic is the adoption of attribute-value representations for predicate-argument structures. Accordingly, our basic system is similar to the one proposed in [11] and it is hereby described. We use a boundary detection classifier (for any role type) to derive the words compounding an argument, and a multiclassifier to assign the role (e.g. ARG0 or ARGM) described in PropBank [7]. To prepare the training data for both classifiers, we used the following algorithm: 1. Given a sentence from the training-set, generate a full syntactic parse tree;

5 F1 assigns equal importance to Precision P and Recall R, i.e. F1 = 2PR/(P+R).
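The F1 measure defined in the footnote is simply the harmonic mean of Precision and Recall; a minimal sketch:

```python
def f1(precision, recall):
    """F1 = 2PR / (P + R): the harmonic mean of precision and recall,
    giving the two quantities equal importance."""
    return 2 * precision * recall / (precision + recall)

# For instance, 87.1% precision and 89.2% recall give about 88.1% F1.
print(round(f1(0.871, 0.892), 3))
```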

Figure 1. A sentence parse tree with two predicative subtree structures (PASTs).

2. Let P and A be respectively the set of predicates and the set of parse-tree nodes (i.e. the potential arguments); 3. For each pair <p, a> in P x A: (a) extract the feature representation set, F_{p,a}; (b) if the subtree rooted in a covers exactly the words of one argument of p, put F_{p,a} in the T+ set (positive examples), otherwise put it in the T- set (negative examples). The outputs of the above algorithm are the T+ and T- sets. For the Boundary Detection subtask, these can be directly used to train a boundary classifier (e.g. an SVM). Concerning the Role Classification subtask, the generic binary role labeler for role r (e.g. an SVM) can be trained on Tr+, i.e. its positive examples, and Tr-, i.e. its negative examples, where T+ = Tr+ U Tr-, according to the ONE-vs-ALL scheme. The binary classifiers are then used to build a general role multiclassifier by simply selecting the argument associated with the maximum among the classification scores resulting from the individual binary SVM classifiers. Regarding the design of features for predicate-argument pairs, we can use the attribute-values defined in [5] or tree structures [10]. Although we focus on the latter approach, a short description of the former is still relevant as they are used by the TBC. They include the Phrase Type, Predicate Word, Head Word, Governing Category, Position and Voice features. For example, the Phrase Type indicates the syntactic type of the phrase labeled as a predicate argument, and the Parse Tree Path contains the path in the parse tree between the predicate and the argument phrase, expressed as a sequence of nonterminal labels linked by direction (up or down) symbols. A viable alternative to the manual design of syntactic features is the use of tree-kernel functions. These implicitly define a feature space based on all the possible tree substructures. Given two trees T1 and T2, instead of representing them with the whole fragment space, we can apply a kernel function to evaluate the number of common fragments.
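The ONE-vs-ALL selection described above (taking the arg-max over the binary SVM scores) can be sketched as follows; the stub scoring functions stand in for trained SVM decision functions and are purely illustrative:

```python
def classify_role(candidate, binary_classifiers):
    """ONE-vs-ALL role assignment: each role has a binary classifier whose
    signed decision score says how confidently the candidate fills that role;
    the role with the maximum score is selected."""
    best_role, best_score = None, float("-inf")
    for role, clf in binary_classifiers.items():
        score = clf(candidate)  # signed decision value of the binary SVM
        if score > best_score:
            best_role, best_score = role, score
    return best_role

# Toy usage with stub scoring functions in place of trained SVMs:
stubs = {"ARG0": lambda x: 0.7, "ARG1": lambda x: -0.2, "ARGM": lambda x: 0.1}
print(classify_role("candidate-node", stubs))
```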
Formally, given a tree fragment space F = {f_1, f_2, ..., f_|F|}, the indicator function I_i(n) is defined, which is equal to 1 if the target fragment f_i is rooted at node n and equal to 0 otherwise. A tree-kernel function over T1 and T2 is K(T1, T2) = sum over n1 in N_T1 and n2 in N_T2 of D(n1, n2), where N_T1 and N_T2 are the sets of T1's and T2's nodes, respectively. In turn, D(n1, n2) = sum for i = 1 to |F| of lambda^l(f_i) * I_i(n1) * I_i(n2), where 0 <= lambda <= 1 and l(f_i) is the number of levels of the subtree f_i. Thus lambda^l(f_i) assigns a lower weight to larger fragments. When lambda = 1, D is equal to the number of common fragments rooted at nodes n1 and n2. As described in [3], D can be computed in O(|N_T1| x |N_T2|).

3 AUTOMATIC CLASSIFICATION OF PREDICATE ARGUMENT STRUCTURES

Most semantic role labeling models rely only on features extracted from the current candidate argument node. To consider a complete predicate argument structure, the classifier should formulate a hypothesis on the potential parse-tree node subsets which include the argument nodes of the target predicate. Without boundary information, we should consider all the possible tree node subsets, i.e. an exponential number. To solve such problems we apply a traditional boundary classifier (TBC) to select the set of potential arguments PA. Such a subset can be associated with a subtree which in turn can be classified by means of a tree kernel function. Intuitively, such a function measures to what extent a given candidate subtree is compatible with the subtree of a correct predicate argument structure.

3.1 The Predicate Argument Spanning Trees (PASTs)

We consider the predicate argument structures annotated in PropBank along with the corresponding TreeBank data as our object space. Given the target predicate p in a sentence parse tree T and a subset s = {n_1, .., n_k} of its nodes, N_T, we define as the spanning tree root r the lowest common ancestor of n_1, .., n_k. The node spanning tree p_s is the subtree rooted in r, from which the nodes that are neither ancestors nor descendants of any n_i are removed. Since predicate arguments are associated with tree nodes (i.e.
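The kernel need not enumerate the fragment space F explicitly: as shown in [3], D(n1, n2) admits an efficient recursive computation over node pairs. A sketch of that recursion, assuming a simple (label, children...) tuple representation for trees; the representation and function names are ours, for illustration only:

```python
def nodes(t):
    """All internal nodes of a (label, child, ...) tuple tree; leaves are strings."""
    if not isinstance(t, tuple):
        return []
    return [t] + [n for c in t[1:] for n in nodes(c)]

def production(t):
    """The grammar production expanded at a node, e.g. NP -> D N."""
    return (t[0], tuple(c[0] if isinstance(c, tuple) else c for c in t[1:]))

def delta(n1, n2, lam):
    """Lambda-weighted count of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    if all(not isinstance(c, tuple) for c in n1[1:]):  # pre-terminal node
        return lam
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):  # leaf children contribute a factor of 1
            score *= 1.0 + delta(c1, c2, lam)
    return score

def tree_kernel(t1, t2, lam=0.4):
    """K(T1, T2): sum of delta over all node pairs of the two trees."""
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))

t = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "runs")))
print(tree_kernel(t, t, lam=1.0))  # with lambda = 1: common-fragment count
```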
they exactly fit into syntactic constituents), we can define the Predicate Argument Spanning Tree (PAST) of a predicate argument set, {a_1, .., a_n}, as the node spanning tree (NST) over such nodes, i.e. p_{a_1,..,a_n}. A PAST corresponds to the minimal sub-parse tree whose leaves are all and only the words compounding the arguments. For example, Figure 1 shows the parse tree of a sentence with two predicates: the two PAST structures, e.g. took{ARG0,ARG1}, are associated with the two predicates, respectively. All the other possible NSTs are not valid PASTs for these predicates. Note that labeling p_s, for any s subset of N_T, with a PAST Classifier is equivalent to solving the boundary detection problem. The critical points for the application of PASTs are: (1) how to design suitable features for the characterization of PASTs. This new structure requires a careful linguistic investigation about its significant properties. (2) How to deal with the exponential number of NSTs. For the first problem, the use of tree kernels over the PASTs can be an alternative to manual feature design, as the learning machine (e.g. SVMs) can select the most relevant features from a high dimensional feature space. In other words, we can use a tree kernel function to estimate the similarity between two PASTs (see Section 2), hence avoiding the definition of explicit features. For the second problem there are two main approaches: (1) We can consider the classification confidence provided by the TBC [11] and evaluate the m most probable argument node sequences {n_1, .., n_k}. On the m NSTs derived from such sequences, we can apply a re-ranking approach based on SVMs with tree kernels. (2) We can use only the set of nodes PA decided by the TBC (i.e. those classified as
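The node spanning tree construction just defined (lowest common ancestor plus pruning of the nodes unrelated to any n_i) can be sketched as follows, assuming parse-tree nodes with parent pointers; the Node class is an illustrative stand-in, not the paper's implementation:

```python
class Node:
    """Minimal parse-tree node with a parent pointer (illustrative only)."""
    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)
        self.parent = None
        for c in children:
            c.parent = self

def ancestors(n):
    """The path from n up to the tree root, n included."""
    out = []
    while n is not None:
        out.append(n)
        n = n.parent
    return out

def node_spanning_tree(targets):
    """Return (r, keep): r is the lowest common ancestor of the target
    nodes; keep is the node set of p_s, i.e. every node that is an
    ancestor (up to r) or a descendant of some target node."""
    paths = [ancestors(t) for t in targets]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    root = max(common, key=lambda n: len(ancestors(n)))  # deepest common node
    keep = {root}
    for t in targets:
        n = t
        while n is not root:        # ancestors of t below the root
            keep.add(n)
            n = n.parent
        stack = [t]                 # all descendants of t
        while stack:
            d = stack.pop()
            keep.add(d)
            stack.extend(d.children)
    return root, keep
```

Nodes outside `keep` are exactly those "neither ancestors nor descendants of any n_i" that the definition removes.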

Figure 2. Two-step boundary classification. a) Sentence tree; b) Two candidate PASTs; c) Extended PAST-Ord labeling.

arguments). Thus we need to classify only the set P of NSTs associated with any subset of PA, i.e. P = {p_s : s subset of PA}. As a re-ranking task would not give an explicit and clear indication of the classifier's ability to distinguish between correct and incorrect predicate argument structures, we preferred to apply the second approach. However, also the classification of P may be computationally problematic, since theoretically there are |P| = 2^|PA| members. In order to develop a very efficient procedure, we applied the PAST Classifier only to structures that we know to be incorrect. A simple way to detect them is to look for node pairs <n1, n2> in PA x PA involving overlapping nodes, i.e. either n1 is ancestor of n2 or vice versa. Note that the structures that contain overlapping nodes often also contain correct substructures, i.e. subsets of PA associated with a correct PAST. Assuming the above hypothesis, we create two node sets, PA1 = PA - {n1} and PA2 = PA - {n2}, and classify them with the PAST Classifier to select the correct set of argument boundaries. This procedure can be generalized to a set of overlapping nodes greater than 2 by selecting a maximal set of non-overlapping nodes. Additionally, as the Precision of the TBC is generally high, the number of overlapping nodes is very small, thus we can explore the whole space. Figure 2 shows a working example of the multi-stage classifier. In Frame (a), the TBC labels as potential arguments (gray color) three overlapping nodes related to ARG1. This leads to two possible solutions (Frame (b)), of which only the first is correct. In fact, according to the second one, the prepositional phrase would be incorrectly attached to the verbal predicate, i.e. in contrast with the parse tree. The PAST Classifier, applied to the two NSTs, is expected to detect this inconsistency and provide the correct output.
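The overlap-resolution step (building the two alternatives PA1 = PA - {n1} and PA2 = PA - {n2} from an overlapping node pair) can be sketched as follows; the minimal TreeNode class is only an illustrative stand-in for real parse-tree nodes:

```python
class TreeNode:
    """Minimal parse-tree node with a parent pointer (illustrative only)."""
    def __init__(self, parent=None):
        self.parent = parent

def is_ancestor(a, b):
    """True if node a dominates node b in the parse tree."""
    while b is not None:
        b = b.parent
        if b is a:
            return True
    return False

def overlap_alternatives(pa):
    """Given the boundary nodes PA selected by the TBC, look for an
    overlapping pair (one node dominating the other) and return the two
    candidate sets PA - {n1} and PA - {n2}; None if no overlap exists.
    The PAST Classifier then decides which alternative is correct."""
    pa = list(pa)
    for i, n1 in enumerate(pa):
        for n2 in pa[i + 1:]:
            if is_ancestor(n1, n2) or is_ancestor(n2, n1):
                return [set(pa) - {n1}, set(pa) - {n2}]
    return None
```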
3.2 Designing Features with Tree Fragments

Frame (b) of Figure 2 shows two perfectly identical NSTs; therefore, it is not possible to discern between them using only their fragments. To solve the problem we can enrich the NSTs by marking their argument nodes with a progressive number, starting from the leftmost argument. For example, in the first NST of Frame (c), we mark the first and second argument nodes with the suffixes -0 and -1, whereas in the second NST we transform the three argument node labels with the suffixes -0, -1 and -2. We will refer to the resulting structure as a PAST-Ord (ordinal number). This simple modification allows the tree kernel to generate different argument structures for the above NSTs: the fragments rooted in the marked nodes of the first NST in Figure 2(c) no longer match the fragments generated from the second NST. We also explored another relevant direction in enriching the feature space. It should be noted that the semantic information provided by the role type can remarkably help the detection of correct or incorrect predicate argument structures. Thus, we enrich the argument node label with the role type, e.g. the -0 and -1 of the correct PAST of Figure 2(c) become -Arg0 and -Arg1 (not shown in the figure). We refer to this structure as PAST-Arg. Of course, to apply the PAST-Arg Classifier, we need a traditional role multiclassifier (TRM) which labels the arguments detected by the TBC.

4 THE EXPERIMENTS

The experiments were carried out within the setting defined in the CoNLL-2005 Shared Task [1]. We used the PropBank corpus, along with the Penn TreeBank 2 for gold trees [9], which includes about 53,700 sentences. Since the experiments over gold parse trees inherently overestimate the accuracy in the semantic role labeling task, e.g. 93% vs. 79% [11], we also adopted the Charniak parse trees from the CoNLL 2005 Shared Task data, along with the official performance evaluator. All the experiments were performed with the SVM-light software [6], available at svmlight.joachims.org.
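The PAST-Ord and PAST-Arg markings described above can be sketched as follows; nodes are represented here as plain dictionaries purely for illustration:

```python
def past_ord(argument_nodes):
    """PAST-Ord: mark the argument node labels with a progressive ordinal,
    left to right, so that two otherwise identical NSTs produce different
    tree-kernel fragments."""
    for i, node in enumerate(argument_nodes):  # assumed given left-to-right
        node["label"] += f"-{i}"

def past_arg(argument_nodes, roles):
    """PAST-Arg: mark the argument node labels with the role type instead;
    the roles come from a traditional role multiclassifier (TRM)."""
    for node, role in zip(argument_nodes, roles):
        node["label"] += f"-{role}"
```

After `past_ord`, two argument nodes that shared the same label become, e.g., label-0 and label-1, and the kernel no longer matches their fragments.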
For the TBC and the TRM, we used the linear kernel with a regularization parameter (option -c) equal to 1. A cost factor (option -j) of 10 was adopted for the TBC to have a higher Recall, whereas for the TRM the cost factor was parameterized according to the maximal accuracy of each argument class on the validation set. For the PAST Classifier, we implemented the tree kernel defined in [3] inside SVM-light, with a lambda equal to 0.4 (see [10]).

4.1 Gold Standard Tree Evaluations

In these experiments, we used the sections from 02 to 08 of TreeBank/PropBank (54,443 argument nodes and 1,343,046 non-argument nodes) to train the traditional boundary classifier (TBC). Then, we applied it to classify the sections from 09 to 21 (125,443 argument nodes vs. 3,010,673 non-argument nodes). As a result, we obtained 2,988 NSTs containing at least one overlapping node pair, out of a total of 65,212 predicate structures (according to the TBC decisions). From the 2,988 overlapping structures we further generated 3,624 positive and 4,461 negative NSTs, which we used to train the PAST-Ord Classifier.

Table 1. Two-step boundary classification performance (Precision, Recall and F1) using TBC, the RND and HEU baselines, and the PAST-Ord classifier, on all structures and on the overlapping structures only.

Table 2. Semantic Role Labeling performance on automatic trees using the PAST classifiers: boundary detection (bnd) and boundary plus role classification (bnd+class) on Sections 21 and 23.

The performance was evaluated through the F1 measure over Section 23, which includes 10,406 argument nodes out of 249,879 parse tree nodes. After applying the TBC classifier, we detected 235 overlapping NSTs, from which we extracted 204 correct PASTs and 385 incorrect ones. On such gold standard trees, we measured only the performance of the PAST-Arg Classifier, which was very high, i.e. 87.1% in Precision and 89.2% in Recall (88.1% F1). Using the PAST-Ord Classifier, we removed from the TBC outcome the PA sets that caused overlaps. To measure the impact on the boundary detection task, we compared it with three different boundary classification baselines: 1. TBC: overlaps are ignored and no decision is taken. This provides an upper bound for Recall, as no potential argument is rejected for later labeling. Notice that, in the presence of overlapping nodes, the sentence cannot be annotated correctly. 2. RND: one among the non-overlapping structures with the maximal number of arguments is randomly selected. 3. HEU (heuristic): one of the NSTs which contains the nodes with the lowest overlapping score is chosen. This score counts the number of overlapping node pairs in the NST. For example, in Figure 2(a) we have a node that overlaps with two other nodes, thus it is assigned a score of 2. The third row of Table 1 shows the results of TBC, TBC+RND, TBC+HEU and TBC+PAST-Ord in columns 2, 3, 4 and 5, respectively. We note that: First, the TBC F1 is slightly higher than the result obtained in [11], i.e. 95.4% vs. 93.8%, under the same training/testing conditions (i.e. the same PropBank version, the same training and testing split and the same machine learning algorithm).
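The HEU overlapping score (the count of overlapping node pairs in an NST) can be sketched as follows, again with a minimal parent-pointer node standing in for real parse-tree nodes:

```python
class PNode:
    """Minimal parse-tree node with a parent pointer (illustrative only)."""
    def __init__(self, parent=None):
        self.parent = parent

def dominates(a, b):
    """True if node a is an ancestor of node b."""
    while b is not None:
        b = b.parent
        if b is a:
            return True
    return False

def overlap_score(nst_nodes):
    """HEU heuristic: count the overlapping node pairs in an NST; the
    candidate NST with the lowest score is kept."""
    ns = list(nst_nodes)
    return sum(
        1
        for i, a in enumerate(ns)
        for b in ns[i + 1:]
        if dominates(a, b) or dominates(b, a)
    )
```

As in the text's example, a node overlapping with two others contributes two pairs, hence a score of 2.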
This is explained by the fact that we did not include continuations and co-referring arguments, which are more difficult to detect. Second, both RND and HEU do not improve the TBC result. This can be explained by observing that in 50% of the cases a correct node is removed. Third, when the PAST-Ord Classifier is used to select the correct node, the F1 increases by 1.49%, i.e. 96.86 vs. 95.37. This is a relevant result, as it is difficult to improve on the very high baseline given by TBC. Finally, we tested the above classifiers on the overlapping structures only, i.e. we measured the PAST-Ord Classifier improvement on all and only the structures that required its application. Such a reduced test set contains 642 argument nodes and 15,408 non-argument nodes. The fourth row of Table 1 reports the classifier performance on such task. We note that the PAST-Ord Classifier improves on the other heuristics by about 20%.

4.2 Automatic Tree Evaluations

In these experiments we used the automatic trees generated by Charniak's parser and the predicate argument annotations defined in the CoNLL 2005 shared task. Again, we trained the TBC on sections 02-08 whereas, to achieve a very accurate role classifier, we trained the TRM on all the training sections. Then, we trained the PAST, PAST-Ord, and PAST-Arg Classifiers on the output of the TBC and TRM over the training sections, for a total of 183,642 arguments, 30,220 PASTs and 28,143 incorrect PASTs.

Table 3. PAST, PAST-Ord, and PAST-Arg performance (Precision, Recall and F1) on Sections 21 and 23.

First, to test the TBC, TRM and PAST classifiers, we used Section 23 (17,429 arguments, 2,159 PASTs and 3,461 incorrect PASTs) and Section 21 (12,495 arguments, 1,975 PASTs and 2,220 incorrect PASTs). The performance derived on Section 21 corresponds to an upper bound for our classifiers, i.e. the results using an ideal syntactic parser (the Charniak parser was trained also on Section 21) and an ideal role classifier. They provide the PAST family of classifiers with accurate syntactic and semantic information. Table 3 shows the Precision, Recall and F1 measures of the PAST classifiers over the NSTs of Sections 21 and 23.
Rows 2, 3 and 4 report the performance of the PAST, PAST-Ord, and PAST-Arg Classifiers, respectively. Several points should be remarked: (a) the general performance is lower than the one achieved on gold trees with PAST-Ord, i.e. 88.1% (see Section 4.1). The impact of parsing accuracy is also confirmed by the gap of about 6 percentage points between Sections 21 and 23. (b) The ordinal numbering of the arguments (Ord) and the role type information (Arg) provide the tree kernel with more meaningful fragments, since they improve the basic model by about 4%. (c) The deeper semantic information generated by the Arg labels provides useful clues to select correct predicate argument structures, since it improves the Ord model on both sections. Second, we measured the impact of the PAST classifiers on both phases of semantic role labeling. Table 2 reports the results on the two Sections 21 and 23. For each of them, the Precision, Recall and

F1 of the different approaches to boundary identification (bnd) and to the complete task, i.e. boundary and role classification (bnd+class), are shown. Such approaches are based on different strategies to remove the overlaps, i.e. PAST, PAST-Ord, PAST-Arg and the baseline (RND), which uses a random selection of non-overlapping structures. We needed to remove the overlaps from the baseline in order to apply the CoNLL evaluator. We note that: (a) for any model, the boundary detection F1 on Section 21 is about 10 points higher than the F1 on Section 23 (e.g. 87.0% vs. 77.9% for RND). As expected, the parse tree quality is very important for detecting the argument boundaries. (b) On the real test (Section 23), the role classification introduces labeling errors which decrease the accuracy by about 5% (77.9 vs. 72.9 for RND). (c) The Ord and Arg approaches constantly improve the baseline F1 by about 1%. Such a result is not surprising, as it is similar to the one obtained on gold trees: the overlapping structures are a small percentage of the test set, thus the overall impact cannot be very high. Third, a comparison with the CoNLL 2005 results [1] can only be carried out with respect to the whole SRL task (bnd+class in Table 2), since the separation of boundary detection versus role classification is generally not provided in CoNLL 2005. Moreover, our best global result, i.e. 73.9%, was obtained under two severe experimental factors: a) the use of just 1/3 of the available training set, and b) the usage of a linear SVM model for the TBC classifier, which is much faster than polynomial SVMs but also less accurate. However, we note the promising results of the PAST meta-classifier, which can be used in combination with any of the best CoNLL systems. Finally, the kernel outcome suggests that: (a) it is robust to parse tree errors, since it preserves the same improvement across trees derived with different accuracy, i.e. the gold trees of the Penn TreeBank and the automatic trees of Section 21 and Section 23. (b) It shows a high accuracy for the classification of correct and incorrect predicate argument structures. This last property is quite interesting considering the important findings of a recent paper [13].
The winning strategy to improve semantic role labeling relates to exploiting different labeling hypotheses, i.e. several PA_i sets derived from different parsing alternatives. A joint inference procedure was used to select the most likely set s in the union of the PA_i. In our opinion, the PAST Classifiers seem very well suited for selecting such a set.

5 RELATED WORK

Recently, many kernels for natural language applications have been designed. In what follows, we highlight their differences and properties. The tree kernel used in this article was proposed in [3] for syntactic parsing re-ranking. It was experimented with the Voted Perceptron and was shown to improve syntactic parsing. In [4], a feature description language was used to extract structural features from the syntactic shallow parse trees associated with named entities. The experiments on named entity categorization showed that, when the description language selects an adequate set of tree fragments, the Voted Perceptron algorithm increases its classification accuracy. The explanation was that the complete tree fragment set contains many irrelevant features and may cause overfitting. In [13], a set of different syntactic parse trees, e.g. the Charniak n-best parse trees, were used to improve the SRL accuracy. These different sources of syntactic information were used to generate a set of different SRL outputs. A joint inference stage was applied to resolve the inconsistencies among the different outputs. This approach may be applied to our tree kernel strategies to design a joint tree kernel model. In [14], it was observed that there are strong dependencies among the labels of the semantic argument nodes of a verb. Thus, to approach the problem as the classification of an overall role sequence, a re-ranking method is applied to the assignments generated by a TRM. This approach is in line with our PAST Classifier, which can be used to refine such a re-ranking strategy. In [12], some experiments were conducted on SRL systems trained using different syntactic views.
Again, our approach may be used in conjunction with this model to provide a further syntactic view related to the whole predicate argument structure.

6 CONCLUSIONS

The feature design for new natural language learning tasks is difficult. We can take advantage of kernel methods to model our intuitive knowledge about the target linguistic phenomenon. In this paper we have shown that we can exploit the properties of tree kernels to engineer syntactic features for the semantic role labeling task. The experiments on gold standard trees as well as on automatic trees suggest that (1) the information related to the whole predicate argument structure is important and (2) tree kernels can be used to generate syntactic/semantic features. The remarkable result is that such kinds of structures are robust with respect to parse tree errors. In the future, we would like to use an approach similar to the PAST classifier to select the best predicate argument annotation from those carried out on several parse trees provided by one or more parsing models.

REFERENCES

[1] Xavier Carreras and Lluís Màrquez, Introduction to the CoNLL-2005 shared task: Semantic role labeling, in Proceedings of CoNLL-2005, (2005).
[2] Michael Collins, Discriminative reranking for natural language parsing, in Proceedings of ICML 2000, (2000).
[3] Michael Collins and Nigel Duffy, New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron, in Proceedings of ACL 2002, (2002).
[4] Chad Cumby and Dan Roth, Kernel methods for relational learning, in Proceedings of ICML 2003, Washington, DC, USA, (2003).
[5] Daniel Gildea and Daniel Jurafsky, Automatic labeling of semantic roles, Computational Linguistics, 28(3), (2002).
[6] T. Joachims, Making large-scale SVM learning practical, in Advances in Kernel Methods - Support Vector Learning, eds., B. Schölkopf, C. Burges, and A. Smola, (1999).
[7] Paul Kingsbury and Martha Palmer, From Treebank to PropBank, in Proceedings of LREC 2002, Las Palmas, Spain, (2002).
[8] Ron Kohavi and Dan Sommerfield, Feature subset selection using the wrapper model: Overfitting and dynamic search space topology, in Proceedings of the 1st KDD Conference, (1995).
[9] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, Building a large annotated corpus of English: The Penn Treebank, Computational Linguistics, 19, (1993).
[10] Alessandro Moschitti, A study on convolution kernels for shallow semantic parsing, in Proceedings of ACL 2004, Barcelona, Spain, (2004).
[11] Sameer Pradhan, Kadri Hacioglu, Valerie Krugler, Wayne Ward, James H. Martin, and Daniel Jurafsky, Support vector learning for semantic argument classification, Machine Learning Journal, (2005).
[12] Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James Martin, and Daniel Jurafsky, Semantic role labeling using different syntactic views, in Proceedings of ACL 2005, (2005).
[13] V. Punyakanok, D. Roth, and W. Yih, The necessity of syntactic parsing for semantic role labeling, in Proceedings of IJCAI 2005, (2005).
[14] Kristina Toutanova, Aria Haghighi, and Christopher Manning, Joint learning improves semantic role labeling, in Proceedings of ACL 2005, (2005).
[15] Kristina Toutanova, Penka Markova, and Christopher D. Manning, The leaf projection path view of parse trees: Exploring string kernels for HPSG parse selection, in Proceedings of EMNLP 2004, (2004).


More information

Professor Anita Wasilewska. Classification Lecture Notes

Professor Anita Wasilewska. Classification Lecture Notes Professor Anita Wasilewska Classification Lecture Notes Classification (Data Mining Book Chapters 5 and 7) PART ONE: Supervised learning and Classification Data format: training and test data Concept,

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

EvalIta 2011: the Frame Labeling over Italian Texts Task

EvalIta 2011: the Frame Labeling over Italian Texts Task EvalIta 2011: the Frame Labeling over Italian Texts Task Roberto Basili, Diego De Cao, Alessandro Lenci, Alessandro Moschitti, and Giulia Venturi University of Roma, Tor Vergata, Italy {basili,decao}@info.uniroma2.it

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

DEPENDENCY PARSING JOAKIM NIVRE

DEPENDENCY PARSING JOAKIM NIVRE DEPENDENCY PARSING JOAKIM NIVRE Contents 1. Dependency Trees 1 2. Arc-Factored Models 3 3. Online Learning 3 4. Eisner s Algorithm 4 5. Spanning Tree Parsing 6 References 7 A dependency parser analyzes

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms ESSLLI 2015 Barcelona, Spain http://ufal.mff.cuni.cz/esslli2015 Barbora Hladká hladka@ufal.mff.cuni.cz

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

Parsing Software Requirements with an Ontology-based Semantic Role Labeler Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh mroth@inf.ed.ac.uk Ewan Klein University of Edinburgh ewan@inf.ed.ac.uk Abstract Software

More information

Paraphrasing controlled English texts

Paraphrasing controlled English texts Paraphrasing controlled English texts Kaarel Kaljurand Institute of Computational Linguistics, University of Zurich kaljurand@gmail.com Abstract. We discuss paraphrasing controlled English texts, by defining

More information

Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Transition-Based Dependency Parsing with Long Distance Collocations

Transition-Based Dependency Parsing with Long Distance Collocations Transition-Based Dependency Parsing with Long Distance Collocations Chenxi Zhu, Xipeng Qiu (B), and Xuanjing Huang Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science,

More information

Towards Automatic Animated Storyboarding

Towards Automatic Animated Storyboarding Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Towards Automatic Animated Storyboarding Patrick Ye and Timothy Baldwin Computer Science and Software Engineering NICTA

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Chapter 17. Orthogonal Matrices and Symmetries of Space

Chapter 17. Orthogonal Matrices and Symmetries of Space Chapter 17. Orthogonal Matrices and Symmetries of Space Take a random matrix, say 1 3 A = 4 5 6, 7 8 9 and compare the lengths of e 1 and Ae 1. The vector e 1 has length 1, while Ae 1 = (1, 4, 7) has length

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Ming-Wei Chang. Machine learning and its applications to natural language processing, information retrieval and data mining.

Ming-Wei Chang. Machine learning and its applications to natural language processing, information retrieval and data mining. Ming-Wei Chang 201 N Goodwin Ave, Department of Computer Science University of Illinois at Urbana-Champaign, Urbana, IL 61801 +1 (917) 345-6125 mchang21@uiuc.edu http://flake.cs.uiuc.edu/~mchang21 Research

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

D2.4: Two trained semantic decoders for the Appointment Scheduling task

D2.4: Two trained semantic decoders for the Appointment Scheduling task D2.4: Two trained semantic decoders for the Appointment Scheduling task James Henderson, François Mairesse, Lonneke van der Plas, Paola Merlo Distribution: Public CLASSiC Computational Learning in Adaptive

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Sheila Garfield and Stefan Wermter University of Sunderland, School of Computing and

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Draft Martin Doerr ICS-FORTH, Heraklion, Crete Oct 4, 2001

Draft Martin Doerr ICS-FORTH, Heraklion, Crete Oct 4, 2001 A comparison of the OpenGIS TM Abstract Specification with the CIDOC CRM 3.2 Draft Martin Doerr ICS-FORTH, Heraklion, Crete Oct 4, 2001 1 Introduction This Mapping has the purpose to identify, if the OpenGIS

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

ETL Ensembles for Chunking, NER and SRL

ETL Ensembles for Chunking, NER and SRL ETL Ensembles for Chunking, NER and SRL Cícero N. dos Santos 1, Ruy L. Milidiú 2, Carlos E. M. Crestana 2, and Eraldo R. Fernandes 2,3 1 Mestrado em Informática Aplicada MIA Universidade de Fortaleza UNIFOR

More information

A Non-Linear Schema Theorem for Genetic Algorithms

A Non-Linear Schema Theorem for Genetic Algorithms A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland

More information

Depth-of-Knowledge Levels for Four Content Areas Norman L. Webb March 28, 2002. Reading (based on Wixson, 1999)

Depth-of-Knowledge Levels for Four Content Areas Norman L. Webb March 28, 2002. Reading (based on Wixson, 1999) Depth-of-Knowledge Levels for Four Content Areas Norman L. Webb March 28, 2002 Language Arts Levels of Depth of Knowledge Interpreting and assigning depth-of-knowledge levels to both objectives within

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

The Role of Sentence Structure in Recognizing Textual Entailment

The Role of Sentence Structure in Recognizing Textual Entailment Blake,C. (In Press) The Role of Sentence Structure in Recognizing Textual Entailment. ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, Prague, Czech Republic. The Role of Sentence Structure

More information

Author Gender Identification of English Novels

Author Gender Identification of English Novels Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

GRAPH THEORY LECTURE 4: TREES

GRAPH THEORY LECTURE 4: TREES GRAPH THEORY LECTURE 4: TREES Abstract. 3.1 presents some standard characterizations and properties of trees. 3.2 presents several different types of trees. 3.7 develops a counting method based on a bijection

More information

Introduction to Learning & Decision Trees

Introduction to Learning & Decision Trees Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing

More information

Testing LTL Formula Translation into Büchi Automata

Testing LTL Formula Translation into Büchi Automata Testing LTL Formula Translation into Büchi Automata Heikki Tauriainen and Keijo Heljanko Helsinki University of Technology, Laboratory for Theoretical Computer Science, P. O. Box 5400, FIN-02015 HUT, Finland

More information

Comparing Methods to Identify Defect Reports in a Change Management Database

Comparing Methods to Identify Defect Reports in a Change Management Database Comparing Methods to Identify Defect Reports in a Change Management Database Elaine J. Weyuker, Thomas J. Ostrand AT&T Labs - Research 180 Park Avenue Florham Park, NJ 07932 (weyuker,ostrand)@research.att.com

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Shallow Parsing with Apache UIMA

Shallow Parsing with Apache UIMA Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland graham.wilcock@helsinki.fi Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic

More information

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability Classification of Fingerprints Sarat C. Dass Department of Statistics & Probability Fingerprint Classification Fingerprint classification is a coarse level partitioning of a fingerprint database into smaller

More information

How to Win at the Track

How to Win at the Track How to Win at the Track Cary Kempston cdjk@cs.stanford.edu Friday, December 14, 2007 1 Introduction Gambling on horse races is done according to a pari-mutuel betting system. All of the money is pooled,

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de

More information

Symbol Tables. Introduction

Symbol Tables. Introduction Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

More information

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions. Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Attribution. Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley)

Attribution. Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley) Machine Learning 1 Attribution Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley) 2 Outline Inductive learning Decision

More information

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

More information

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD. Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Structured Models for Fine-to-Coarse Sentiment Analysis

Structured Models for Fine-to-Coarse Sentiment Analysis Structured Models for Fine-to-Coarse Sentiment Analysis Ryan McDonald Kerry Hannan Tyler Neylon Mike Wells Jeff Reynar Google, Inc. 76 Ninth Avenue New York, NY 10011 Contact email: ryanmcd@google.com

More information

SZTE-NLP: Aspect Level Opinion Mining Exploiting Syntactic Cues

SZTE-NLP: Aspect Level Opinion Mining Exploiting Syntactic Cues ZTE-NLP: Aspect Level Opinion Mining Exploiting yntactic Cues Viktor Hangya 1, Gábor Berend 1, István Varga 2, Richárd Farkas 1 1 University of zeged Department of Informatics {hangyav,berendg,rfarkas}@inf.u-szeged.hu

More information

Question Prediction Language Model

Question Prediction Language Model Proceedings of the Australasian Language Technology Workshop 2007, pages 92-99 Question Prediction Language Model Luiz Augusto Pizzato and Diego Mollá Centre for Language Technology Macquarie University

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

HELP DESK SYSTEMS. Using CaseBased Reasoning

HELP DESK SYSTEMS. Using CaseBased Reasoning HELP DESK SYSTEMS Using CaseBased Reasoning Topics Covered Today What is Help-Desk? Components of HelpDesk Systems Types Of HelpDesk Systems Used Need for CBR in HelpDesk Systems GE Helpdesk using ReMind

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

!!!#$$%&'()*+$(,%!#$%$&'()*%(+,'-*&./#-$&'(-&(0*.$#-$1(2&.3$'45 !"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

More information

Learning Translation Rules from Bilingual English Filipino Corpus

Learning Translation Rules from Bilingual English Filipino Corpus Proceedings of PACLIC 19, the 19 th Asia-Pacific Conference on Language, Information and Computation. Learning Translation s from Bilingual English Filipino Corpus Michelle Wendy Tan, Raymond Joseph Ang,

More information

Interpreting areading Scaled Scores for Instruction

Interpreting areading Scaled Scores for Instruction Interpreting areading Scaled Scores for Instruction Individual scaled scores do not have natural meaning associated to them. The descriptions below provide information for how each scaled score range should

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Bounded Treewidth in Knowledge Representation and Reasoning 1

Bounded Treewidth in Knowledge Representation and Reasoning 1 Bounded Treewidth in Knowledge Representation and Reasoning 1 Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien Luminy, October 2010 1 Joint work with G.

More information

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

TREC 2003 Question Answering Track at CAS-ICT

TREC 2003 Question Answering Track at CAS-ICT TREC 2003 Question Answering Track at CAS-ICT Yi Chang, Hongbo Xu, Shuo Bai Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China changyi@software.ict.ac.cn http://www.ict.ac.cn/

More information

Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

More information

Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability

Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability Ana-Maria Popescu Alex Armanasu Oren Etzioni University of Washington David Ko {amp, alexarm, etzioni,

More information

Neovision2 Performance Evaluation Protocol

Neovision2 Performance Evaluation Protocol Neovision2 Performance Evaluation Protocol Version 3.0 4/16/2012 Public Release Prepared by Rajmadhan Ekambaram rajmadhan@mail.usf.edu Dmitry Goldgof, Ph.D. goldgof@cse.usf.edu Rangachar Kasturi, Ph.D.

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Machine Learning. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34

Machine Learning. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34 Machine Learning Javier Béjar cbea LSI - FIB Term 2012/2013 Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34 Outline 1 Introduction to Inductive learning 2 Search and inductive learning

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information