A Machine Learning Approach to Convert CCGbank to Penn Treebank


Xiaotian Zhang (1,2), Hai Zhao (1,2), Cong Hui (1,2)
(1) MOE-Microsoft Key Laboratory of Intelligent Computing and Intelligent System; (2) Department of Computer Science and Engineering, #800 Dongchuan Road, Shanghai, China, 200240
xtianzh@gmail.com, zhaohai@cs.sjtu.edu.cn, huicong88126@gmail.com

Abstract

Conversion between different grammar frameworks is of great importance for comparative performance analysis of the parsers developed on them and for discovering the essential nature of languages. This paper presents an approach that converts Combinatory Categorial Grammar (CCG) derivations to Penn Treebank (PTB) trees using a maximum entropy model. Compared with previous work, the presented technique makes the conversion practical by eliminating the need to develop mapping rules manually, and it achieves state-of-the-art results.

Keywords: CCG, grammar conversion.

This work was partially supported by the National Natural Science Foundation of China (Grants No. 60903119 and No. 61170114), the National Research Foundation for the Doctoral Program of Higher Education of China (Grant No. 20110073120022), the National Basic Research Program of China (Grant No. 2009CB320901), and the European Union Seventh Framework Program (Grant No. 247619).

Proceedings of COLING 2012: Demonstration Papers, pages 535-542, Mumbai, December 2012.

1 Introduction

Much has been done in cross-framework performance analysis of parsers, and nearly all such analysis applies a converter across the different grammar frameworks. Matsuzaki and Tsujii (2008) compared a Head-driven Phrase Structure Grammar (HPSG) parser with several Context Free Grammar (CFG) parsers by converting the parsing results to a shallow CFG analysis with an automatic tree converter based on Stochastic Synchronous Tree-Substitution Grammar. Clark and Curran (2007) converted the dependencies of their CCG parser into those of DepBank by developing mapping rules via manual inspection, so as to compare their CCG parser with the RASP parser. Performing such a conversion has proven to be a time-consuming and non-trivial task (Clark and Curran, 2007).

Although CCGbank (Hockenmaier and Steedman, 2007) is a translation of the Penn Treebank (Marcus et al., 1993) into a corpus of Combinatory Categorial Grammar derivations, little work has been done on converting CCG derivations back to PTB trees besides Clark and Curran (2009) and Kummerfeld et al. (2012). Clark and Curran (2009) associated conversion rules with each local tree and developed 32 unary and 776 binary rule instances by manual inspection; considerable time and effort were spent on the creation of these schemas. They show that although CCGbank is derived from the PTB, the conversion from CCG back to PTB trees is far from trivial, owing to the non-isomorphism of the tree structures and the non-correspondence of tree labels. In the work of Kummerfeld et al. (2012), although no ad-hoc rules over non-local features were required, a set of instructions and operations, as well as some special cases, still had to be defined.

Nevertheless, is it possible to convert CCG derivations to PTB trees with a statistical machine learning method instead of creating conversion grammars or defining instructions, and thus avoid the difficulty of developing such rules? Is it possible to restore the tree structure by considering only the changes applied to the structure in the original generating process? Our proposed approach answers yes to both questions.

2 PTB2CCG and the Inverse

Let us first look at how CCG is derived from the PTB. The basic procedure for converting a PTB tree to CCG, described by Hockenmaier and Steedman (2007), consists of four steps:

1. Determine constituent types;
2. Make the tree binary;
3. Assign categories;
4. Assign dependencies.

Clearly, only step 2 changes the tree structure: it adds dummy nodes to make the tree binary, that is, to make every node have only one or two children. Take the short sentence "Bond prices were up" in Figures 1(a) and 1(b) for example. As a preprocessing step, we combine every internal node that has only one child with that single child, labeling the new node with the concatenation of the original labels joined by "#"; this merges the shaded nodes in Figure 1 into the corresponding new nodes. After the preprocessing, the node with a bold border in Figure 1(a) can be identified as dummy, because the CCG structure becomes identical to the PTB structure once that node is deleted and its children are attached directly to its parent.
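A minimal sketch of this preprocessing step in Python, assuming a bare-bones Node type (the released ccg2ptb software is a separate implementation and may represent trees differently):

    # Collapse unary chains: combine every internal node that has a single
    # (non-leaf) child with that child, joining the labels with "#", so
    # that, e.g., ADVP dominating only RB becomes one node "ADVP#RB".

    class Node:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []   # empty list marks a word leaf

    def collapse_unary(node):
        node.children = [collapse_unary(c) for c in node.children]
        if len(node.children) == 1 and node.children[0].children:
            child = node.children[0]
            node.label = node.label + "#" + child.label
            node.children = child.children
        return node

Because children are collapsed before their parents, a chain of three or more unary nodes ends up as a single node whose label concatenates all the original labels.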

[Figure 1: CCG and corresponding PTB examples of Category A (panels a and b, for the sentence "Bond prices were up") and Category B (panels c and d, involving "CALL IT un-advertising"). The grey shading marks the preprocessing that combines every internal node having only one child with that single child; the node with a bold border in Figure 1(a) is dummy.]

[Figure 2: PTB trees output by Classifier 1 (a) and Classifier 2 (b) when testing on the CCG example in Figure 1(a). Classifier 2 corrects the output of Classifier 1 by adding ADVP, the node with a bold border, to dominate the leaf RB.]

Reality, however, is more complicated. There exist CCG tree structures that cannot be derived by just adding dummy nodes to the corresponding PTB trees, as shown in Figures 1(c) and 1(d), because a more sophisticated algorithm is actually used in the conversion from the PTB to CCG. After investigation, all the cases can be grouped into two categories:

Category A: The CCG tree structure can be derived from the PTB tree by adding dummy nodes after the preprocessing step, as in Figures 1(a) and 1(b).

Category B: The CCG tree structure cannot be derived from the PTB tree by adding dummy nodes even after the preprocessing, as in Figures 1(c) and 1(d). (According to our inspection, these cases always occur in certain language structures, such as complex objects, "of" phrases, and so on.)

The cases of Category B make up less than 10% of the whole bank, so we simplify the problem by focusing only on the cases of Category A.

From the above discussion it follows that, after the preprocessing, one classifier is needed to map the CCG nodes onto target classes that are either PTB labels or dummy. Preliminary experiments show, however, that such a model still cannot handle well the cases in which a parent node occurs with a single child leaf in the PTB tree. For example, the S[adj]\NP node in the gold CCG tree (Figure 1(a)) tends to be classified as an RB node in the predicted PTB tree (Figure 2(a)) instead of as the gold PTB label ADVP#RB (Figure 1(b)). We therefore use a second classifier to predict whether each leaf node in the PTB tree produced by the first classifier has a parent dominating only itself, and what the label of that parent is. This added classifier identifies the ADVP node (Figure 2(b)) in the above example, and experiments show that it increases the overall F-measure by up to 2%.

3 The Approach

First, the training and test data are preprocessed as described in Section 2. Classifiers are then built on a maximum entropy model; we use the implementation provided by Apache OpenNLP (http://incubator.apache.org/opennlp/index.html). For training, the dummy nodes and the corresponding PTB labels are identified by comparing the post-order traversal sequences of each pair of CCG and PTB trees; Classifier 1 is trained on these decisions. Classifier 2 is then trained on the output of Classifier 1 over the training set, paired with the corresponding gold PTB trees.

Testing proceeds in two steps. First, Classifier 1 classifies the CCG labels into PTB labels and dummy: the CCG tree is traversed bottom-up, and a node classified as dummy is deleted with its children attached to its parent, while otherwise its CCG label is replaced with the predicted PTB label. This yields an intermediate PTB tree. Second, Classifier 2 examines each leaf of the intermediate tree and predicts whether a parent node should be added to dominate it. The feature sets for the two classifiers are listed in Table 1, and a sketch of the test-time procedure follows below.
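A minimal sketch of this test-time transformation, reusing the Node type from the earlier preprocessing sketch; predict_label stands in for the trained maximum entropy Classifier 1 and is assumed to return either a PTB label or the string "dummy":

    def to_ptb(node, predict_label):
        """Convert a (preprocessed) CCG node bottom-up. Returns a list:
        the converted node itself, or its converted children if the node
        is predicted as dummy (deleted, children spliced into the parent)."""
        if not node.children:                    # word leaf: keep as-is
            return [node]
        converted = []
        for child in node.children:              # children first (bottom-up)
            converted.extend(to_ptb(child, predict_label))
        node.children = converted
        label = predict_label(node)
        if label == "dummy":
            return node.children                 # delete node, promote children
        node.label = label                       # rename CCG label to PTB label
        return [node]

    # Usage (assuming the root itself is never predicted dummy):
    # intermediate_ptb = to_ptb(ccg_root, predict_label)[0]
    # Classifier 2 then visits each leaf of intermediate_ptb and may add
    # a parent node (e.g., ADVP over a bare RB) dominating only that leaf.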

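To make the features in Table 1 below concrete, here is one plausible encoding of a handful of Classifier 1 features as a string-valued dictionary; the dictionary format and the parent/index bookkeeping are illustrative assumptions, not the paper's actual OpenNLP event encoding:

    # One plausible encoding of a few Classifier 1 features from Table 1,
    # for a CCG node given its parent and its index among the siblings.

    def classifier1_features(node, parent, index):
        lsib = parent.children[index - 1] if parent and index > 0 else None
        rsib = (parent.children[index + 1]
                if parent and index + 1 < len(parent.children) else None)
        return {
            "Label": node.label,
            "ParentLabel": parent.label if parent else "ROOT",
            "IndexAmongChildren": str(index),
            "ParentChildrenCnt": str(len(parent.children)) if parent else "0",
            "LSiblingLabel": lsib.label if lsib else "NONE",
            "RSiblingLabel": rsib.label if rsib else "NONE",
            "Children": "+".join(c.label for c in node.children),
            "ChildCnt": str(len(node.children)),
        }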
Table 1: Features used by the two classifiers. Feature Set 1 describes the CCG node and its context: WordForm, Label, POS, ParentLabel, IndexAmongChildren, ParentChildrenCnt, LSiblingLabel, LSiblingWordForm, RSiblingLabel, RSiblingWordForm, Children, ChildCnt. Feature Set 2 describes the PTB node and its context: Label, ParentLabel, LSiblingLabel, LSiblingWordForm, RSiblingLabel, RSiblingWordForm, Children, LSiblingRightMostLabel, RSiblingLeftMostLabel, SecondRSiblingLabel, SecondLSiblingLabel, POS, POS-1. (LSibling and RSibling denote the left and right sibling; POS-1 denotes the POS tag of the previous word.)

4 Experiments

We evaluate the proposed approach on CCGbank and the PTB. As in Clark and Curran (2009), we use Sections 01-22 as the training set, Section 00 as the development set for tuning parameters, and Section 23 as the test set. (Since the converter is trained on the gold treebanks, it has no access to parsing output; the conversion process is therefore general and not specific to parser errors.) Experiments are run separately with gold POS tags and with automatic POS tags predicted by the Lapos tagger (Tsuruoka et al., 2011).

First, we leave Category B aside and test on all the sentences of Category A from Section 23. Table 2 shows our conversion accuracy: bracketing precision, recall, and F-score computed with the EVALB evaluation script (http://nlp.cs.nyu.edu/evalb/).

    POS tags   P      R      F
    Gold       96.99  95.29  96.14
    Lapos      96.82  95.12  95.96

Table 2: Evaluation on all the sentences of Category A in Section 23.

In order to compare with Clark and Curran (2009), the approach is also tested on all of Sections 00 and 23; the oracle conversion results are presented in Table 3. The F-measure of our method is about one point higher than that of Clark and Curran (2009), whichever kind of POS tags is applied. (Clark and Curran (2009) did not report whether gold or predicted POS tags were used; we have tried different POS taggers and the results stay more or less stable.)

[Figure 3: The conversion F-measure of the top ten major kinds of phrases. Plot omitted; the phrase types recoverable from the axis include PP, VP, SBAR, ADVP, ADJP, QP, WH, and PRT, with F-measure plotted from 0 to 1.]

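The bracketing scores above and in Table 3 below are produced by EVALB; as a rough illustration of what labeled-bracket scoring computes, here is a simplified sketch over the Node type from the earlier sketches (the official script additionally excludes preterminal brackets and punctuation and handles label equivalences):

    from collections import Counter

    def labeled_brackets(node, start=0, spans=None):
        """Collect (label, start, end) spans for internal nodes; returns
        the index one past this subtree's last word."""
        if spans is None:
            spans = []
        if not node.children:                 # a word leaf spans one position
            return start + 1, spans
        end = start
        for child in node.children:
            end, _ = labeled_brackets(child, end, spans)
        spans.append((node.label, start, end))
        return end, spans

    def bracket_prf(gold_tree, test_tree):
        gold = Counter(labeled_brackets(gold_tree)[1])
        test = Counter(labeled_brackets(test_tree)[1])
        matched = sum((gold & test).values())  # multiset intersection
        p = matched / sum(test.values())
        r = matched / sum(gold.values())
        return p, r, 2 * p * r / (p + r)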
    Section                    P      R      F
    Ours (gold)
      00, all                  96.92  94.82  95.86
      00, len ≤ 40             97.09  95.40  96.24
      23, all                  96.67  94.77  95.71
      23, len ≤ 40             96.69  94.79  95.73
    Ours (Lapos)
      00, all                  96.85  94.74  95.79
      00, len ≤ 40             97.03  95.34  96.18
      23, all                  96.49  94.64  95.56
      23, len ≤ 40             96.51  94.67  95.58
    Clark and Curran (2009)
      00, all                  93.37  95.15  94.25
      00, len ≤ 40             94.11  95.65  94.88
      23, all                  93.68  95.13  94.40
      23, len ≤ 40             93.75  95.23  94.48

Table 3: Evaluation on all of Sections 00 and 23.

To investigate the effectiveness of the features, we divide them into four categories (children, parent, and sibling features, plus simple features: label, POS, and wordform) and test with feature sets from which one category is removed at a time; due to space limitations, we report the importance of each category of features rather than of each individual feature. Table 4 shows that every kind of feature has an effect, and that the sibling-related features are the most effective.

        -Simple  -Children  -Parent  -Sibling
    F   93.8     91.28      95.18    89.02

Table 4: Results on Section 23 using gold POS tags, excluding one category of features each time.

In error analysis, the conversion F-measure of each kind of phrase is examined (see Figure 3). Among the top ten major kinds of phrases, seven score over 90% and two around 80%, while the worst is 44.8%. The high accuracy on most major kinds of phrases yields the good overall performance of our approach, and there is still room for improvement in finding more effective features to classify QP phrases better.

5 Conclusions

We have proposed a practical machine learning approach that converts CCG derivations to PTB trees efficiently and achieves an F-measure over 95%; the software has been released at https://sourceforge.net/p/ccg2ptb/home/home/. The core of the approach is a maximum entropy model. Compared with conversion methods that rely on mapping rules, applying such a statistical machine learning method saves considerable time and effort.

References

Clark, S. and Curran, J. R. (2007). Formalism-independent parser evaluation with CCG and DepBank. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 248-255, Prague, Czech Republic. Association for Computational Linguistics.

Clark, S. and Curran, J. R. (2009). Comparing the accuracy of CCG and Penn Treebank parsers. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 53-56, Suntec, Singapore. Association for Computational Linguistics.

Hockenmaier, J. and Steedman, M. (2007). CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33:355-396.

Kummerfeld, J. K., Klein, D., and Curran, J. R. (2012). Robust conversion of CCG derivations to phrase structure trees. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 105-109, Jeju Island, Korea. Association for Computational Linguistics.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313-330.

Matsuzaki, T. and Tsujii, J. (2008). Comparative parser performance analysis across grammar frameworks through automatic tree conversion using synchronous grammars. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 545-552, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tsuruoka, Y., Miyao, Y., and Kazama, J. (2011). Learning with lookahead: can history-based models rival globally optimized models? In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL '11, pages 238-246, Stroudsburg, PA, USA. Association for Computational Linguistics.