Sentence Simplification and Automatic Syntactic Analysis


Evaluation of Sentence Simplification by Machine Learning and Handcrafted Rules

ATraNoS WP4-15

Erik F. Tjong Kim Sang
CNTS - Language Technology Group
University of Antwerp
erikt@uia.ua.ac.be

Abstract

This progress report describes the Antwerp approach to sentence simplification in the framework of the ATraNoS project, with a focus on the evaluation of the sentence simplification systems we have built. We describe earlier work in sentence simplification and explain our two approaches to sentence reduction: one is based on machine learning while the other relies on handcrafted deletion rules. We compare the performances of the systems and then discuss the applied evaluation method [1].

[1] This report is part of a paper submitted for the proceedings of CLIN-2003 (Tjong Kim Sang et al., 2004).

1 Introduction

There exists a large body of work on text summarization, but the largest part of it concerns extraction approaches in which a subset of the sentences in a text is proposed as a summary of the text. Most of the work on sentence simplification (also called sentence reduction, sentence condensation and sentence compression) relies on an automatic syntactic analysis, after which handcrafted rules, sometimes with learned probabilities, remove optional branches in the syntactic trees. For evaluation, most systems rely on human judges who examine the grammaticality and the information content of the reduced sentence, although some authors also propose automatic evaluation methods which are shown to correlate with the decisions made by the judges.

Knight and Marcu (2000) compare two machine learners, a noisy channel model and decision trees, on compressing sentences by learning to map syntactic trees to other trees. After training on a corpus of about 1000 sentences, both systems performed significantly better than a baseline based on word bigram occurrences, according to a blind evaluation by human judges.

Hori et al. (2002) compute various word and word relation scores, and build a compressed spoken sentence by finding an optimal set of words with dynamic programming. They show that this approach performs better than randomly choosing words from the original sentence, according to an automatic evaluation.

Most other work on sentence condensation involves (shallow) parsing, sometimes with additional lexical and morphological tools, together with rule-based reduction, occasionally with statistical scores like tf-idf (Carroll et al., 1998; Chandrasekar and Srinivas, 1997; Grefenstette, 1998; Jing, 2000; Kiyota and Kurohashi, 2001; Riezler et al., 2003). An interesting approach to sentence compression is paraphrasing, that is, replacing phrases by shorter phrases with the same meaning. Shinyama et al. (2002) outline how paraphrases can be derived automatically from related documents. The application of sentence simplification for automatically generating subtitles for TV programs is mentioned in Robert-Ribes et al. (1999), but to our knowledge there has been no practical study linking the two.

    for each pair of transcriptions and subtitles
        remove punctuation and capitalization
        link every word to copies and synonyms
        remove duplicate links based on longest chains
        remove remaining crossing links
        link multi-word synonyms
        link neighboring unlinked words

Table 1: High-level pseudo-code of the word alignment algorithm

2 Data

We have defined sentence simplification as a word classification task. In order to convert a transcribed sentence to a subtitle, four basic word actions can be taken: word copy, word deletion, word replacement and word insertion. In comparison with a subtitle, the words from the transcribed sentence can be copied, deleted or replaced by another word. Furthermore, words that do not appear in the transcription can be inserted in the subtitle. Since the subtitles are often quite similar to the transcription, the word copy action is the most frequent action. Here is an example from the corpus:

    de politici vinden de euro natuurlijk/delete een goeie/goede zaak

In this sentence, two words need to be changed.
The adverb natuurlijk needs to be deleted and the adjective goeie needs to be replaced by goede. In order to find out which word actions are required to transform a transcription into its subtitle, the sentence-level alignment of the parallel ATraNoS corpus (WP4-03) is insufficient. We need alignment at word level. Automatically retrieving word-to-word links is more difficult than obtaining sentence alignments. Subtitle words may occur in a different position than in the transcription, and this makes it hard to match corresponding words. Generating a rough alignment and correcting it manually requires too much work.
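
As an illustration of how word actions follow from a word-level alignment, here is a minimal Python sketch. It is not the alignment algorithm of Table 1: it swaps in the generic sequence matcher from Python's standard `difflib` module, so it handles neither synonyms nor reordered words. The function name `word_actions` and the triple format are assumptions made for this example.

```python
from difflib import SequenceMatcher


def word_actions(transcript, subtitle):
    """Return (source word, action, target word) triples.

    Simplifying assumption: words are aligned with difflib's generic
    matcher, not with the synonym-aware algorithm of Table 1.
    """
    src = transcript.lower().split()
    tgt = subtitle.lower().split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    actions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            actions += [(w, "copy", w) for w in src[i1:i2]]
        elif tag == "delete":
            actions += [(w, "delete", None) for w in src[i1:i2]]
        elif tag == "insert":
            actions += [(None, "insert", w) for w in tgt[j1:j2]]
        else:  # "replace": pair words one by one, then handle leftovers
            old, new = src[i1:i2], tgt[j1:j2]
            paired = min(len(old), len(new))
            actions += [(o, "replace", n) for o, n in zip(old, new)]
            actions += [(o, "delete", None) for o in old[paired:]]
            actions += [(None, "insert", n) for n in new[paired:]]
    return actions


if __name__ == "__main__":
    for triple in word_actions(
        "de politici vinden de euro natuurlijk een goeie zaak",
        "de politici vinden de euro een goede zaak",
    ):
        print(triple)
```

For the corpus example this yields one deletion (natuurlijk), one replacement (goeie by goede) and seven copies. Sentences in which words change position would need the link-based procedure of Table 1 instead.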

                sentences     words    copied   deleted   replaced      CR
    train           9,876   125,519    93,506    22,986    9,[...]   [...]%
    development     1,345    15,577    11,660     2,767    1,[...]   [...]%
    test            1,314    15,605    11,737     2,836    1,[...]   [...]%

Table 2: Size of the three parts of the selected VRT news broadcast data, measured in number of sentences and number of words. The final four columns show the relation between the transcriptions and the subtitles: the number of copied words (75%), the number of deleted words (18%), the number of replaced words (7%) and the average character compression rate (CR): the percentage of remaining characters in the subtitles.

In order to get some useful material, we have performed an automatic word alignment and discarded all sentences for which the alignment could be suspected to have been unsuccessful. Word alignment started with removing all punctuation signs and converting capital characters to lower case, thus simulating the future input of automatically transcribed spoken text. This cleanup action also made word alignment easier because there were inconsistencies in punctuation and capitalization between the transcriptions and the subtitles. An overview of the other actions performed by the word alignment algorithm can be found in Table 1.

After performing the word alignment, we have selected those sentence pairs from the news corpus in which the two sentences shared at least half of their words or differed in at most three words. This restriction was chosen in order to avoid including pairs in the training data that were difficult to learn from. Pairs of identical (copied) sentences have also been excluded from the material since they are uninteresting from a sentence simplification point of view. Of the parallel corpus, we have only used texts of the VRT news part because we wanted to create training material with a more-or-less consistent content. The data selection method has resulted in a collection of 12,535 sentences (156,701 words).
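
The cleanup step and the character compression rate reported in Table 2 can be sketched in a few lines of Python. Two details are assumptions of this sketch rather than facts from the report: the exact set of punctuation signs that is removed (`PUNCTUATION` below is a guess) and the choice to count the spaces between words as characters.

```python
# Assumed punctuation set; the report does not list the signs it removes.
PUNCTUATION = ".,;:!?\"'()"


def normalize(sentence):
    """Lower-case a sentence and strip the assumed punctuation signs."""
    table = str.maketrans("", "", PUNCTUATION)
    return " ".join(sentence.lower().translate(table).split())


def compression_rate(transcript, subtitle):
    """Fraction of characters remaining in the subtitle (spaces included)."""
    source = normalize(transcript)
    if not source:
        raise ValueError("empty transcript")
    return len(normalize(subtitle)) / len(source)


if __name__ == "__main__":
    # Corpus example; capitalization and full stop added for illustration.
    rate = compression_rate(
        "De politici vinden de euro natuurlijk een goeie zaak.",
        "De politici vinden de euro een goede zaak.",
    )
    print(f"compression rate: {rate:.1%}")
```

A rate below 100% means the subtitle is shorter than the transcription; averaging this value over all sentence pairs of a data set gives a figure comparable to the CR column.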
We have divided this collection into a training part (9,876 sentences, 125,519 words), a development part (1,345 sentences, 15,577 words) and a test part (1,314 sentences, 15,605 words, see Table 2). Each word in the three data sets was specified by 48 features: six features (word, lemma, part-of-speech tag, chunk tag, relation tag and person name tag) for the word itself and for each of its six nearest neighbors (7 x 6 = 42 features), as well as the classes of the six nearest neighbors (estimated for the development and test data). Person name tags have been generated by a basic named entity tagger. For each sentence we have computed the compression rate: the number of characters in the subtitle divided by the number of characters in the transcript. This compression rate will be a target for the machine learners.

3 Methods

We have applied four different methods to the data. The first is a baseline system which simply generates the most frequent output class for each word. The second is the memory-based tagger MBT (Daelemans et al., 2003a). The third is the memory-based learner TiMBL (Daelemans et al., 2003b) and the fourth is a list of hand-crafted deletion rules. In this section we describe the third and fourth method in more detail.

Memory-based learning involves storing training data items and assigning to each test item the class of the training items that are most similar to it. There are different memory-based learning algorithms and different ways of computing item similarity. We have used the default settings of TiMBL: the ib1 algorithm with gain ratio feature weighting (Daelemans et al., 2003b). TiMBL has an important advantage over MBT: it allows data items with more than one feature while MBT is restricted to items with one feature.

We have applied three extensions to the memory-based learner. The first is feature selection, which we have implemented with bidirectional hill-climbing (Caruana and Freitag, 1994). It starts with evaluating each individual feature and continues with evaluating all pairs containing the best individual feature. This process is continued until adding a single feature to the current set or removing one from it no longer results in an increased performance. Feature selection is necessary because the chosen machine learner is sensitive to feature choice: it might perform better with a subset of the features than with the complete set.

The second extension to the memory-based learner was classifier stacking. We want to use the classes of the neighbor words as features for the current word. In order to obtain these, we ran preliminary experiments and used the generated output classes as context class features in the next experiments.

The third extension was the use of trigram output classes (Van den Bosch et al., 2004). Rather than predicting only the action for the current word, we predict the previous, current and next action. This allows the neighboring words to take part in deciding what action to take for the current word, and this improves performance.
The decision on which action to perform is made by a majority vote in which ties are resolved by choosing the action predicted by the focus word.

The approach of the baseline and the two machine learners is to find the best action class for each word in one step. The sentence simplification method based on hand-crafted deletion rules processes the data in two steps. First, it finds all words and phrases in the data that are candidates for deletion, and then it selects phrases for deletion until the required compression rate is met. We believe that the first step could have been performed by a learner as well, but until now we have not been able to define a learning task for deleting all optional material from a sentence (note that we only have training examples in which a small part is deleted).

We have defined different deletion rules, among others for removing adverbs, adjectives, first names, prepositional phrases, phrases between commas or brackets, relative clauses, numbers and time phrases. In order to avoid tuning the rules on the test data, we have written them based on other news text (the Dutch NOS Teletext). As an example, we show a simplified version of the rule for adverb deletion:

    if (currentpos == BW and
        not currentword is sentencefinal and
        not previousverb in baseverbs and
        not currentword in negations)
    then mark currentword as deletion candidate
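
The two-step procedure can be sketched in Python as follows. The first function follows the simplified adverb rule; the second applies the selection order described with Table 3 (shortest phrases first, ties resolved in favor of phrases nearer the end of the sentence). Several details are assumptions made for this sketch and do not come from the report: the contents of `BASE_VERBS` and `NEGATIONS`, the tag "WW" for verbs, the reading of "previousverb" as the most recent verb to the left, the part-of-speech tags in the demo, and candidate phrases that do not overlap.

```python
from dataclasses import dataclass

# Hypothetical word lists: the report names these sets but does not list them.
BASE_VERBS = {"zijn", "worden", "blijven"}
NEGATIONS = {"niet", "nooit", "geen"}


@dataclass
class Token:
    word: str
    pos: str    # "BW" marks an adverb (from the rule); "WW" for verbs is assumed
    lemma: str


def adverb_candidates(tokens):
    """Step 1: indices of adverbs that pass the simplified deletion rule."""
    candidates = []
    previous_verb = None  # lemma of the most recent verb to the left
    for i, tok in enumerate(tokens):
        sentence_final = i == len(tokens) - 1
        if (tok.pos == "BW" and not sentence_final
                and previous_verb not in BASE_VERBS
                and tok.word not in NEGATIONS):
            candidates.append(i)
        if tok.pos == "WW":
            previous_verb = tok.lemma
    return candidates


def select_deletions(words, spans, target_rate):
    """Step 2: delete candidate spans until the compression rate is met.

    spans holds half-open (start, end) word ranges, assumed not to overlap.
    """
    original_length = len(" ".join(words))
    keep = [True] * len(words)
    # Shortest phrase first; equal lengths: the one closest to the end.
    for start, end in sorted(spans, key=lambda s: (s[1] - s[0], -s[0])):
        remaining = " ".join(w for w, k in zip(words, keep) if k)
        if len(remaining) / original_length <= target_rate:
            break
        for i in range(start, end):
            keep[i] = False
    return [w for w, k in zip(words, keep) if k]


if __name__ == "__main__":
    tagged = [("de", "LID", "de"), ("politici", "N", "politicus"),
              ("vinden", "WW", "vinden"), ("de", "LID", "de"),
              ("euro", "N", "euro"), ("natuurlijk", "BW", "natuurlijk"),
              ("een", "LID", "een"), ("goeie", "ADJ", "goed"),
              ("zaak", "N", "zaak")]
    tokens = [Token(*t) for t in tagged]
    spans = [(i, i + 1) for i in adverb_candidates(tokens)]
    words = [t.word for t in tokens]
    print(" ".join(select_deletions(words, spans, target_rate=0.818)))
```

Keeping candidate detection separate from selection means the same candidate set can serve different compression targets; the value 0.818 in the demo is the target quoted in the caption of Table 3 and is used here only for illustration.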

                   average CR   CR met   Precision   Recall   Fβ=1
    Baseline            94.9%    22.0%       51.0%    10.2%   17.0
    MBT                 90.3%    33.2%       38.9%    16.7%   23.4
    TiMBL               86.7%    38.2%       32.4%    20.2%   24.9
    Rules               69.4%    87.7%       26.0%    25.8%   25.9
    TiMBL+Rules         70.6%    81.1%       25.7%    28.5%   27.0

Table 3: Sentence reduction performances of the memory-based tagger MBT, the memory-based learner TiMBL, the hand-crafted deletion rules and a combination of TiMBL and the rules, measured on the test data set by the average percentage of remaining characters (compression rate, CR; target: 81.8%), the percentage of sentences for which this compression rate was met, and precision, recall and Fβ=1 computed for word deletions and word replacements. The baseline system predicts the most frequent action for each word.

After identifying all candidates for deletion, a selection of these phrases will be removed. The selection process starts with removing the shortest phrases, measured in number of words. If phrases have the same length, preference is given to phrases closer to the end of the sentence. The selection process continues until the required compression rate is met or until there are no more phrases left to delete.

4 Experiments and results

We have performed feature selection for both MBT and TiMBL while training with the available training data and testing with the development data. Words, lemmas, part-of-speech tags and chunk tags were selected for the memory-based learner, but the relation information, the person name information and the classes of the neighboring words were not found useful. Performance was measured in Fβ=1 rate, the harmonic mean of precision and recall, for deletion and replacement actions. The feature sets corresponding with the best performances, wwfwaa,hnfaa for MBT and w 2 2 l 1 1 p 2 3 c 0 for TiMBL, have been used for processing the test data. An overview of the performances on this data set can be found in Table 3. As can be seen from the low baseline Fβ=1 rate, this is a difficult task.
All approaches performed significantly better than the baseline with respect to Fβ=1 rates (p<0.01 according to a bootstrap sampling test (Noreen, 1989)). TiMBL was significantly better than MBT (p<0.05). The hand-crafted deletion rules did better than TiMBL, both with respect to the Fβ=1 rate (not significant, p≈0.15) and the percentage of sentences that met the required compression rate (significant, p<0.01). Note that, contrary to the deletion rules, neither MBT nor TiMBL had access to the required compression rates.

The fact that the hand-crafted rules outperformed the two learners is interesting since the rules perform only part of the task (word deletion) while the learners also predict word replacements. We have created a basic combined system that chooses the deletions predicted by the rules and the replacements predicted by TiMBL. This system obtained a slightly higher Fβ=1 rate than the handcrafted deletion rules (p≈0.10, Table 3). In comparison with the rules, the combined system did not compress as many sentences sufficiently. In some cases word replacement introduced longer words, and no extra post-processing has been performed to undo the effects of this.

5 Evaluation

We have chosen to evaluate the output of the learners by examining the deleted and replaced words and comparing them with a gold standard. This will not always be fair to the learner, as the following example shows:

    O: in afwachting komt er nieuw overleg en met de landbouw en met de milieubeschermers en in de regering
    G: in afwachting komt er overleg met landbouw en milieubeschermers en in de regering
    R: er komt overleg en met de landbouw en met de milieubeschermers en in de regering

The first sentence (O) is the original transcribed sentence, with the target words for deletion marked in italics, the second is the associated subtitle (G, gold standard) and the third is the output of the rule-based system (R). The system output is fine: it satisfies the required compression rate and it is grammatically sound. Yet only one of the three word deletions proposed by the system occurs in the subtitle (precision: 33%), and the system suggests only one of the five deletions in the subtitle (recall: 20%; Fβ=1: 25). According to the evaluation measure, the suggested simplification is not very good.

An excellent solution for this problem would be a manual evaluation of the system output. This evaluation can address different issues: whether the compression rate has been met, whether the simplified sentence still contains the same information as the original one, and whether the simplification process has resulted in a grammatical sentence. However, in the process of system development we generate many different analyses of the development data, and a repeated full manual evaluation of the results would be too costly.
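
The precision, recall and Fβ=1 figures quoted for the example sentence in this section can be reproduced with a short Python sketch. It treats each sentence as a bag of words and takes the deleted words to be the multiset difference between the original and the reduced sentence. This is a simplification assumed for the illustration: word positions and word replacements are ignored, whereas the scores in Table 3 also cover replacements.

```python
from collections import Counter


def deleted_words(original, reduced):
    """Multiset of words in the original that are missing from the reduction."""
    return Counter(original.split()) - Counter(reduced.split())


def deletion_scores(original, gold, system):
    """Precision, recall and F(beta=1) of the system's word deletions."""
    gold_del = deleted_words(original, gold)
    sys_del = deleted_words(original, system)
    correct = sum((gold_del & sys_del).values())
    n_sys, n_gold = sum(sys_del.values()), sum(gold_del.values())
    precision = correct / n_sys if n_sys else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    O = ("in afwachting komt er nieuw overleg en met de landbouw "
         "en met de milieubeschermers en in de regering")
    G = ("in afwachting komt er overleg met landbouw "
         "en milieubeschermers en in de regering")
    R = ("er komt overleg en met de landbouw "
         "en met de milieubeschermers en in de regering")
    p, r, f = deletion_scores(O, G, R)
    print(f"precision {p:.0%}, recall {r:.0%}, F {100 * f:.0f}")
```

On the example this prints a precision of 33%, a recall of 20% and an F rate of 25: the system deletes in, afwachting and nieuw, while the subtitle drops nieuw, en, met and de (twice), so only nieuw counts as correct.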
At this moment we are working on an intermediate evaluation process which uses several simplification targets for each source sentence (Papineni et al., 2002). This requires extra manual work when building the test corpus, but it will increase the probability of correctly assessing the performance of the learner.

For this particular report, we have performed a manual evaluation of the output of two systems: the general memory-based learner TiMBL and the handcrafted deletion rules. We examined the first 200 reduced sentences they produced for the test data. Results were scored on a three-level scale (1.0: good, 0.5: medium, 0.0: bad) for syntactic and semantic correctness. Averages for both scales have been computed. A problem with this type of evaluation is that a passive approach to sentence reduction will result in high performance rates with respect to the linguistic correctness of the output: by copying all words of the sentence we would get the original sentence, which is expected to be grammatically correct. Therefore we have registered average performance rates for all sentences as well as for sentences for which the required compression rate was met and for sentences for which this was not the case. An overview of the results can be found in Table 4.

                     all                  CR not met                CR met
              nbr     syn    sem      nbr     syn    sem      nbr     syn    sem
    TiMBL   [...]  [...]%    70%    [...]  [...]%    78%       70     54%    56%
    Rules   [...]  [...]%    64%       26     89%    81%    [...]  [...]%    61%

Table 4: Results of a manual evaluation of the output for the first 200 sentences of the test data. The output of the systems was evaluated with respect to syntactic correctness (syn) and semantic correctness (sem); nbr is the number of sentences. Separate overviews are given for all sentences, for sentences for which the required compression rate (CR) was not met and for sentences for which this rate was met.

When we compare the quality of the output sentences that met the required compression rate (final three columns of Table 4), we see that the handcrafted deletion rules produced better quality output, both with respect to syntactic correctness (p<0.01) and semantic correctness (p≈0.15). However, the obtained performance rates, around 65%, are not spectacular. The evaluation performed in this report presupposes applying the systems in a stand-alone set-up without human post-processing. A more reasonable evaluation would be to test the systems in combination with human post-processing and to measure the time gains which they provide in comparison with working without them.

6 Concluding remarks and future work

We have presented work on sentence simplification targeted at the automatic generation of Dutch TV subtitles. A large part of the work was based on automatic learning from examples.
From our parallel corpus, we have used the most interesting pairs of sentences: those that were different but shared enough words to enable the learner to successfully reduce them. We have applied a memory-based tagger (MBT), a general memory-based learner (TiMBL) and a set of hand-crafted deletion rules to this data. The general learner outperformed the tagger and the rules did even better. A combination of the general learner (word replacements) and the rules (word deletions) achieved the best score on the data. However, we have subsequently shown that our evaluation procedure is not always fair to the systems. A manual evaluation of the output of two systems revealed that the quality of the rule-based system output was better than that of the general memory-based system. The quality figures were not very high (around 65%), and an additional evaluation of processing time gains would be interesting.

The compression rates obtained by the two learners (Table 3, column CR met) show that it is not a good idea to define sentence simplification as a single-stage task which can be learned from examples. Contrary to the natural language processing tasks which we have worked on before, sentence simplification is more dynamic because in different contexts different compression rates might be required for the same sentence. A corpus of sentence simplification output from humans only gives examples of word actions which can be performed to reduce a sentence, not of the actions which must be performed. In a corpus with examples of low compression, the number of copied examples of a phrase will outnumber the number of reduced versions. In those cases the learners choose the most frequent alternative, which explains why they do not reach the required compression rates.

We believe that the two-stage method which we have used in the rule-based approach is better suited for sentence simplification: first find all candidate phrases for deletion and then choose a subset of these in such a way that the required compression rate is met. Ideally, the first step would be performed by a machine learner. However, this requires a different parallel corpus of examples than the one we have used for our work: all deletion candidates need to be labeled rather than only some. Additionally, the corpus needs to contain a specification of links between the candidate phrases: some phrases in a sentence cannot be deleted without removing the phrases that depend on them as well. Constructing such a corpus as well as designing the associated learner is an interesting topic for future work.

Another promising task that will lead to better sentence reduction is improving the applied shallow parser. Currently it generates only one relation, the verb-subject relation. It would be very useful if the parser identified a larger range of verb-related phrases: agents, patients and modifiers of different types, like those containing time and space specifications.
Such a shallow parser would allow learning sentence simplification from non-parallel corpora: from the relations present in the corpus the learner would discover which types of phrases are compulsory for a verb and which are optional.

References

Carroll, J., Minnen, G., Canning, Y., Devlin, S., and Tait, J. (1998). Practical Simplification of English Newspaper Text to Assist Aphasic Readers. In Proceedings of the AAAI-98 Workshop on Integrating AI and Assistive Technology. Madison, WI, USA.

Caruana, R. and Freitag, D. (1994). Greedy Attribute Selection. In Proceedings of the Eleventh International Conference on Machine Learning. New Brunswick, NJ, USA. Morgan Kaufmann.

Chandrasekar, R. and Srinivas, B. (1997). Automatic Induction of Rules for Text Simplification. Knowledge-Based Systems, 10.

Daelemans, W., Zavrel, J., Van der Sloot, K., and Van den Bosch, A. (2003a). MBT: Memory Based Tagger, version 2.0, Reference Guide. ILK Technical Report.

Daelemans, W., Zavrel, J., Van der Sloot, K., and Van den Bosch, A. (2003b). TiMBL: Tilburg Memory Based Learner, version 5.0, Reference Guide. ILK Technical Report.

Grefenstette, G. (1998). Producing Intelligent Telegraphic Text Reduction to Provide an Audio Scanning Service for the Blind. In Intelligent Text Summarization, AAAI Spring Symposium Series.

Hori, C., Furui, S., Malkin, R., Yu, H., and Waibel, A. (2002). Automatic Speech Summarization Applied to English Broadcast News Speech. In Proceedings of ICASSP 2002. Orlando, FL, USA.

Jing, H. (2000). Sentence Reduction for Automatic Text Summarization. In Proceedings of ANLP-2000. Seattle, WA, USA.

Kiyota, Y. and Kurohashi, S. (2001). Automatic Summarization of Japanese Sentences and its Application to a WWW KWIC Index. In Proceedings of SAINT 2001. San Diego, CA, USA.

Knight, K. and Marcu, D. (2000). Statistics-based Summarization Step One: Sentence Compression. In Proceedings of AAAI. Austin, TX, USA.

Noreen, E. W. (1989). Computer-Intensive Methods for Testing Hypotheses. John Wiley & Sons.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Riezler, S., King, T. H., Crouch, R., and Zaenen, A. (2003). Statistical Sentence Condensation Using Ambiguity Packing and Stochastic Disambiguation Methods for Lexical-functional Grammar. In Proceedings of HLT-NAACL 2003. Edmonton, Canada.

Robert-Ribes, J., Pfeiffer, S., Ellison, R., and Burnham, D. (1999). Semi-automatic Captioning of TV Programs, an Australian Perspective. In Proceedings of TAO Workshop on TV Closed Captions for the Hearing-impaired People. Tokyo, Japan.

Shinyama, Y., Sekine, S., Sudo, K., and Grishman, R. (2002). Automatic Paraphrase Acquisition from News Articles. In Proceedings of HLT. San Diego, CA, USA.

Tjong Kim Sang, E. F., Daelemans, W., and Höthker, A. (2004). Reduction of Dutch Sentences for Automatic Subtitling.

Van den Bosch, A., Canisius, S., Daelemans, W., Hendrickx, I., and Tjong Kim Sang, E. (2004). Memory-based Semantic Role Labeling: Optimizing Features, Algorithm, and Output. In Proceedings of CoNLL. Boston, MA, USA.

A Examples

This section contains the first 15 sentences of the test data. For every sentence we show the original version (O), the gold standard simplified version (G), the simplified version generated by the memory-based learner TiMBL (T) and the simplified version produced by the hand-crafted deletion rules (R). Both the input and the output of the systems are lower-case sentences without punctuation signs.

O: en er komt ook overleg met alle betrokken partijen
G: er komt overleg met alle partijen
T: en er komt ook overleg met alle betrokken partijen
R: en er komt ook overleg

O: dat lijkt goed nieuws voor de bezorgde boeren
G: dat lijkt goed nieuws voor de boeren
T: dat is goed nieuws voor de bezorgde boeren
R: dat lijkt goed nieuws voor de boeren

O: als dat meer dan 54 % is dan is dat zo als dat objectief kan vastgesteld worden
G: als dat meer dan 54 % is dan is dat zo
T: als dat meer dan 54 % is is dat zo als dat objectief kan vastgesteld worden
R: als dat meer dan 54 % is dan is

O: ik heb het gevoel dat er een beetje een symbooldiscussie is
G: ik heb het gevoel dat er een symbooldiscussie is
T: in heb het gevoel dat er een symbooldiscussie
R: ik heb het gevoel

O: en dat misschien bepaalde organisaties zeggen het moet zeker meer zijn dan 50 % want anders is het niet goed
G: bepaalde organisaties zeggen het moet zeker meer dan 50 % zijn anders is het niet goed
T: dat misschien bepaalde organisaties zeggen je moet zeker meer zijn dan 50 procent anders niet goed
R: en dat organisaties zeggen het moet zeker meer zijn dan 50 % want is het niet goed

O: voor mij kan dat zelfs 60 % zijn als dat tenminste wetenschappelijk onderbouwd is
G: het mag 60 % zijn als het maar wetenschappelijk onderbouwd is
T: voor mij kan dat zelfs 60 % zijn als dat tenminste onderbouwd is
R: voor mij kan dat 60 % zijn als dat onderbouwd is

O: in afwachting komt er nieuw overleg en met de landbouw en met de milieubeschermers en in de regering
G: in afwachting komt er overleg met landbouw en milieubeschermers en in de regering
T: in afwachting komt er nieuw overleg en met de landbouw en met de arbeid en in de regering
R: er komt overleg en met de landbouw en met de milieubeschermers en in de regering

O: vanaf maandag wil ik eigenlijk met alle betrokken ministers met de ganse regering opnieuw rond de tafel om opnieuw die synthese te kunnen maken
G: vanaf maandag wil ik met de ganse regering opnieuw rond de tafel om daar die synthese te kunnen maken
T: vanaf maandag wil ik eigenlijk met alle betrokken ministers met de ganse regering opnieuw rond de tafel om opnieuw die synthese te kunnen komen
R: ik wil eigenlijk met alle betrokken ministers met de ganse regering om die synthese te kunnen maken

O: de hele discussie binnen de vlaamse regering heeft dus te maken met het europese milieubeleid
G: de hele discussie heeft dus te maken met het europese milieubeleid
T: de hele discussie binnen de vlaamse regering heeft te maken met het europese milieubeleid
R: de hele discussie binnen de vlaamse regering heeft dus te maken

O: lieven verstraete zocht uit wat europa nu eigenlijk wil
G: lieven verstraete zocht uit wat europa wil
T: lieven verstraete zocht uit wat zij wil
R: lieven verstraete zocht uit wat europa wil

O: tegen begin volgend jaar mag het grondwater nog maar 50 miligram nitraat per liter bevatten
G: tegen 2003 mag het grondwater maar 50 milligram nitraat per liter bevatten
T: tegen begin 2003 mag het grondwater nog maar 50 miligram nitraat per liter bevatten
R: tegen begin volgend jaar mag het grondwater nog 50 miligram nitraat bevatten

O: en de vlaamse landbouw met zijn uitgebreide veestapel is dus een belangrijke vervuiler
G: de vlaamse landbouw met zijn grote veestapel is dus een belangrijke vervuiler
T: de vlaamse landbouw met zijn uitgebreide veestapel is een belangrijke vervuiler
R: en de vlaamse landbouw met zijn uitgebreide veestapel is dus een vervuiler

O: om de nitraatnorm te halen kan een land kwetsbare gebieden afbakenen
G: om de nitraat-norm te halen kan een land kwetsbare gebieden afbakenen
T: om de nitraatnorm te halen kan kwetsbare gebieden afbakenen
R: om de nitraatnorm te halen kan een land gebieden afbakenen

O: daar mag dan 35 procent minder mest uitgereden worden dan elders
G: daar mag dan 35 % minder mest uitgereden worden dan elders
T: daar mag dan 35 % minder mest uitgereden worden dan elders
R: daar mag 35 procent minder mest uitgereden worden dan elders

O: tijdwinst dus
G: tijdswinst dus
T: tijdwinst
R: tijdwinst dus