Sentence Simplification and Automatic Syntactic Analysis

Size: px
Start display at page:

Download "Sentence Simplification and Automatic Syntactic Analysis"

Transcription

1 Evaluation of Sentence Simplification by Machine Learning and Handcrafted Rules ATraNoS WP4-15 Erik F. Tjong Kim Sang CNTS - Language Technology Group University of Antwerp erikt@uia.ua.ac.be Abstract This progress report describes the Antwerp approach to sentence simplification in the framework of the ATraNoS project with a focus on the evaluation of the sentence simplification systems we have built. We describe earlier work in sentence simplification and explain our two approaches to sentence reduction: one based on machine learning while the other one relies on handcrafted deletion rules. We compare the performances of the systems and then discuss the applied evaluation method 1. 1 Introduction There exists a large body of work on text summarization but the largest part of this concerns extraction approaches in which a subset of the sentences in a text is proposed as a summary of the text. Most of the work on sentence simplification (also called sentence reduction, sentence condensation and sentence compression) relies on an automatic syntactic analysis after which handcrafted rules, sometimes with learned probabilities, remove optional branches in the syntactic trees. For evaluation, most systems rely on human judges which examine the grammaticality and the information content of the reduced sentence although some authors also propose automatic evaluation methods which are shown to correlate with the decisions made by the judges. Knight and Marcu (2000) compare two machine learners, a noisy channel model and decision trees, on compressing sentences by learning to map syntactic trees to other trees. After training on a corpus of about 1000 sentences, both systems performed significantly better than a baseline based on word bigram occurrences, according to a blind evaluation by human judges. Hori et al. (2002) compute various word and word relation scores, and build a compressed spoken sentence by finding an optimal set of words with dynamic programming. 2004). 1 This report is part of a paper submitted for the proceedings of CLIN-2003 (Tjong Kim Sang et al., 1

2 for each pair of transcriptions and subtitles remove punctuation and capitalization link every word to copies and synonyms remove duplicate links based on longest chains remove remaining crossing links link multi-word synonyms link neighboring unlinked words Table 1: High-level pseudo-code of the word alignment algorithm They show that this approach performs better than randomly choosing words from the original sentence, according to an automatic evaluation. Most other work on sentence condensation involves (shallow) parsing, sometimes with additional lexical and morphological tools, together with rule-based reduction, occasionally with statistical scores like tf idf (Carroll et al., 1998; Chandrasekar and Srinivas, 1997; Grefenstette, 1998; Jing, 2000; Kiyota and Kurohashi, 2001; Riezler et al., 2003). An interesting approach to sentence compression is by paraphrasing, that is replacing phrases by shorter phrases with the same meaning. Shinyama et al. (2002) outline how paraphrases can be derived automatically from related documents. The application of sentence simplification for automatically generating subtitles for TV programs is mentioned in Robert-Ribes et al. (1999) but to our knowledge there has been no practical study linking the two. 2 Data We have defined sentence simplification as a word classification task. In order to convert a transcribed sentence to a subtitle, four basic word actions can be taken: word copy, word deletion, word replacement and word insertion. In comparison with a subtitle, the words from the transcribed sentence can be copied, deleted or replaced by another word. Furthermore, words that do not appear in the transcription can be inserted in the subtitle. Since the subtitles are often quite similar to the transcription, the word copy action is the most frequent action. Here is an example from the corpus: de politici vinden de euro natuurlijk/delete een goeie/goede zaak In this sentence, two words need to be changed. The adverb natuurlijk needs to be deleted and the adjective goeie needs to be replaced by goede. In order to find out what word actions are required to transform the transcription to the subtitles, the sentence-level alignment of the parallel ATraNoS corpus (WP4-03) is insufficient. We need alignment at word level. Automatically retrieving word-to-word links is more difficult than obtaining sentence alignments. Subtitle words may occur in a different position than in the transcription and this makes it hard to match corresponding words. Generating a rough alignment and correcting this manually requires too much work. 2

3 sentences words copied deleted replaced CR train 9, ,519 93,506 22,986 9, % development 1,345 15,577 11,660 2,767 1, % test 1,314 15,605 11,737 2,836 1, % Table 2: Size of the three parts of the selected VRT news broadcast data measured in number of sentences and number of words. The final four columns show the relation between the transcriptions and the subtitles: the number of copied words (75%), the number of deleted words (18%), the number of replaced words (7%) and the average character compression rate: the percentage of remaining characters in the subtitles. In order to get some useful material, we have performed an automatic word alignment and discarded all sentences in which the alignment could be suspected to have been unsuccessful. Word alignment started with removing all punctuation signs and converting capital characters to lower case, thus simulating the future input of automatically transcribed spoken text. This cleanup action also made word alignment easier because there were inconsistencies in the punctuation and capitalization between the transcriptions and the subtitles. An overview of the other actions performed by the word alignment algorithm can be found in Table 1. After performing the word alignment, we have selected sentence pairs from the news corpus that contained sentences which shared at least half of their words or which contained at most three different words. This restriction was chosen in order to avoid including pairs in the training data that were difficult to learn from. Pairs of copied sentences have also been excluded from the material since those were uninteresting from a sentence simplification point of view. Of the parallel corpus, we have only used texts of the VRT news part because we wanted to create training material with a more-or-less consistent content. The data selection method has resulted in a collection of 12,535 sentences (156,701 words). We have divided this in a training part (9,876 sentences, 125,519 words), a development part (1,345 sentences, 15,577 words) and a test part (1,314 sentences, 15,605 words, see Table 2). Each word in the three data sets was specified by 48 features: the six features word, lemma, part-of-speech tag, chunk tag, relation tag and person name tag for the word and its six nearest neighbors as well the classes of the six nearest neighbors (estimated for the development and test data). Person name tags have been generated by a basic named entity tagger. For each sentence we have computed the compression rate: the number of characters in the subtitles divided by those in the transcripts. This compression rate will be a target for the machine learners. 3 Methods We have applied four different methods to the data. The first is a baseline system which simply generated the most frequent output class for each word. The second is the memorybased tagger MBT (Daelemans et al., 2003a). The third is the memory-based learner 3

4 TiMBL (Daelemans et al., 2003b) and the fourth is a list of hand-crafted deletion rules. In this section we will describe the third and fourth method in more detail. Memory-based learning involves storing training data items and assigning classes to test items which correspond with training items that are similar. There are different memorybased learning algorithms and different ways for computing item similarity. We have used the default settings of TiMBL: the ib1 algorithm with gain ratio feature weighting (Daelemans et al., 2003b). TiMBL has an important advantage over MBT: it allows data items with more than one feature while MBT is restricted to items with one feature. We have applied three extensions to the memory-based learner. The first is feature selection which we have implemented with bidirectional hill-climbing (Caruana and Freitag, 1994). It starts with evaluating each individual feature and continues with evaluating all pairs containing the best individual. This process is continued until adding or removing a single feature to the current set does not result in an increased performance. Feature selection is necessary because the chosen machine learner is sensitive to feature choice: it might perform better with a subset of features than with the complete set. The second extension to the memory-based learner which we employed was classifier stacking. We want to use of the classes of the neighbor words as features for the current word. In order to obtain these we ran preliminary experiments and used the generated output classes as context class features in the next experiments. The third extension was the use of trigram output classes (Van den Bosch et al., 2004). Rather than predicting the action for the current word, we predict the previous, current and next action. This allows the neighboring words to take part in deciding what action to take for the current word and this improves performance. The decision which action to perform is made by a majority vote in which ties are solved by choosing the action predicted by the focus word. The approach of the baseline and the two machine learners is to find the best action classes for each word in one step. The sentence simplification method based on hand-crafted deletion processes the data in two steps. First, it finds all words and phrases in the data that are candidates for deletion and then it selects phrases for deletion until the required compression rate is met. We believe that the first step could have been made with a learner as well but until now we have not been able to define a learning task for deleting all optional material from a sentence (note that we only have training examples in which a small part is deleted). We have defined different deletion rules, among others for removing adverbs, adjectives, first names, prepositional phrases, phrases between commas or brackets, relative clauses, numbers and time phrases. In order to avoid tuning the rules on the test data, we have written them based on other news text (the Dutch NOS Teletext). As an example, we show a simplified version rule for adverb deletion: if (currentpos == BW and not currentword is sentencefinal and not previousverb in baseverbs and not currentword in negations) then mark currentword as deletion candidate 4

5 average CR CR met Precision Recall F β=1 Baseline 94.9% 22.0% 51.0% 10.2% 17.0 MBT 90.3% 33.2% 38.9% 16.7% 23.4 TiMBL 86.7% 38.2% 32.4% 20.2% 24.9 Rules 69.4% 87.7% 26.0% 25.8% 25.9 TiMBL+Rules 70.6% 81.1% 25.7% 28.5% 27.0 Table 3: Sentence reduction performances of the memory-based tagger MBT, the memorybased learner TiMBL, the hand-crafted deletion rules and a combination of TiMBL and the rules, measured on the test data set by the average percentage of remaining characters (compression rate: CR), target: 81.8%), the percentage of sentences for which this compression rate was met, and precision, recall and F β=1 computed for word deletions and word replacements. The baseline system predicts the most frequent action for each word. After identifying all candidates for deletion, a selection of these phrases will be removed. The selection process starts with removing the shortest phrases, measured in number of words. If phrases have the same length, preference is given to phrases closer to the end of the sentence. The selection process continues until the required compression rate is met or until there are no more phrases left to delete. 4 Experiments and results We have performed feature selection for both MBT and TiMBL while training with the available training data and testing with the development data. Words, lemmas, part-ofspeech tags and chunk tags were selected for the memory-based learner but the relation information, person name information as well as the classes of the neighboring words were deemed unuseful. Performance was measured in F β=1 rate, the harmonic mean of precision and recall, for deletion and replacement actions. The feature sets corresponding with the best performances, wwfwaa,hnfaa for MBT and w 2 2 l 1 1 p 2 3 c 0 for TiMBL, have been used for processing the test data. An overview of the performances on this data set can be found in Table 3. As can be seen from the low baseline F β=1 rate, this is a difficult task. All approaches performed significantly better than the baseline with respect to F β=1 rates (p<0.01 according to a bootstrap sampling test (Noreen, 1989)). TiMBL was significantly better than MBT (p<0.05). The hand-crafted deletion rules did better than TiMBL, both with respect to the F β=1 rate (not significant, p 0.15) and the percentage of sentences that met the required compression rate (significant, p<0.01). Note that contrary to the deletion rules, neither MBT nor TiMBL had access to the required compression rates. The fact that the hand-crafted rules outperformed the two learners is interesting since the rules perform only part of the task (word deletion) while the learners also predict word replacements. We have created a basic combined system that chooses the deletions predicted by the rules and the replacements predicted by TiMBL. This system obtained a slightly 5

6 higher F β=1 rate than the handcrafted deletion rules (p 0.10, Table 3). In comparison with the rules, the combined system did not compress as many sentences sufficiently. In some cases word replacement introduced longer words and no extra post-processing has been performed to undo the effects of this. 5 Evaluation We have chosen to evaluate the output of the learners by examining the deleted and replaced words and comparing them with a gold standard. This will not always be fair for the learner, as the following example shows: O: in afwachting komt er nieuw overleg en met de landbouw en met de milieubeschermers en in de regering G: in afwachting komt er overleg met landbouw en milieubeschermers en in de regering R: er komt overleg en met de landbouw en met de milieubeschermers en in de regering The first sentence (O) is the original transcribed sentence, with the target words for deletion marked in italics, the second is the associated subtitle (Gold) and the third is the output of the rule-based system (R). The system output is fine: it satisfies the required compression rate and it is grammatically sound. Yet, only one of the three word deletions proposed by the system occurs in the subtitle (precision: 33%) while it suggests only one of the five deletions in the subtitle (recall: 20%; F β=1 : 25). According to the evaluation measure the suggested simplification is not very good. An excellent solution for this problem would be a manual evaluation of the system output. This evaluation can address different issues: whether the compression rate has been met, whether simplified sentence still contains the same information content as the original one and whether the simplification process has resulted in a grammatical sentence. However, in the process of system development we generate many different analyses of the development data and a repeated full manual evaluation of the results would be too costly. At this moment we are working on an intermediate evaluation process which uses several simplifications targets for each source sentence (Papineni et al., 2002). This requires extra manual work when building the test corpus but it will increase the probability of correctly assessing the performance of the learner. For this particular report, we have performed a manual evaluation of the output of two systems: the general memory-based learner TiMBL and the handcrafted deletion rules. We examined the first 200 reduced sentences they produced for the test data. Results were scored on a three-level scale (1.0: good, 0.5: medium, 0.0: bad) for syntactic and semantic correctness. Averages for both scales have been computed. A problem with this type of evaluation is that a passive approach to sentence reduction will result in high performance 6

7 all CR not met CR met nbr syn sem nbr syn sem nbr syn sem TiMBL % 70% % 78% 70 54% 56% Rules % 64% 26 89% 81% % 61% Table 4: Results of a manual evaluation of the output for the first 200 sentences of the test data. The output of the systems was evaluated with respect to syntactic correctness (syn) and semantic correctness (sem). Separate overviews are given for all sentences, sentences for which the required compression rate (CR) was not met and sentences for which this rate was met. rates with respect to the linguistic correctness of the output: by copying all words of the sentence we would get the original sentence which is expected to be grammatically correct. Therefore we have registered average performance rates for all sentences as well as for sentences for which the required compression rate was met and for sentences for which this was not the case. An overview of the results can be found in Table 4. When we compare the quality of the output sentences that met the required compression rate (final three columns of Table 4), we see that the handcrafted deletion rules produced better quality output, both with respect to syntactic correctness (p<0.01) and semantic correctness (p 0.15). However, the obtained performance rates, around 65%, are not spectacular. The evaluation performed in this report presupposes applying the systems in a stand-alone set-up without human post-processing. A more reasonable evaluation would be to test the systems in combination with human post-processing and measure the time gains which they provide in comparison with working without them. 6 Concluding remarks and future work We have presented work on sentence simplification targeted at the automatic generation of Dutch TV subtitles. A large part of the work was based on automatic learning from examples. From our training corpora, we have used the most interesting pairs of sentences from the corpus: those that were different but shared enough words to enable the learner to successfully reduce them. We have applied a memory-based tagger (MBT), a general memory-based learner (TiMBL) and a set of hand-crafted deletion rule to this data. The general learner outperformed the tagger and the rules did even better. A combination of the general learner (word replacements) and the rules (word deletion) achieved the best score on the data. However, we have subsequently shown that our evaluation procedure is not always fair to the systems. A manual evaluation of the output of two systems revealed that the quality of the rule-based system output was better than that of the general memory-based system. The quality figures were not very high (around 65%) and an additional processing time gain evaluation would be interesting. The compression rates obtained by the two learners (Table 3, column CR met) show that it is not a good idea to define sentence simplification as a single-stage task which can be 7

8 learned from examples. Contrary to the natural language processing tasks which we have worked on before, sentence simplification is more dynamic because in different contexts different compression rates might be required for the same sentence. Taking a corpus of sentence simplification output from humans, only gives examples of word actions which can be performed to reduce a sentence and not what actions must be performed. In a corpus with examples of low compression, the number of copied examples of a phrase will of outnumber the number of reduced versions. In those cases the learner chooses for the most frequent alternative which explains why they do not reach the required compression rates. We believe that the two-stage method which we have used in rule-based approach is better suited for sentence simplification: first find all candidate phrases for deletion and then choose a subset of these in such a way that the required compression rate is met. Ideally, the first step would have been performed by a machine learner. However, this requires a different parallel corpus of examples than the one we have used for our work: all deletion candidates need to be labeled rather than some. Additionally the corpus needs to contain a specification of links between the candidate phrases: some phrases in a sentence cannot be deleted without removing the phrases that depend on them as well. Constructing such a corpus as well as designing the associated learner is an interesting topic for future work. Another promising task that will lead to better sentence reduction is improving the applied shallow parser. Currently it generates only one relation, the verb-subject relation. It would be very useful would the parser identify a large range of verb-related phrases: agents, patients and modifiers of different types, like those containing time and space specifications. Such a shallow parser would allow learning sentence simplification from non-parallel corpora: from the relations present in the corpus the learner would discover which types of phrases are compulsory for a verb and which are optional. References Carroll, J., Minnen, G., Canning, Y., Devlin, S., and Tait, J. (1998). Practical Simplification of English Newspaper Text to Assist Aphasic Readers. In Proceedings of the AAAI-98 Workshop on Integrating AI and Assistive Technology. Madison, WI, USA. Caruana, R. and Freitag, D. (1994). Greedy Attribute Selection. In Proceedings of the Eleventh International Conference on Machine Learning, pages New Brunswick, NJ, USA, Morgan Kaufman. Chandrasekar, R. and Srinivas, B. (1997). Automatic Induction of Rules for Text Simplification. Knowledge-Based Systems, 10: Daelemans, W., Zavrel, J., Van der Sloot, K., and Van den Bosch, A. (2003a). MBT: Memory Based Tagger, version 2.0, Reference Guide. ILK Technical Report ILK Daelemans, W., Zavrel, J., Van der Sloot, K., and Van den Bosch, A. (2003b). TiMBL: Tilburg Memory Based Learner, version 5.0, Reference Guide. ILK Technical Report ILK Grefenstette, G. (1998). Producing Intelligent Telegraphic Text Reduction to Provide an Audio Scanning Service for the Blind. In Intelligent Text Summarization, AAAI Spring Symposium Series, pages

9 Hori, C., Furui, S., Malkin, R., Yu, H., and Waibel, A. (2002). Automatic Speech Summarization Applied to English Broadcast News Speech. In Proceedings of ICASSP 2002, pages Orlando, FL, USA. Jing, H. (2000). Sentence Reduction for Automatic Text Summarization. In Proceedings of ANLP- 2000, pages Seattle WA. Kiyota, Y. and Kurohashi, S. (2001). Automatic Summarization of Japanese Sentences and its Application to a WWW KWIC Index. In Proceedings of SAINT 2001, pages San Diego, CA, USA. Knight, K. and Marcu, D. (2000). Statistics-based Summarization Step One: Sentence Compression. In Proceedings of AAAI Austin TX. Noreen, E. W. (1989). Computer-Intensive Methods for Testing Hypotheses. John Wiley & Sons. Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages Riezler, S., King, T. H., Crouch, R., and Zaenen, A. (2003). Statistical Sentence Condensation Using Ambiguity Packing and Stochastic Disambiguation Methods for Lexical-functional Grammar. In Proceedings of HLT-NAACL 2003, pages Edmonton, Canada. Robert-Ribes, J., Pfeiffer, S., Ellison, R., and Burnham, D. (1999). Semi-automatic Captioning of TV Programs, an Australian Perspective. In Proceedings of TAO Workshop on TV Closed Captions for the Hearing-impaired People, pages Tokyo, Japan. Shinyama, Y., Sekine, S., Sudo, K., and Grishman, R. (2002). Automatic Paraphrase Acquisition from News Articles. In Proceedings of HLT San Diego, CA, USA. Tjong Kim Sang, E. F., Daelemans, W., and Höthker, A. (2004). Reduction of Dutch Sentences for Automatic Subtitling. Van den Bosch, A., Canisius, S., Daelemans, W., Hendrickx, I., and Tjong Kim Sang, E. (2004). Memory-based Semantic Role Labeling: Optimizing Features, Algorithm, and Output. In Proceedings of CoNLL Boston, MA, USA. 9

10 A Examples This section contains the first 15 sentences of the test data. For every sentence we show the original version (O), the gold standard simplified version (G), the simplified version generated by the memory-based learner TiMBL (T) and the simplified version produced by the hand-crafted deletion rules (R). Both the input and the output of the systems are lower-case sentences without punctuation signs. O: en er komt ook overleg met alle betrokken partijen G: er komt overleg met alle partijen T: en er komt ook overleg met alle betrokken partijen R: en er komt ook overleg O: dat lijkt goed nieuws voor de bezorgde boeren G: dat lijkt goed nieuws voor de boeren T: dat is goed nieuws voor de bezorgde boeren R: dat lijkt goed nieuws voor de boeren O: als dat meer dan 54 % is dan is dat zo als dat objectief kan vastgesteld worden G: als dat meer dan 54 % is dan is dat zo T: als dat meer dan 54 % is is dat zo als dat objectief kan vastgesteld worden R: als dat meer dan 54 % is dan is O: ik heb het gevoel dat er een beetje een symbooldiscussie is G: ik heb het gevoel dat er een symbooldiscussie is T: in heb het gevoel dat er een symbooldiscussie R: ik heb het gevoel O: en dat misschien bepaalde organisaties zeggen het moet zeker meer zijn dan 50 % want anders is het niet goed G: bepaalde organisaties zeggen het moet zeker meer dan 50 % zijn anders is het niet goed T: dat misschien bepaalde organisaties zeggen je moet zeker meer zijn dan 50 procent anders niet goed R: en dat organisaties zeggen het moet zeker meer zijn dan 50 % want is het niet goed O: voor mij kan dat zelfs 60 % zijn als dat tenminste wetenschappelijk onderbouwd is G: het mag 60 % zijn als het maar wetenschappelijk onderbouwd is T: voor mij kan dat zelfs 60 % zijn als dat tenminste onderbouwd is R: voor mij kan dat 60 % zijn als dat onderbouwd is O: in afwachting komt er nieuw overleg en met de landbouw en met de milieubeschermers en in de regering G: in afwachting komt er overleg met landbouw en milieubeschermers en in de regering T: in afwachting komt er nieuw overleg en met de landbouw en met de arbeid en in de regering R: er komt overleg en met de landbouw en met de milieubeschermers en in de regering 10

11 O: vanaf maandag wil ik eigenlijk met alle betrokken ministers met de ganse regering opnieuw rond de tafel om opnieuw die synthese te kunnen maken G: vanaf maandag wil ik met de ganse regering opnieuw rond de tafel om daar die synthese te kunnen maken T: vanaf maandag wil ik eigenlijk met alle betrokken ministers met de ganse regering opnieuw rond de tafel om opnieuw die synthese te kunnen komen R: ik wil eigenlijk met alle betrokken ministers met de ganse regering om die synthese te kunnen maken O: de hele discussie binnen de vlaamse regering heeft dus te maken met het europese milieubeleid G: de hele discussie heeft dus te maken met het europese milieubeleid T: de hele discussie binnen de vlaamse regering heeft te maken met het europese milieubeleid R: de hele discussie binnen de vlaamse regering heeft dus te maken O: lieven verstraete zocht uit wat europa nu eigenlijk wil G: lieven verstraete zocht uit wat europa wil T: lieven verstraete zocht uit wat zij wil R: lieven verstraete zocht uit wat europa wil O: tegen begin volgend jaar mag het grondwater nog maar 50 miligram nitraat per liter bevatten G: tegen 2003 mag het grondwater maar 50 milligram nitraat per liter bevatten T: tegen begin 2003 mag het grondwater nog maar 50 miligram nitraat per liter bevatten R: tegen begin volgend jaar mag het grondwater nog 50 miligram nitraat bevatten O: en de vlaamse landbouw met zijn uitgebreide veestapel is dus een belangrijke vervuiler G: de vlaamse landbouw met zijn grote veestapel is dus een belangrijke vervuiler T: de vlaamse landbouw met zijn uitgebreide veestapel is een belangrijke vervuiler R: en de vlaamse landbouw met zijn uitgebreide veestapel is dus een vervuiler O: om de nitraatnorm te halen kan een land kwetsbare gebieden afbakenen G: om de nitraat-norm te halen kan een land kwetsbare gebieden afbakenen T: om de nitraatnorm te halen kan kwetsbare gebieden afbakenen R: om de nitraatnorm te halen kan een land gebieden afbakenen O: daar mag dan 35 procent minder mest uitgereden worden dan elders G: daar mag dan 35 % minder mest uitgereden worden dan elders T: daar mag dan 35 % minder mest uitgereden worden dan elders R: daar mag 35 procent minder mest uitgereden worden dan elders O: tijdwinst dus G: tijdswinst dus T: tijdwinst R: tijdwinst dus 11

Reduction of Dutch Sentences for Automatic Subtitling

Reduction of Dutch Sentences for Automatic Subtitling Reduction of Dutch Sentences for Automatic Subtitling Erik F. Tjong Kim Sang, Walter Daelemans and Anja Höthker CNTS Language Technology Group, University of Antwerp, Belgium Abstract We compare machine

More information

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged

More information

Detecting semantic overlap: Announcing a Parallel Monolingual Treebank for Dutch

Detecting semantic overlap: Announcing a Parallel Monolingual Treebank for Dutch Detecting semantic overlap: Announcing a Parallel Monolingual Treebank for Dutch Erwin Marsi & Emiel Krahmer CLIN 2007, Nijmegen What is semantic overlap? Zangeres Christina Aguilera heeft eindelijk verteld

More information

The University of Amsterdam s Question Answering System at QA@CLEF 2007

The University of Amsterdam s Question Answering System at QA@CLEF 2007 The University of Amsterdam s Question Answering System at QA@CLEF 2007 Valentin Jijkoun, Katja Hofmann, David Ahn, Mahboob Alam Khalid, Joris van Rantwijk, Maarten de Rijke, and Erik Tjong Kim Sang ISLA,

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

Annotation Guidelines for Dutch-English Word Alignment

Annotation Guidelines for Dutch-English Word Alignment Annotation Guidelines for Dutch-English Word Alignment version 1.0 LT3 Technical Report LT3 10-01 Lieve Macken LT3 Language and Translation Technology Team Faculty of Translation Studies University College

More information

Timeline (1) Text Mining 2004-2005 Master TKI. Timeline (2) Timeline (3) Overview. What is Text Mining?

Timeline (1) Text Mining 2004-2005 Master TKI. Timeline (2) Timeline (3) Overview. What is Text Mining? Text Mining 2004-2005 Master TKI Antal van den Bosch en Walter Daelemans http://ilk.uvt.nl/~antalb/textmining/ Dinsdag, 10.45-12.30, SZ33 Timeline (1) [1 februari 2005] Introductie (WD) [15 februari 2005]

More information

How To Identify And Represent Multiword Expressions (Mwe) In A Multiword Expression (Irme)

How To Identify And Represent Multiword Expressions (Mwe) In A Multiword Expression (Irme) The STEVIN IRME Project Jan Odijk STEVIN Midterm Workshop Rotterdam, June 27, 2008 IRME Identification and lexical Representation of Multiword Expressions (MWEs) Participants: Uil-OTS, Utrecht Nicole Grégoire,

More information

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University 1. Introduction This paper describes research in using the Brill tagger (Brill 94,95) to learn to identify incorrect

More information

Brill s rule-based PoS tagger

Brill s rule-based PoS tagger Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based

More information

Extraction of Hypernymy Information from Text

Extraction of Hypernymy Information from Text Extraction of Hypernymy Information from Text Erik Tjong Kim Sang, Katja Hofmann and Maarten de Rijke Abstract We present the results of three different studies in extracting hypernymy information from

More information

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990

More information

Daan & Rembrandt Research Wendelien Daan By Willemijn Jongbloed Group D October 2009

Daan & Rembrandt Research Wendelien Daan By Willemijn Jongbloed Group D October 2009 Daan & Rembrandt Research Wendelien Daan By Willemijn Jongbloed Group D October 2009 Doing Dutch Often, the work of many Dutch artist; photographers, designers, and painters is easy to recognize among

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language

More information

Coreference Resolution on Blogs and Commented News

Coreference Resolution on Blogs and Commented News Coreference Resolution on Blogs and Commented News Iris Hendrickx 1 and Veronique Hoste 1,2 1 LT3 - Language and Translation Technology Team, University College Ghent, Groot-Brittanniëlaan 45, Ghent, Belgium

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

PoS-tagging Italian texts with CORISTagger

PoS-tagging Italian texts with CORISTagger PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy fabio.tamburini@unibo.it Abstract. This paper presents an evolution of CORISTagger [1], an high-performance

More information

3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work

3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work Unsupervised Paraphrase Acquisition via Relation Discovery Takaaki Hasegawa Cyberspace Laboratories Nippon Telegraph and Telephone Corporation 1-1 Hikarinooka, Yokosuka, Kanagawa 239-0847, Japan hasegawa.takaaki@lab.ntt.co.jp

More information

SAND: Relation between the Database and Printed Maps

SAND: Relation between the Database and Printed Maps SAND: Relation between the Database and Printed Maps Erik Tjong Kim Sang Meertens Institute erik.tjong.kim.sang@meertens.knaw.nl May 16, 2014 1 Introduction SAND, the Syntactic Atlas of the Dutch Dialects,

More information

Turker-Assisted Paraphrasing for English-Arabic Machine Translation

Turker-Assisted Paraphrasing for English-Arabic Machine Translation Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University

More information

Linguistic Research with CLARIN. Jan Odijk MA Rotation Utrecht, 2015-11-10

Linguistic Research with CLARIN. Jan Odijk MA Rotation Utrecht, 2015-11-10 Linguistic Research with CLARIN Jan Odijk MA Rotation Utrecht, 2015-11-10 1 Overview Introduction Search in Corpora and Lexicons Search in PoS-tagged Corpus Search for grammatical relations Search for

More information

CLARIN project DiscAn :

CLARIN project DiscAn : CLARIN project DiscAn : Towards a Discourse Annotation system for Dutch language corpora Ted Sanders Kirsten Vis Utrecht Institute of Linguistics Utrecht University Daan Broeder TLA Max-Planck Institute

More information

Finding Syntactic Characteristics of Surinamese Dutch

Finding Syntactic Characteristics of Surinamese Dutch Finding Syntactic Characteristics of Surinamese Dutch Erik Tjong Kim Sang Meertens Institute erikt(at)xs4all.nl June 13, 2014 1 Introduction Surinamese Dutch is a variant of Dutch spoken in Suriname, a

More information

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words , pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan

More information

~ We are all goddesses, the only problem is that we forget that when we grow up ~

~ We are all goddesses, the only problem is that we forget that when we grow up ~ ~ We are all goddesses, the only problem is that we forget that when we grow up ~ This brochure is Deze brochure is in in English and Dutch het Engels en Nederlands Come and re-discover your divine self

More information

A chart generator for the Dutch Alpino grammar

A chart generator for the Dutch Alpino grammar June 10, 2009 Introduction Parsing: determining the grammatical structure of a sentence. Semantics: a parser can build a representation of meaning (semantics) as a side-effect of parsing a sentence. Generation:

More information

Mining Opinion Features in Customer Reviews

Mining Opinion Features in Customer Reviews Mining Opinion Features in Customer Reviews Minqing Hu and Bing Liu Department of Computer Science University of Illinois at Chicago 851 South Morgan Street Chicago, IL 60607-7053 {mhu1, liub}@cs.uic.edu

More information

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

More information

Natural Language Database Interface for the Community Based Monitoring System *

Natural Language Database Interface for the Community Based Monitoring System * Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University

More information

Leveraging ASEAN Economic Community through Language Translation Services

Leveraging ASEAN Economic Community through Language Translation Services Leveraging ASEAN Economic Community through Language Translation Services Hammam Riza Center for Information and Communication Technology Agency for the Assessment and Application of Technology (BPPT)

More information

Relationele Databases 2002/2003

Relationele Databases 2002/2003 1 Relationele Databases 2002/2003 Hoorcollege 5 22 mei 2003 Jaap Kamps & Maarten de Rijke April Juli 2003 Plan voor Vandaag Praktische dingen 3.8, 3.9, 3.10, 4.1, 4.4 en 4.5 SQL Aantekeningen 3 Meer Queries.

More information

Live subtitling with speech recognition: Causes and consequences of revisions in the production process

Live subtitling with speech recognition: Causes and consequences of revisions in the production process Live subtitling with speech recognition: Causes and consequences of revisions in the production process Luuk Van Waes, Mariëlle Leijten & Aline Remael Master in de Meertalige Professionele Communicatie

More information

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

Parsing Software Requirements with an Ontology-based Semantic Role Labeler Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh mroth@inf.ed.ac.uk Ewan Klein University of Edinburgh ewan@inf.ed.ac.uk Abstract Software

More information

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed

More information

Title page. Title: A conversation-analytic study of time-outs in the Dutch national volleyball. competition. Author: Maarten Breyten van der Meulen

Title page. Title: A conversation-analytic study of time-outs in the Dutch national volleyball. competition. Author: Maarten Breyten van der Meulen Title page Title: A conversation-analytic study of time-outs in the Dutch national volleyball competition Author: Maarten Breyten van der Meulen A conversation-analytic study of time-outs in the Dutch

More information

IP-NBM. Copyright Capgemini 2012. All Rights Reserved

IP-NBM. Copyright Capgemini 2012. All Rights Reserved IP-NBM 1 De bescheidenheid van een schaker 2 Maar wat betekent dat nu 3 De drie elementen richting onsterfelijkheid Genomics Artifical Intelligence (nano)robotics 4 De impact van automatisering en robotisering

More information

Context Grammar and POS Tagging

Context Grammar and POS Tagging Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 dick.chen@lexisnexis.com don.loritz@lexisnexis.com

More information

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features , pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of

More information

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH Gertjan van Noord Deliverable 3-4: Report Annotation of Lassy Small 1 1 Background Lassy Small is the Lassy corpus in which the syntactic annotations

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Learning Morphological Disambiguation Rules for Turkish

Learning Morphological Disambiguation Rules for Turkish Learning Morphological Disambiguation Rules for Turkish Deniz Yuret Dept. of Computer Engineering Koç University İstanbul, Turkey dyuret@ku.edu.tr Ferhan Türe Dept. of Computer Engineering Koç University

More information

Named Entity Recognition in Broadcast News Using Similar Written Texts

Named Entity Recognition in Broadcast News Using Similar Written Texts Named Entity Recognition in Broadcast News Using Similar Written Texts Niraj Shrestha Ivan Vulić KU Leuven, Belgium KU Leuven, Belgium niraj.shrestha@cs.kuleuven.be ivan.vulic@@cs.kuleuven.be Abstract

More information

Example-based Translation without Parallel Corpora: First experiments on a prototype

Example-based Translation without Parallel Corpora: First experiments on a prototype -based Translation without Parallel Corpora: First experiments on a prototype Vincent Vandeghinste, Peter Dirix and Ineke Schuurman Centre for Computational Linguistics Katholieke Universiteit Leuven Maria

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

More information

Hybrid Machine Translation Guided by a Rule Based System

Hybrid Machine Translation Guided by a Rule Based System Hybrid Machine Translation Guided by a Rule Based System Cristina España-Bonet, Gorka Labaka, Arantza Díaz de Ilarraza, Lluís Màrquez Kepa Sarasola Universitat Politècnica de Catalunya University of the

More information

Customizing an English-Korean Machine Translation System for Patent Translation *

Customizing an English-Korean Machine Translation System for Patent Translation * Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,

More information

Transition-Based Dependency Parsing with Long Distance Collocations

Transition-Based Dependency Parsing with Long Distance Collocations Transition-Based Dependency Parsing with Long Distance Collocations Chenxi Zhu, Xipeng Qiu (B), and Xuanjing Huang Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science,

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output

Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output Christian Federmann Language Technology Lab, German Research Center for Artificial Intelligence, Stuhlsatzenhausweg 3, D-66123 Saarbrücken,

More information

Topics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

Topics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment Topics in Computational Linguistics Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment Regina Barzilay and Lillian Lee Presented By: Mohammad Saif Department of Computer

More information

Management control in creative firms

Management control in creative firms Management control in creative firms Nathalie Beckers Lessius KU Leuven Martine Cools Lessius KU Leuven- Rotterdam School of Management Alexandra Van den Abbeele KU Leuven Leerstoel Tindemans - February

More information

Hybrid Strategies. for better products and shorter time-to-market

Hybrid Strategies. for better products and shorter time-to-market Hybrid Strategies for better products and shorter time-to-market Background Manufacturer of language technology software & services Spin-off of the research center of Germany/Heidelberg Founded in 1999,

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

TS3: an Improved Version of the Bilingual Concordancer TransSearch

TS3: an Improved Version of the Bilingual Concordancer TransSearch TS3: an Improved Version of the Bilingual Concordancer TransSearch Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS EAMT 2009 - Barcelona June 14, 2009 Computer assisted translation Preferred by

More information

Interactive Dynamic Information Extraction

Interactive Dynamic Information Extraction Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken

More information

PoliticalMashup. Make implicit structure and information explicit. Content

PoliticalMashup. Make implicit structure and information explicit. Content 1 2 Content Connecting promises and actions of politicians and how the society reacts on them Maarten Marx Universiteit van Amsterdam Overview project Zooming in on one cultural heritage dataset A few

More information

The information in this report is confidential. So keep this report in a safe place!

The information in this report is confidential. So keep this report in a safe place! Bram Voorbeeld About this Bridge 360 report 2 CONTENT About this Bridge 360 report... 2 Introduction to the Bridge 360... 3 About the Bridge 360 Profile...4 Bridge Behaviour Profile-Directing...6 Bridge

More information

OGH: : 11g in de praktijk

OGH: : 11g in de praktijk OGH: : 11g in de praktijk Real Application Testing SPREKER : E-MAIL : PATRICK MUNNE PMUNNE@TRANSFER-SOLUTIONS.COM DATUM : 14-09-2010 WWW.TRANSFER-SOLUTIONS.COM Real Application Testing Uitleg Real Application

More information

SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统

SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统 SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande

More information

Annotation and Evaluation of Swedish Multiword Named Entities

Annotation and Evaluation of Swedish Multiword Named Entities Annotation and Evaluation of Swedish Multiword Named Entities DIMITRIOS KOKKINAKIS Department of Swedish, the Swedish Language Bank University of Gothenburg Sweden dimitrios.kokkinakis@svenska.gu.se Introduction

More information

The TCH Machine Translation System for IWSLT 2008

The TCH Machine Translation System for IWSLT 2008 The TCH Machine Translation System for IWSLT 2008 Haifeng Wang, Hua Wu, Xiaoguang Hu, Zhanyi Liu, Jianfeng Li, Dengjun Ren, Zhengyu Niu Toshiba (China) Research and Development Center 5/F., Tower W2, Oriental

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1

More information

A Method for Automatic De-identification of Medical Records

A Method for Automatic De-identification of Medical Records A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA tafvizi@csail.mit.edu Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA mpacula@csail.mit.edu Abstract

More information

Factored Translation Models

Factored Translation Models Factored Translation s Philipp Koehn and Hieu Hoang pkoehn@inf.ed.ac.uk, H.Hoang@sms.ed.ac.uk School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom

More information

LEXICOGRAPHIC ISSUES IN COMBINATORICS

LEXICOGRAPHIC ISSUES IN COMBINATORICS LEXICOGRAPHIC ISSUES IN COMBINATORICS Editing multiple idiomatic phraseological units for book and CD-ROM Lineke OPPENTOCHT, Utrecht, The Netherlands Abstract In dictionaries dating from pre-computerized

More information

Analysis and Synthesis of Help-desk Responses

Analysis and Synthesis of Help-desk Responses Analysis and Synthesis of Help-desk s Yuval Marom and Ingrid Zukerman School of Computer Science and Software Engineering Monash University Clayton, VICTORIA 3800, AUSTRALIA {yuvalm,ingrid}@csse.monash.edu.au

More information

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告 SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:

More information

POS Tagging 1. POS Tagging. Rule-based taggers Statistical taggers Hybrid approaches

POS Tagging 1. POS Tagging. Rule-based taggers Statistical taggers Hybrid approaches POS Tagging 1 POS Tagging Rule-based taggers Statistical taggers Hybrid approaches POS Tagging 1 POS Tagging 2 Words taken isolatedly are ambiguous regarding its POS Yo bajo con el hombre bajo a PP AQ

More information

101 Inspirerende Quotes - Eelco de boer - winst.nl/ebooks/ Inleiding

101 Inspirerende Quotes - Eelco de boer - winst.nl/ebooks/ Inleiding Inleiding Elke keer dat ik een uitspraak of een inspirerende quote hoor of zie dan noteer ik deze (iets wat ik iedereen aanraad om te doen). Ik ben hier even doorheen gegaan en het leek me een leuk idee

More information

ETL Ensembles for Chunking, NER and SRL

ETL Ensembles for Chunking, NER and SRL ETL Ensembles for Chunking, NER and SRL Cícero N. dos Santos 1, Ruy L. Milidiú 2, Carlos E. M. Crestana 2, and Eraldo R. Fernandes 2,3 1 Mestrado em Informática Aplicada MIA Universidade de Fortaleza UNIFOR

More information

Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation

Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation Rabih Zbib, Gretchen Markiewicz, Spyros Matsoukas, Richard Schwartz, John Makhoul Raytheon BBN Technologies

More information

Download Check My Words from: http://mywords.ust.hk/cmw/

Download Check My Words from: http://mywords.ust.hk/cmw/ Grammar Checking Press the button on the Check My Words toolbar to see what common errors learners make with a word and to see all members of the word family. Press the Check button to check for common

More information

Biology Based Alignments of Paraphrases for Sentence Compression

Biology Based Alignments of Paraphrases for Sentence Compression Biology Based Alignments of Paraphrases for Sentence Compression João Cordeiro CLT and Bioinformatics University of Beira Interior Covilhã, Portugal jpaulo@di.ubi.pt Gäel Dias CLT and Bioinformatics University

More information

Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability

Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability Ana-Maria Popescu Alex Armanasu Oren Etzioni University of Washington David Ko {amp, alexarm, etzioni,

More information

Text Generation for Abstractive Summarization

Text Generation for Abstractive Summarization Text Generation for Abstractive Summarization Pierre-Etienne Genest, Guy Lapalme RALI-DIRO Université de Montréal P.O. Box 6128, Succ. Centre-Ville Montréal, Québec Canada, H3C 3J7 {genestpe,lapalme}@iro.umontreal.ca

More information

employager 1.0 design challenge

employager 1.0 design challenge employager 1.0 design challenge a voyage to employ(ment) EMPLOYAGER 1.0 On the initiative of the City of Eindhoven, the Red Bluejay Foundation organizes a new design challenge around people with a distance

More information

A Flexible Online Server for Machine Translation Evaluation

A Flexible Online Server for Machine Translation Evaluation A Flexible Online Server for Machine Translation Evaluation Matthias Eck, Stephan Vogel, and Alex Waibel InterACT Research Carnegie Mellon University Pittsburgh, PA, 15213, USA {matteck, vogel, waibel}@cs.cmu.edu

More information

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems cation systems. For example, NLP could be used in Question Answering (QA) systems to understand users natural

More information

How To Design A 3D Model In A Computer Program

How To Design A 3D Model In A Computer Program Concept Design Gert Landheer Mark van den Brink Koen van Boerdonk Content Richness of Data Concept Design Fast creation of rich data which eventually can be used to create a final model Creo Product Family

More information

English Grammar Checker

English Grammar Checker International l Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 English Grammar Checker Pratik Ghosalkar 1*, Sarvesh Malagi 2, Vatsal Nagda 3,

More information

Using the BNC to create and develop educational materials and a website for learners of English

Using the BNC to create and develop educational materials and a website for learners of English Using the BNC to create and develop educational materials and a website for learners of English Danny Minn a, Hiroshi Sano b, Marie Ino b and Takahiro Nakamura c a Kitakyushu University b Tokyo University

More information

Shallow Parsing with PoS Taggers and Linguistic Features

Shallow Parsing with PoS Taggers and Linguistic Features Journal of Machine Learning Research 2 (2002) 639 668 Submitted 9/01; Published 3/02 Shallow Parsing with PoS Taggers and Linguistic Features Beáta Megyesi Centre for Speech Technology (CTT) Department

More information

Structural and Semantic Indexing for Supporting Creation of Multilingual Web Pages

Structural and Semantic Indexing for Supporting Creation of Multilingual Web Pages Structural and Semantic Indexing for Supporting Creation of Multilingual Web Pages Hiroshi URAE, Taro TEZUKA, Fuminori KIMURA, and Akira MAEDA Abstract Translating webpages by machine translation is the

More information

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5

More information

Text Opinion Mining to Analyze News for Stock Market Prediction

Text Opinion Mining to Analyze News for Stock Market Prediction Int. J. Advance. Soft Comput. Appl., Vol. 6, No. 1, March 2014 ISSN 2074-8523; Copyright SCRG Publication, 2014 Text Opinion Mining to Analyze News for Stock Market Prediction Yoosin Kim 1, Seung Ryul

More information

EEN HUIS BESTUREN ALS EEN FABRIEK,

EEN HUIS BESTUREN ALS EEN FABRIEK, EEN HUIS BESTUREN ALS EEN FABRIEK, HOE DOE JE DAT? Henk Akkermans World Class Maintenance & Tilburg University Lezing HomeLab 2050, KIVI, 6 oktober, 2015 The opportunity: an industrial revolution is happening

More information

A Mixed Trigrams Approach for Context Sensitive Spell Checking

A Mixed Trigrams Approach for Context Sensitive Spell Checking A Mixed Trigrams Approach for Context Sensitive Spell Checking Davide Fossati and Barbara Di Eugenio Department of Computer Science University of Illinois at Chicago Chicago, IL, USA dfossa1@uic.edu, bdieugen@cs.uic.edu

More information

Verticale tuin maken van een pallet

Verticale tuin maken van een pallet Daar krijg je dan je groendakmaterialen geleverd op een pallet. Waar moet je naar toe met zo'n pallet. Retour brengen? Gebruiken in de open haard? Maar je kunt creatiever zijn met pallets. Zo kun je er

More information

Cross-Lingual Concern Analysis from Multilingual Weblog Articles

Cross-Lingual Concern Analysis from Multilingual Weblog Articles Cross-Lingual Concern Analysis from Multilingual Weblog Articles Tomohiro Fukuhara RACE (Research into Artifacts), The University of Tokyo 5-1-5 Kashiwanoha, Kashiwa, Chiba JAPAN http://www.race.u-tokyo.ac.jp/~fukuhara/

More information

The Importance of Collaboration

The Importance of Collaboration Welkom in de wereld van EDI en de zakelijke kansen op langer termijn Sectorsessie mode 23 maart 2016 ISRID VAN GEUNS IS WORKS IS BOUTIQUES Let s get connected! Triumph Without EDI Triumph Let s get connected

More information

A Multi-document Summarization System for Sociology Dissertation Abstracts: Design, Implementation and Evaluation

A Multi-document Summarization System for Sociology Dissertation Abstracts: Design, Implementation and Evaluation A Multi-document Summarization System for Sociology Dissertation Abstracts: Design, Implementation and Evaluation Shiyan Ou, Christopher S.G. Khoo, and Dion H. Goh Division of Information Studies School

More information

Specification by Example (methoden, technieken en tools) Remco Snelders Product owner & Business analyst

Specification by Example (methoden, technieken en tools) Remco Snelders Product owner & Business analyst Specification by Example (methoden, technieken en tools) Remco Snelders Product owner & Business analyst Terminologie Specification by Example (SBE) Acceptance Test Driven Development (ATDD) Behaviour

More information

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives Ramona Enache and Adam Slaski Department of Computer Science and Engineering Chalmers University of Technology and

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,

More information

7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan

7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan 7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan We explain field experiments conducted during the 2009 fiscal year in five areas of Japan. We also show the experiments of evaluation

More information