SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer
|
|
|
- Gladys Green
- 10 years ago
- Views:
Transcription
1 SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer Timur Gilmanov, Olga Scrivner, Sandra Kübler Indiana University Bloomington, IN, USA {timugilm,obscrivn,skuebler}@indiana.edu Abstract It is well known that word aligned parallel corpora are valuable linguistic resources. Since many factors affect automatic alignment quality, manual post-editing may be required in some applications. While there are several state-of-the-art word-aligners, such as GIZA++ and Berkeley, there is no simple visual tool that would enable correcting and editing aligned corpora of different formats. We have developed SWIFT Aligner, a free, portable software that allows for visual representation and editing of aligned corpora from several most commonly used formats: TALP, GIZA, and NAACL. In addition, our tool has incorporated part-of-speech and syntactic dependency transfer from an annotated source language into an unannotated target language, by means of word-alignment. Keywords: word alignment, parallel corpus, cross-language transfer 1. Introduction and Goals In recent years, parallel word alignment has become a stable in machine translation, and word alignment methods such as GIZA++ (Och and Ney, 2000) have achieved significant advancement. Several factors have contributed to this achievement, namely the availability of large amounts of parallel data as well as state-of-the-art statistical algorithms in modeling and evaluation (Knight and Marcu, 2004). However, there are research areas other than machine translation, which benefit from or even depend upon word alignment. These include bilingual lexicography, multilingual word sense disambiguation, corpus-based translation studies, and cross-language transfer, among others. While machine translation takes word alignment as a given, for all the other applications, an accurate alignment on the word level is essential, and users are more willing to perform manual post-correction. Since many factors, such as genre, closeness of translation, or the distance between languages affect automatic alignment quality (Tufis, 2007), manual post-editing may be required. At present, not many tools exist to correct automatic alignment. Overall, two types of tools for correcting alignment have been introduced: 1) those that use only visual representation and 2) those that combine visual representation with editing. There are several tools that allow users to visualize word alignment pairs without making any modifications. For example, Cairo (Smith and Jahr, 2000), designed as an evaluation tool for the Egypt translation system, displays each sentence pair (source language and its translation) with lines connecting aligned words. In contrast, VisualLIHLA, part of the lexical aligner LIHLA, shows the results of alignments by highlighting aligned words (Caseli et al., 2008). Other tools offer a visualization in combination with manual annotation, for example, ILink (Merkel et al., 2003a), Yawat (Germann, 2008), COWAL (Tufis, 2006), or the UMIACS word alignment interface (Hwa and Madnani, 2004), However, all of them are restricted to a specific format of alignment. Note that those tools, being part of the larger machine translation systems, do not function as individual editors. In this paper, we present SWIFT Aligner (short for: Speedy Word-alignment Interactive Functional Tool), a new software tool that not only allows for visual representation and editing of bitext language corpora, but also offers a number of 2913
2 Figure 1: An example of an aligned sentence. additional useful features: 1. Flexibility between Alignment Formats: Our tool is flexible in that it is not restricted to one specific format. Currently, SWIFT Aligner can import the most commonly used formats: TALP, used by the Berkeley aligner (Liang et al., 2006); GIZA, used by the Giza toolkit (Och and Ney, 2000); and NAACL, used by LIHLA (Caseli et al., 2005). In addition, SWIFT Aligner allows exporting corrected alignment into these formats as well. This flexibility to import and export various formats is a step towards bridging the gap between several machine translation tools, as it provides easier access to large data sets for linguistically oriented researchers, who are, otherwise, limited by specific format restrictions. Finally, an XML format, which is supported by a vast majority of software tools, is introduced for the purposes of internal representation for parallel alignment corpora. Consequently, this format is available for import/export purposes in SWIFT aligner. 2. Interactivity: Our tool is simple, intuitive to use, and interactive. While there are several visualization techniques for parallel alignment, namely word matrices, different coloring schemes for word pairs (see e.g. (Germann, 2008)), or enumerating links between each word pair, we chose the most common and simple technique: drawing lines between pairs of corresponding words. An example of the alignment visualization is shown in Figure 1. The visual interface allows a user to correct the alignment by dragging these lines to the correct source and target word pairs. 3. Multifunctionalilty: We have incorporated automatic part-of-speech (POS) and syntactic dependency annotation via cross-language transfer (see Figure 2 for an example on the POS level). The transfer is based on the word alignment. Thus, the user can import annotations for one (source) language, transfer these annotations to the target language, and work on correcting the transferred annotations manually. Post-transfer manual annotation, depending on the user preferences, can be performed inside SWIFT Aligner. Our tool provides support for manual creation of 2914
3 Figure 2: SWIFT aligner s main GUI. Alignment between words in 3 different sentences is presented. POS labels are displayed for each word. POS and syntactic dependency annotation, and the user can use the same GUI in order to perform the required annotation. However, if a user prefers to perform these corrections outside of SWIFT Aligner, it is possible to export the aligned and annotated text. 4. Platform Independence: SWIFT Aligner is implemented in Java (Java SE 6), thus providing a multi-platform functionality without a difficult installation procedure. The code is written in a modular way. This allows for easy understanding of the code as well as for extensibility. For example, introducing new data formats for exporting/importing purposes, is possible. For such extensions, pre-defined code templates exist and can easily be extended. An elementary familiarity with Java is required, however, in order to add such new functionality. Additionally, the tool supports UTF-8 representations, which allows for support of a wide range of languages, including right-to-left languages. 2. Related Work There is a considerable body of work on alignment for machine translation. Our work is more closely related to approaches that focus on visualizing and post-correcting alignment. Thus, we focus our dis- 2915
4 cussion of related work on work that is relevant in this regard. The combination of automatic alignment and manual post-editing has been introduced in recent alignment tools. One of the first such tools is Interactive Linker (Ahrenberg et al., 2002). This tool, and its later version I*Link, (Merkel et al., 2003b) are built for an interactive incremental alignment process, allowing the human annotator to adjust links between word pairs during automatic alignment. The Combined Word Aligner, COWAL (Tufis, 2006), is a combination of two aligners and a graphical user interface for intermediary and final alignment correction. There are also several tools that focus on the manual editing of alignments provided by the tool, for example, Yawat (Germann, 2008), COWAL, or the UMIACS word alignment interface (Hwa and Madnani, 2004). Note that none of the above listed tools allow imports from other machine translation alignment systems. Each of the mentioned tools for the visualization and editing maintains its own internal format. Since there are several commonly used formats, these tools are difficult to use in a more generalized setting. In the following, we will briefly describe these formats alongside with the XML format that we use for internal representation of parallel corpora in SWIFT aligner. All of the discussed formats represent the alignment shown in Figure 1. The first format that can be used with SWIFT Aligner is the TALP format. An example is shown in Figure 3. The alignment is represented in three separate files, one each for the source text, the target text, and the alignment information. The separate files are represented as individual lines in the example. The third line in Figure 3 encodes the alignment between words, the first number referring to the source sentence and the second number to the target sentence. Thus, the alignment 2-3 specifies that the second source word give is aligned with the third target word donne. The NAACL format, presented in Figure 4, is similar to TALP in that it stores the alignment in three separate files, which we represent by separate lines respectively. In this format, the alignment is represented by referring to the source word rather than its index. To assign a unique id to each sentence in these files, an XML-like tagging for sentence numbers is used. The GIZA format stores the alignment information in one file, with each sentence represented by two lines, the source (second line) and target sentence (first line), with the alignment encoding in parentheses. Thus, the word give in Figure 5 is aligned with the third word in the target sentence in the first line. Finally, the XML format stores all the alignment information in one file, as shown in Figure 6. First, the sentence id is denoted, which is followed by a source and target encodings. 3. Cross-Language Transfer The idea of cross-language transfer by means of word-alignment is not new. Over the past years, several studies have demonstrated the usability of parallel corpora for automatic (morpho-)syntactic transfer from a source language into a target language. One of the advantages of this method is the creation of NLP resources for less-common languages. The other positive outcome lies in the enhancement of machine translation system since many parallel corpora aligners achieve better accuracy when syntactically annotated texts are available. The feasibility of transfer methods has been tested in several realms of machine translation: for example, parallel bilingual parsing, part-of-speech transfer and part-of-speech tagger induction, noun phrase bracketer induction, syntactic transfer, and word sense disambiguation, among others (Yarowsky and Ngai, 2001; Lin, 1998; Wu, 1997; Tufis, 2006). In the area of part-of-speech transfer and sense annotation transfer, the transfer algorithm is traditionally based on a direct label projection from a source into a target language. For example, Yarowsky and Ngai (2001) describe an experiment for morphosyntactic transfer from English into French. The results show that morph-syntactic annotation can be effectively transferred using a small core tagset. In the area of syntactic transfer, it has been shown that syntactic dependencies are more suitable than syntactic constituents for cross-language syntactic transfer (Lin, 1998). Many models of syntactic dependency transfer follow a principle of direct correspondence assumption (DCA), which specifies that syntactic relations between nodes of the source language hold for the corresponding 2916
5 I give him the book. Je lui donne le livre Figure 3: TALP format example <s snum=1>i give him the book.</s> <s snum=1>je lui donne le livre.</s> <s snum=1>i:1 give:3 him:2 the:4 book:5.:6</s> Figure 4: NAACL format example Je lui donne le livre. NULL ({ }) I ({ 1 }) give ({ 3 }) him ({ 2 }) the ({ 4 }) book ({ 5 }). ({ 6 }) Figure 5: GIZA format example <sentence id="1"> <Source> <word align="1" form="i" id="1"/> <word align="3" form="gave" id="2"/> <word align="2" form="him" id="3"/> <word align="4" form="the" id="4"/> <word align="5" form="book" id="5"/> <word align="6" form="." id="6"/> </Source> <Target> <word form="je" id="1"/> <word form="lui" id="2"/> <word form="donne" id="3"/> <word form="le" id="4"/> <word form="livre" id="5"/> <word form="." id="6"/> </Target> </sentence> Figure 6: XML format example aligned nodes of the target language (Hwa et al., 2005). Several experiments have evaluated the DCA model for different languages. For example, Hwa et al. (2005) performed several experiments on languages with different word order, namely English, Spanish, and Chinese. While the direct projection from English yielded low unlabeled dependency F-scores (37% for Chinese and 38% for Spanish), the errors mostly occurred in cases where the target language required more projections than the source language (English). For instance, Chinese aspectual markers are not realized as a separate projection in English; therefore, they are left unlabeled during the transfer. The application of simple language-specific transformation, however, increased the accuracy to 68% for Chinese and 72% for Spanish transfer. In SWIFT aligner, both POS and syntactic depen- 2917
6 Figure 7: An example of aligned sentences with POS tags and source language dependencies displayed. dencies transfer procedures are available. The user can import POS annotations, dependency annotations, or both for the source language. Then, the system transfers the annotation to the target language. We follow a simple procedure in which the POS and dependencies are, essentially, assigned to the aligned structures in target language. Thus we are applying the direct correspondence assumption by Hwa et al. (2005). A situation where no alignment exists is presently resolved by assigning a tag NA to the target structure. This can then be manually post-corrected by the user (see next section). Note that currently, our focus is not on improving current cross-transfer strategies but rather to provide a user-friendly tool that allows for an easy inspection and correction of the transfer results. 4. Correcting Linguistic Annotation In order to further enhance SWIFT Aligner for linguistically oriented end users, we include the option of editing the POS tags and syntactic dependencies via the GUI, see Figure 7. The combination of annotation post-editing and alignment enhances the efficiency of manual correction as the user is able to compare target and source languages visually. To correct POS tags, the user can click on the the POS tags field and then correct the POS tags by typing in the correct version. Since in a cross-language transfer situation, it is possible that the user may want to introduce new POS tags for the target language, we do not check for consistency. Instead, the user has the possibility of accessing a list of all POS tags for each language. To correct the dependencies, the user can drag the dependency arc in the same way as for the alignment. To correct dependency labels, the strategy is the same as for POS tags. When the desired correction state is reached, the new results can be saved to the preferred output format. 5. Conclusions and Future Work We have introduced a new tool for parallel corpora visualization, editing, and (morpho-)syntactic 2918
7 cross-language transfer. This tool is intuitive and easy to use. It can be beneficial to researchers working with parallel corpora, as well as the broader linguistic community in cases where quick annotations of target languages are required. For the future, we are planning to integrate stateof-the-art cross-language transfer algorithms for POS tagging and dependency parsing. This will include the integration of a POS tagger and a dependency parser, which can be retrained on the target language annotations. 6. References Lars Ahrenberg, Mikael Andersson, and Magnus Merkel A system for incremental and interactive word linking. In Third International Conference on Language Resources and Evaluation (LREC), pages , Las Palmas, Gran Canaria. Helena M. Caseli, Maria G. V. Nunes, and Mikel L. Forcada LIHLA: Shared task system description. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages , Ann Arbor, MI. Helena Caseli, Felipe T. Gomes, Thiago A.S. Pardo, and Maria das Graças V. Nunes VisualLIHLA: The visual online tool for lexical alignment. In Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web, pages , Vila Velha, Brazil. Ulrich Germann Yawat: Yet another word alignment tool. In Proceedings of the ACL-08, pages 20 23, Columbus, OH. Rebecca Hwa and Nitin Madnani, The UMIACS word alignment interface. University of Maryland Institute for Advanced Computer Studies. Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3): Kevin Knight and Daniel Marcu Machine translation in the year In Proceedings of ICASSP, Montréal, Canada. Percy Liang, Ben Taskar, and Dan Klein Alignment by agreement. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, HLT- NAACL 06, pages , New York, NY. Dekang Lin Dependency-based evaluation of minipar. In Proceedings of the LREC Workshop on the Evaluation of Parsing Systems, Granada, Spain. Magnus Merkel, Lars Ahrenberg, and Michael Pettersted. 2003a. Interactive word alignment for language engineering. In Proceedings of the Tenth Conference of the European Chapter of the ACL, pages 49 52, Budapest, Hungary. Magnus Merkel, Michael Pettersted, and Lars Ahrenberg. 2003b. Interactive word alignment for corpus linguistics. In In Proceedings of Corpus Linguistics 2003, Lancaster, UK. Franz Josef Och and Hermann Ney A comparison of alignment models for statistical machine translation. In Proceedings of the 18th Conference on Computational Linguistics, pages , Saarbrücken, Germany. Noah A. Smith and Michael E. Jahr Cairo: An alignment visualization tool. In Second International Conference on Linguistic Resources and Evaluation (LREC-2000). Dan Tufis From word alignment to word senses, via multilingual wordnets. In Computer Science Journal of Moldova, volume 14, pages Dan Tufis Exploiting aligned parallel corpora in multilingual studies and applications. In Proceedings of the 1st International Conference on Intercultural Collaboration, pages , Kyoto, Japan. Dekai Wu Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3): David Yarowsky and Grace Ngai Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the 2nd Meeting of NAACL, pages 1 8, Pittsburgh, PA. 2919
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1
Shallow Parsing with Apache UIMA
Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland [email protected] Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic
Brill s rule-based PoS tagger
Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based
An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation
An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation Robert C. Moore Chris Quirk Microsoft Research Redmond, WA 98052, USA {bobmoore,chrisq}@microsoft.com
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990
Chapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia Sungchul Kim POSTECH Pohang, South Korea [email protected] Abstract In this paper we propose a method to automatically
Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output
Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output Christian Federmann Language Technology Lab, German Research Center for Artificial Intelligence, Stuhlsatzenhausweg 3, D-66123 Saarbrücken,
Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation
Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation Rabih Zbib, Gretchen Markiewicz, Spyros Matsoukas, Richard Schwartz, John Makhoul Raytheon BBN Technologies
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Statistical Machine Translation Lecture 4. Beyond IBM Model 1 to Phrase-Based Models
p. Statistical Machine Translation Lecture 4 Beyond IBM Model 1 to Phrase-Based Models Stephen Clark based on slides by Philipp Koehn p. Model 2 p Introduces more realistic assumption for the alignment
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems Dr. Ananthi Sheshasaayee 1, Angela Deepa. V.R 2 1 Research Supervisior, Department of Computer Science & Application,
A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow
A Framework-based Online Question Answering System Oliver Scheuer, Dan Shen, Dietrich Klakow Outline General Structure for Online QA System Problems in General Structure Framework-based Online QA system
Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde
Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction
Effective Self-Training for Parsing
Effective Self-Training for Parsing David McClosky [email protected] Brown Laboratory for Linguistic Information Processing (BLLIP) Joint work with Eugene Charniak and Mark Johnson David McClosky - [email protected]
Clustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
Annotation Guidelines for Dutch-English Word Alignment
Annotation Guidelines for Dutch-English Word Alignment version 1.0 LT3 Technical Report LT3 10-01 Lieve Macken LT3 Language and Translation Technology Team Faculty of Translation Studies University College
TRANSREAD: Designing a Bilingual Reading Experience with Machine Translation Technologies
TRANSREAD: Designing a Bilingual Reading Experience with Machine Translation Technologies François Yvon and Yong Xu and Marianna Apidianaki LIMSI, CNRS, Université Paris-Saclay 91 403 Orsay {yvon,yong,marianna}@limsi.fr
An Online Service for SUbtitling by MAchine Translation
SUMAT CIP-ICT-PSP-270919 An Online Service for SUbtitling by MAchine Translation Annual Public Report 2011 Editor(s): Contributor(s): Reviewer(s): Status-Version: Volha Petukhova, Arantza del Pozo Mirjam
Research Portfolio. Beáta B. Megyesi January 8, 2007
Research Portfolio Beáta B. Megyesi January 8, 2007 Research Activities Research activities focus on mainly four areas: Natural language processing During the last ten years, since I started my academic
Translating the Penn Treebank with an Interactive-Predictive MT System
IJCLA VOL. 2, NO. 1 2, JAN-DEC 2011, PP. 225 237 RECEIVED 31/10/10 ACCEPTED 26/11/10 FINAL 11/02/11 Translating the Penn Treebank with an Interactive-Predictive MT System MARTHA ALICIA ROCHA 1 AND JOAN
Chinese Open Relation Extraction for Knowledge Acquisition
Chinese Open Relation Extraction for Knowledge Acquisition Yuen-Hsien Tseng 1, Lung-Hao Lee 1,2, Shu-Yen Lin 1, Bo-Shun Liao 1, Mei-Jun Liu 1, Hsin-Hsi Chen 2, Oren Etzioni 3, Anthony Fader 4 1 Information
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
a Chinese-to-Spanish rule-based machine translation
Chinese-to-Spanish rule-based machine translation system Jordi Centelles 1 and Marta R. Costa-jussà 2 1 Centre de Tecnologies i Aplicacions del llenguatge i la Parla (TALP), Universitat Politècnica de
Customizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013
Markus Dickinson Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013 1 / 34 Basic text analysis Before any sophisticated analysis, we want ways to get a sense of text data
An End-to-End Discriminative Approach to Machine Translation
An End-to-End Discriminative Approach to Machine Translation Percy Liang Alexandre Bouchard-Côté Dan Klein Ben Taskar Computer Science Division, EECS Department University of California at Berkeley Berkeley,
Language and Computation
Language and Computation week 13, Thursday, April 24 Tamás Biró Yale University [email protected] http://www.birot.hu/courses/2014-lc/ Tamás Biró, Yale U., Language and Computation p. 1 Practical matters
Interactive Dynamic Information Extraction
Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken
WebLicht: Web-based LRT services for German
WebLicht: Web-based LRT services for German Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft, University of Tübingen [email protected] Abstract This software
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1
Korpus-Abfrage: Werkzeuge und Sprachen Gastreferat zur Vorlesung Korpuslinguistik mit und für Computerlinguistik Charlotte Merz 3. Dezember 2002 Motivation Lizentiatsarbeit: A Corpus Query Tool for Automatically
Turker-Assisted Paraphrasing for English-Arabic Machine Translation
Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
Applying Co-Training Methods to Statistical Parsing. Anoop Sarkar http://www.cis.upenn.edu/ anoop/ [email protected]
Applying Co-Training Methods to Statistical Parsing Anoop Sarkar http://www.cis.upenn.edu/ anoop/ [email protected] 1 Statistical Parsing: the company s clinical trials of both its animal and human-based
Special Topics in Computer Science
Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS
Hybrid Strategies. for better products and shorter time-to-market
Hybrid Strategies for better products and shorter time-to-market Background Manufacturer of language technology software & services Spin-off of the research center of Germany/Heidelberg Founded in 1999,
GRASP: Grammar- and Syntax-based Pattern-Finder for Collocation and Phrase Learning
PACLIC 24 Proceedings 357 GRASP: Grammar- and Syntax-based Pattern-Finder for Collocation and Phrase Learning Mei-hua Chen a, Chung-chi Huang a, Shih-ting Huang b, and Jason S. Chang b a Institute of Information
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
Text Generation for Abstractive Summarization
Text Generation for Abstractive Summarization Pierre-Etienne Genest, Guy Lapalme RALI-DIRO Université de Montréal P.O. Box 6128, Succ. Centre-Ville Montréal, Québec Canada, H3C 3J7 {genestpe,lapalme}@iro.umontreal.ca
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
Parsing Technology and its role in Legacy Modernization. A Metaware White Paper
Parsing Technology and its role in Legacy Modernization A Metaware White Paper 1 INTRODUCTION In the two last decades there has been an explosion of interest in software tools that can automate key tasks
Language Model of Parsing and Decoding
Syntax-based Language Models for Statistical Machine Translation Eugene Charniak ½, Kevin Knight ¾ and Kenji Yamada ¾ ½ Department of Computer Science, Brown University ¾ Information Sciences Institute,
Chinese-Japanese Machine Translation Exploiting Chinese Characters
Chinese-Japanese Machine Translation Exploiting Chinese Characters CHENHUI CHU, TOSHIAKI NAKAZAWA, DAISUKE KAWAHARA, and SADAO KUROHASHI, Kyoto University The Chinese and Japanese languages share Chinese
Context Grammar and POS Tagging
Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 [email protected] [email protected]
DEPENDENCY PARSING JOAKIM NIVRE
DEPENDENCY PARSING JOAKIM NIVRE Contents 1. Dependency Trees 1 2. Arc-Factored Models 3 3. Online Learning 3 4. Eisner s Algorithm 4 5. Spanning Tree Parsing 6 References 7 A dependency parser analyzes
Dutch Parallel Corpus
Dutch Parallel Corpus Lieve Macken [email protected] LT 3, Language and Translation Technology Team Faculty of Applied Language Studies University College Ghent November 29th 2011 Lieve Macken (LT
3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work
Unsupervised Paraphrase Acquisition via Relation Discovery Takaaki Hasegawa Cyberspace Laboratories Nippon Telegraph and Telephone Corporation 1-1 Hikarinooka, Yokosuka, Kanagawa 239-0847, Japan [email protected]
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
Factored Translation Models
Factored Translation s Philipp Koehn and Hieu Hoang [email protected], [email protected] School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom
English to Arabic Transliteration for Information Retrieval: A Statistical Approach
English to Arabic Transliteration for Information Retrieval: A Statistical Approach Nasreen AbdulJaleel and Leah S. Larkey Center for Intelligent Information Retrieval Computer Science, University of Massachusetts
SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
Applying Statistical Post-Editing to. English-to-Korean Rule-based Machine Translation System
Applying Statistical Post-Editing to English-to-Korean Rule-based Machine Translation System Ki-Young Lee and Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research
An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines)
An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines) James Clarke, Vivek Srikumar, Mark Sammons, Dan Roth Department of Computer Science, University of Illinois, Urbana-Champaign.
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives
Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives Ramona Enache and Adam Slaski Department of Computer Science and Engineering Chalmers University of Technology and
Statistical NLP Spring 2008. Machine Translation: Examples
Statistical NLP Spring 2008 Lecture 11: Word Alignment Dan Klein UC Berkeley Machine Translation: Examples 1 Machine Translation Madame la présidente, votre présidence de cette institution a été marquante.
Building a Web-based parallel corpus and filtering out machinetranslated
Building a Web-based parallel corpus and filtering out machinetranslated text Alexandra Antonova, Alexey Misyurev Yandex 16, Leo Tolstoy St., Moscow, Russia {antonova, misyurev}@yandex-team.ru Abstract
Training and evaluation of POS taggers on the French MULTITAG corpus
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing
1 Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing Lourdes Araujo Dpto. Sistemas Informáticos y Programación, Univ. Complutense, Madrid 28040, SPAIN (email: [email protected])
PyCantonese: Cantonese linguistic research in the age of big data
PyCantonese: Cantonese linguistic research in the age of big data Jackson L. Lee University of Chicago http://jacksonllee.com Childhood Bilingualism Research Center, CUHK September 15, 2015 Grammar versus
Translation Solution for
Translation Solution for Case Study Contents PROMT Translation Solution for PayPal Case Study 1 Contents 1 Summary 1 Background for Using MT at PayPal 1 PayPal s Initial Requirements for MT Vendor 2 Business
Simple Type-Level Unsupervised POS Tagging
Simple Type-Level Unsupervised POS Tagging Yoong Keok Lee Aria Haghighi Regina Barzilay Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology {yklee, aria42, regina}@csail.mit.edu
Example-based Translation without Parallel Corpora: First experiments on a prototype
-based Translation without Parallel Corpora: First experiments on a prototype Vincent Vandeghinste, Peter Dirix and Ineke Schuurman Centre for Computational Linguistics Katholieke Universiteit Leuven Maria
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
Automatic Pronominal Anaphora Resolution in English Texts
Computational Linguistics and Chinese Language Processing Vol. 9, No.1, February 2004, pp. 21-40 21 The Association for Computational Linguistics and Chinese Language Processing Automatic Pronominal Anaphora
MEDAR Mediterranean Arabic Language and Speech Technology An intermediate report on the MEDAR Survey of actors, projects, products
MEDAR Mediterranean Arabic Language and Speech Technology An intermediate report on the MEDAR Survey of actors, projects, products Khalid Choukri Evaluation and Language resources Distribution Agency;
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR Arati K. Deshpande 1 and Prakash. R. Devale 2 1 Student and 2 Professor & Head, Department of Information Technology, Bharati
Central and South-East European Resources in META-SHARE
Central and South-East European Resources in META-SHARE Tamás VÁRADI 1 Marko TADIĆ 2 (1) RESERCH INSTITUTE FOR LINGUISTICS, MTA, Budapest, Hungary (2) FACULTY OF HUMANITIES AND SOCIAL SCIENCES, ZAGREB
Mining Opinion Features in Customer Reviews
Mining Opinion Features in Customer Reviews Minqing Hu and Bing Liu Department of Computer Science University of Illinois at Chicago 851 South Morgan Street Chicago, IL 60607-7053 {mhu1, liub}@cs.uic.edu
Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation
Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation Mu Li Microsoft Research Asia [email protected] Dongdong Zhang Microsoft Research Asia [email protected]
ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS
ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS Divyanshu Chandola 1, Aditya Garg 2, Ankit Maurya 3, Amit Kushwaha 4 1 Student, Department of Information Technology, ABES Engineering College, Uttar Pradesh,
Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization
Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization Alfio Gliozzo and Carlo Strapparava ITC-Irst via Sommarive, I-38050, Trento, ITALY {gliozzo,strappa}@itc.it
The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58. Ncode: an Open Source Bilingual N-gram SMT Toolkit
The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58 Ncode: an Open Source Bilingual N-gram SMT Toolkit Josep M. Crego a, François Yvon ab, José B. Mariño c c a LIMSI-CNRS, BP 133,
ANALEC: a New Tool for the Dynamic Annotation of Textual Data
ANALEC: a New Tool for the Dynamic Annotation of Textual Data Frédéric Landragin, Thierry Poibeau and Bernard Victorri LATTICE-CNRS École Normale Supérieure & Université Paris 3-Sorbonne Nouvelle 1 rue
Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets
Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells Computer Science Dep., Universidad Autonoma de Madrid, 28049 Madrid, Spain
A chart generator for the Dutch Alpino grammar
June 10, 2009 Introduction Parsing: determining the grammatical structure of a sentence. Semantics: a parser can build a representation of meaning (semantics) as a side-effect of parsing a sentence. Generation:
