The current state of work on the Polish-Ukrainian Parallel Corpus (PolUKR).
Natalia Kotsyba
Institute of Slavic Studies PAS (Warsaw)

Objectives of creating the corpus

PolUKR¹, a Polish-Ukrainian parallel corpus, was launched as a pilot corpus project at the Institute of Slavic Studies of the Polish Academy of Sciences. The corpus is intended as a tool for both human and machine users, and as language material for compiling bilingual Polish<>Ukrainian dictionaries and a contrastive grammar of Polish and Ukrainian. It can also be used as a translation database and as language learning material.

Acquisition and preprocessing of parallel texts

Currently the corpus contains about 2 million tokens (about 500K tokens in 70 parallel texts are publicly available for search through the web interface) that represent mostly modern Ukrainian and Polish literature (the second half of the 20th century). Some of the texts were received from the translators², after which the corresponding originals were located and prepared accordingly. Another group was downloaded from existing digital libraries³. The quality of these texts was often unsatisfactory: in most cases the electronic texts had been produced by scanning paper editions and running automatic Optical Character Recognition (OCR), and they needed further correction.

A large group of texts existed only as hard copies. They were scanned; cleaned of images, page numbers and other unnecessary information; OCRed with FineReader 9.0; checked for OCR errors; saved as MS Word documents; and converted into simple UTF-8 encoded XML files. The XML files keep the division into paragraphs, which was extracted from the .doc files with the help of the AutoReplace function.

The text metadata are recorded in a MySQL database on the server. They include (if available): author, title, language, year of creation, place and year of publication, publishing house, genre, translator, year of translation, source and original format of the text, etc. This information may be used to restrict the scope of the search, e.g. one can choose only the texts created after a specific date or by a specific author.

¹ The current, extended version of the corpus is partially supported by MNiSW (Polish Ministry of Science and Higher Education) grant N N.
² We would like to thank Katarzyna Kotyńska, Anna Łazar, Ola Hnatiuk and Helena Krasowska for sharing their texts.
³ Some of the libraries used can be found at:

Structural annotation

The texts are segmented into chunks of two types: paragraphs and sentences. Sentences are always parts of paragraphs. This document structure is encoded in a corresponding Document Type Definition (DTD) file.

Morphological annotation

To add morphosyntactic information to the Polish texts we use the freely available TaKIPI toolset developed by Marcin Woliński, Adam Przepiórkowski, Adam Radziszewski and Maciej Piasecki, which includes a text chunker, a lemmatizer, a morphological tagger and a disambiguator. Morphological tags are stored as value lists containing the morphological class and the grammatical categories relevant for that class; e.g., the grammatical characteristics of jedziecie ('you (pl.) go') are fin:pl:sec:imperf (finite verb form, plural, second person, imperfective aspect). If a segment is ambiguous, several tags are listed. After the disambiguation procedure the most probable candidate is given the disambiguation value 1.

An example of a tagged chunk, Dokąd jedziecie?:

<chunk type="p" xlink:href="#p5">
  <chunk type="s">
    <tok>
      <orth>dokąd</orth>
      <lex disamb="1"> <base>dokąd</base> <ctag>qub</ctag> </lex>
    </tok>
    <tok>
      <orth>jedziecie</orth>
      <lex disamb="1"> <base>jechać</base> <ctag>fin:pl:sec:imperf</ctag> </lex>
    </tok>
    <tok>
      <orth>?</orth>
      <lex disamb="1"> <base>?</base> <ctag>interp</ctag> </lex>
    </tok>
  </chunk>
</chunk>

For Ukrainian we use the UGS (Ukrainian Grammatical Dictionary) developed at the ULIF NASU by Igor Shevchenko and Oleksandr Rabulets, which enables lemmatization and morphological annotation of texts, although it does not support disambiguation at the moment.
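The annotation format described above can be read back with a few lines of code. The following is a minimal sketch in Python, not part of the PolUKR tooling: the function name read_tokens is hypothetical, and the xmlns:xlink declaration is an assumption added only so that the fragment parses on its own. It picks the reading marked disamb="1" for each token and splits the tag into the morphological class and its grammatical categories.

```python
import xml.etree.ElementTree as ET

# The xmlns:xlink declaration is an assumption added here so that the
# fragment parses on its own; the rest mirrors the example in the text.
SAMPLE = """<chunk type="p" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#p5">
  <chunk type="s">
    <tok><orth>dokąd</orth>
      <lex disamb="1"><base>dokąd</base><ctag>qub</ctag></lex></tok>
    <tok><orth>jedziecie</orth>
      <lex disamb="1"><base>jechać</base><ctag>fin:pl:sec:imperf</ctag></lex></tok>
    <tok><orth>?</orth>
      <lex disamb="1"><base>?</base><ctag>interp</ctag></lex></tok>
  </chunk>
</chunk>"""


def read_tokens(xml_text):
    """Return (orth, base, class, categories) for each token's chosen reading."""
    root = ET.fromstring(xml_text)
    tokens = []
    for tok in root.iter("tok"):
        readings = tok.findall("lex")
        if not readings:
            continue  # token without any morphological reading
        # Prefer the reading marked disamb="1"; otherwise fall back to the first.
        chosen = next((lex for lex in readings if lex.get("disamb") == "1"),
                      readings[0])
        # The first tag field is the class, the remaining ones its categories.
        word_class, *categories = chosen.findtext("ctag", default="").split(":")
        tokens.append((tok.findtext("orth"), chosen.findtext("base"),
                       word_class, categories))
    return tokens


for orth, base, word_class, categories in read_tokens(SAMPLE):
    print(orth, base, word_class, categories)
```

Falling back to the first reading when none is marked is only one possible policy; for Ukrainian, where disambiguation is not yet available, a real tool would more likely keep all readings.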
A common morphosyntactic tagset for Polish and Ukrainian was developed for the needs of the corpus on the basis of the resources mentioned above; see [Kotsyba et al. 2008, Коциба 2009] for details. Language-specific categories and values are preserved, as our intention was not to lose any information. Not all the details will be visible at the GUI search level, but they will be accessible to advanced users through self-defined regex-based corpus queries. The basic changes we had to introduce include a finer POS granularity for Ukrainian and a regrouping of some word classes for Polish to fit a more traditional understanding of the parts of speech. These quasi-changes are realized through a mechanism of aliases and affect only the GUI search level. Information about degree for Ukrainian adjectives and adverbs has also been moved from the lexical to the grammatical level, in keeping with both traditional grammars and the currently accepted NLP treatment of degree as a grammatical category. The special treatment of predicatives, which we also follow, is described in detail in [Derzhanski, Kotsyba 2008]. The format shown above was also used for Ukrainian when converting the original annotated files.

Alignment

Presently the parallel texts are aligned at the paragraph level dynamically, i.e. paragraphs are enumerated during the search, and the paragraphs with the same ordinal numbers as those in which the searched fragment is found are shown along with the KWIC (keyword in context) lines. Differences in paragraph division had to be removed manually, so that the ordinal numbers and content of corresponding paragraphs were equal. This situation is provisional: paragraph-level alignment is unsatisfactory, as most paragraphs are too long for the searched equivalent to be spotted easily. The intended alignment levels are sentences and, eventually, words.

One of the freely available programs that align parallel texts at the sentence level is the language-independent HunAlign.
The result of the alignment is recorded either as an interleaved text or as sets of corresponding sentences, so-called link groups, represented by sentence numbers or other identifiers. Additional numeric information about the accuracy of the alignment can be included as well. The program provides for the use of a bilingual dictionary to ensure higher alignment accuracy. If no such dictionary is available, the program can generate one itself from the bitexts it is currently given.

The results of aligning Polish and Ukrainian texts without a dictionary were far from satisfactory. For more accurate alignment we have developed a bilingual dictionary structured according to the HunAlign requirements. It is recorded as plain text in which each entry takes a separate line pairing the original word with an equivalent word or expression. Since many words and expressions have several equivalents due to polysemy, the same entry on the left side can be repeated with different equivalents. The alignment dictionary was generated automatically from the database version of the Polish-Ukrainian dictionary that is currently being developed as a joint project of ULIF NASU and ISS PAS.

Fragment of the dictionary: окис свинцю білолиций білоплечий білорущина все, що білоруське папіросний папір цигарковий папір манія збирати старі рукописи батожити батоження батіжок подібний до батіжка бити б'ю бігти

Since both Polish and Ukrainian are highly inflected languages, basic dictionary forms are not enough. Either we need lemmatized texts, or a dictionary with all possible forms generated. The first option seems easier to realize, but it requires adjusting the alignment algorithm and working with already annotated texts.

Another option for aligning is TextAlign, a user-friendly program with a GUI and editing facilities. The only possible input format is RTF (Rich Text Format); the output is a TMX file with an interleaved parallel text. The main problem, the unequal number of sentences in parallel texts, which affected the quality of the results produced by the fully automatic and hardly controllable HunAlign, is compensated for by easy and quick editing of the alignment in TextAlign. However, the sentence segmentation algorithm in TextAlign is too simple for satisfactory results.

Example of alignment results by TextAlign, pre-editing phase
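The limitation of punctuation-only sentence segmentation noted above can be illustrated with a short sketch in Python. It is not TextAlign's actual algorithm: the function names are hypothetical, and the abbreviation list is an assumed, deliberately tiny stop list. The sample text is adapted from the Polish passage quoted later in this paper.

```python
import re

# Hypothetical stop list of abbreviations; a real list would be
# language specific and much longer.
ABBREVIATIONS = {"r.", "np.", "prof.", "tzw."}

TEXT = ("Dokonany przez Namiera wybór 1848 r. jako punktu wyjścia był "
        "jak najbardziej uzasadniony. Pogląd ten jest jednak błędny.")


def split_naive(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_with_stoplist(text, abbreviations=ABBREVIATIONS):
    """Same idea, but a word found in the stop list never ends a sentence."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences


print(len(split_naive(TEXT)))          # the full stop in "r." causes an extra split
print(len(split_with_stoplist(TEXT)))  # the stop list keeps "1848 r. jako ..." together
```

The stop list is itself imperfect: an abbreviation that really does end a sentence is no longer treated as a boundary, which is one reason why additional heuristics are needed.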
The TextAlign example shows that sentence borders are determined from punctuation marks alone, without considering common abbreviations ending in full stops, which can produce wrong sentence segmentation.

Example of a manual splitting procedure with the help of TextAlign.

At the moment we are developing a program, PLUczeK, that will combine the features of HunAlign and TextAlign. It will include an editable plug-in module for text segmentation at the paragraph and sentence levels, which should ensure the language independence of the program. The sentence segmentation module is rule-based: it presupposes the use of such heuristics as a list of common abbreviations functioning as a stop list, combinations and sequences of abbreviations and punctuation marks, and the forms in which reported speech is presented (which can also differ across languages); cf. also [Rudolf, 2004]. The program will work with both plain texts and morphologically annotated XML files, addressing either the actual form of a token or its lemma, and using grammatical information for sentence segmentation (a verb or a preposition cannot be a proper name; hence, when written with a capital letter, it signals the beginning of a sentence, etc.). The program will also have a GUI and enable editing of the segmentation.

We have chosen the XCES format for alignment records. The information about corresponding sentences is stored in a separate file. An example fragment of an alignment file is given below (sentences 1 and 2 of the second link group are translated as one sentence).

<cesalign>
  <cesheader>
    <profiledesc>
      <translations xml:base="...">
        <translation trans.loc="exampleana.ua.xml" lang="ua" xml:lang="ua" n="1" />
        <translation trans.loc="exampleana.pl.xml" lang="pl" xml:lang="pl" n="2" />
      </translations>
    </profiledesc>
  </cesheader>
  <linklist>
    <linkgrp id="p1" targtype="s">
      <link>
        <align xlink:href="#p1s1" />
        <align xlink:href="#p1s1" />
      </link>
      <link>
        <align xlink:href="#p1s2" />
        <align xlink:href="#p1s2" />
      </link>
    </linkgrp>
    <linkgrp id="p2" targtype="s">
      <link>
        <align xlink:href="#xpointer(id('p2s1')/range-to(id('p2s2')))" />
        <align xlink:href="#p2s1" />
      </link>
      <link>
        <align xlink:href="#p2s3" />
        <align xlink:href="#p2s2" />
      </link>
    </linkgrp>
  </linklist>
</cesalign>

Even sentence alignment cannot reach 100% accuracy, for objective reasons. In the example below, three Polish sentences (the second to the fourth) correspond to a single Ukrainian sentence.

Polish: Dokonany przez Namiera wybór 1848 r. jako punktu wyjścia był jak najbardziej uzasadniony. Zgodnie z obiegową opinią, rok ów stanowił punkt zwrotny w historii, w którym historii nie udało się dokonać zwrotu. Pogląd ten jest jednak błędny to właśnie wtedy wybuchły pierwsze europejskie rewolucje. Ich epicentrum stanowiła Francja, ale ruchy rewolucyjne objęły również Palermo, Neapol, Wiedeń, Berlin, Budę i Poznań, żeby wymienić tylko kilka miast.

Ukrainian: Нейм єр обирає як вихідний пункт 1848 рік, і цей вибір добре обґрунтований. Існує відоме затерте кліше, що 1848 рік був поворотним пунктом історії, але тоді історія не змогла повернути в інший бік, проте це не так це рік перших европейських революцій: їхнім епіцентром була Франція, також революції відбулися у Палермо, Неаполі, Відні, Берліні, Буді й Познані і це ще далеко не всі міста.

This means that mistakes are practically unavoidable, especially with large amounts of text, but it is still possible to keep the general quality of the corpus sufficient for working with it and obtaining objective results.

Conclusions and further work
The current state of PolUKR already enables searching for translation equivalents, and the corpus can be used as a translation memory database by human translators, researchers and machines. The corpus can nevertheless be enhanced in a number of ways, such as a finer alignment level and further annotation of different types, including semantic and referential information. Automatic word-level alignment can be of significant help in compiling bilingual dictionaries. The search engine has to be adjusted to enable searching for the new information as well.

Literature

Broda Bartosz, Piasecki Maciej & Radziszewski Adam. Towards a Set of General Purpose Morphosyntactic Tools for Polish. Proceedings of Intelligent Information Systems, Zakopane, Poland, Institute of Computer Science PAS.

Ivan Derzhanski and Natalia Kotsyba. The Category of Predicatives in the Light of Consistent Morphosyntactic Tagging of Slavic Languages. Proceedings of the International Workshop within MONDILEX project, Moscow, 2-4 October.

Hunalign - sentence level aligner.

Natalia Kotsyba, Olha Shypnivska and Magdalena Turska. Linguistic principles of organizing a common morphological tagset for PolUKR (Polish-Ukrainian Parallel Corpus). Proceedings of Intelligent Information Systems, Zakopane, Poland, Institute of Computer Science PAS.

Adam Przepiórkowski and Marcin Woliński. A Flexemic Tagset for Polish. In: The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL.

Michał Rudolf. Metody automatycznej analizy korpusu tekstów polskich. Pozyskiwanie, wzbogacanie i przetwarzanie informacji lingwistycznych. Warszawa.

TextAlign in MT2007 (Memory Translation Computer Aided Tool).

Magdalena Turska and Natalia Kotsyba. Polsko-Ukraiński korpus równoległy (PolUKR). Materiały LXIII Zjazdu Polskiego Towarzystwa Językoznawczego, Warszawa.

Magdalena Turska and Natalia Kotsyba. Polish-Ukrainian Parallel Corpus and its Possible Applications. Proceedings of the International Conference "Practical Applications in Language and Computers", 7-9 April, Łódź. Peter Lang GmbH.

v. Waldenfels, R. Compiling a parallel corpus of Slavic languages. Text strategies, tools and the question of lemmatization in alignment. In: Brehmer, B., Zdanova, V., Zimny, R. (Hrsg.): Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 9. München.

Коциба Наталія. Принципи морфосинтактичного таґування польсько-українського паралельного корпусу (PolUKR). Proceedings of the International Conference MegaLing'2008. Horizons of Applied Linguistics and Linguistic Technologies, Parthenit, Crimea, Ukraine, September 2008, 2009 (in preparation).

Широков В.А., О.В.Бугаков, Т.О.Грязнухіна, О.М.Костишин, М.Ю.Кригін, Т.П.Любченко, О.Г.Рабулець, О.О.Сидоренко, Н.М.Сидорчук, І.В.Шевченко, О.О.Шипнівська, К.М.Якименко. Корпусна лінгвістика. Київ: Довіра.

Abstract

The article describes the present state of work on PolUKR, the Polish-Ukrainian parallel corpus developed at the Institute of Slavic Studies of the Polish Academy of Sciences. It presents the ways in which bitexts are acquired, their structure and pre-processing stages; the solutions adopted for a common morphosyntactic annotation pattern for Polish and Ukrainian, as well as the annotation methods; and the alignment format and the software used or developed for the needs of the corpus.

Recommendations

One of the objectives of the current project is to develop a scheme for creating a parallel corpus for any pair of Slavic languages. At the moment a researcher who deals with Slavic parallel corpora faces several major problems that need to be addressed. One still unresolved issue is a common morphological annotation tagset for Slavic languages that would ensure uniform search through both parts of a corpus at the same time. Technical bilingual dictionaries for sentence alignment, as well as a user-friendly alignment editor, are necessary to enable controllable high-quality alignment. A free, platform-independent search engine for parallel corpora is also needed.
BILINGUAL TRANSLATION SYSTEM
BILINGUAL TRANSLATION SYSTEM (FOR ENGLISH AND TAMIL) Dr. S. Saraswathi Associate Professor M. Anusiya P. Kanivadhana S. Sathiya Abstract--- The project aims in developing Bilingual Translation System for
a Chinese-to-Spanish rule-based machine translation
Chinese-to-Spanish rule-based machine translation system Jordi Centelles 1 and Marta R. Costa-jussà 2 1 Centre de Tecnologies i Aplicacions del llenguatge i la Parla (TALP), Universitat Politècnica de
Factored Translation Models
Factored Translation s Philipp Koehn and Hieu Hoang [email protected], [email protected] School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE A.J.P.M.P. Jayaweera #1, N.G.J. Dias *2 # Virtusa Pvt. Ltd. No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka * Department of Statistics
Introduction to Text Mining. Module 2: Information Extraction in GATE
Introduction to Text Mining Module 2: Information Extraction in GATE The University of Sheffield, 1995-2013 This work is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing
1 Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing Lourdes Araujo Dpto. Sistemas Informáticos y Programación, Univ. Complutense, Madrid 28040, SPAIN (email: [email protected])
Exalead CloudView. Semantics Whitepaper
Exalead CloudView TM Semantics Whitepaper Executive Summary As a partner in our business, in our life, we want the computer to understand what we want, to understand what we mean, when we interact with
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking Anne-Laure Ligozat LIMSI-CNRS/ENSIIE rue John von Neumann 91400 Orsay, France [email protected] Cyril Grouin LIMSI-CNRS rue John von Neumann 91400
Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia
Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia Outline I What is CALL? (scott) II Popular language learning sites (stella) Livemocha.com (stacia) III IV Specific sites
Computer Aided Document Indexing System
Computer Aided Document Indexing System Mladen Kolar, Igor Vukmirović, Bojana Dalbelo Bašić, Jan Šnajder Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 0000 Zagreb, Croatia
LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*
LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* Jonathan Yamron, James Baker, Paul Bamberg, Haakon Chevalier, Taiko Dietzel, John Elder, Frank Kampmann, Mark Mandel, Linda Manganaro, Todd Margolis,
Database Design For Corpus Storage: The ET10-63 Data Model
January 1993 Database Design For Corpus Storage: The ET10-63 Data Model Tony McEnery & Béatrice Daille I. General Presentation Within the ET10-63 project, a French-English bilingual corpus of about 2 million
TRANSLATING POLISH TEXTS INTO SIGN LANGUAGE IN THE TGT SYSTEM
20th IASTED International Multi-Conference Applied Informatics AI 2002, Innsbruck, Austria 2002, pp. 282-287 TRANSLATING POLISH TEXTS INTO SIGN LANGUAGE IN THE TGT SYSTEM NINA SUSZCZAŃSKA, PRZEMYSŁAW SZMAL,
Timeline (1) Text Mining 2004-2005 Master TKI. Timeline (2) Timeline (3) Overview. What is Text Mining?
Text Mining 2004-2005 Master TKI Antal van den Bosch en Walter Daelemans http://ilk.uvt.nl/~antalb/textmining/ Dinsdag, 10.45-12.30, SZ33 Timeline (1) [1 februari 2005] Introductie (WD) [15 februari 2005]
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate
Interactive Dynamic Information Extraction
Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
