MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE

MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE

A Thesis Submitted for the Degree of Doctor of Philosophy in the School of Engineering

by

ANAND KUMAR M

CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING
AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM
COIMBATORE, TAMILNADU, INDIA

April, 2013

AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE

BONAFIDE CERTIFICATE

This is to certify that the thesis entitled MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE submitted by Mr. ANAND KUMAR M, Reg. No. CB.EN.D*CEN08002, for the award of the Degree of Doctor of Philosophy in the School of Engineering is a bonafide record of the work carried out by him under my guidance and supervision at Amrita School of Engineering, Coimbatore.

Thesis Advisor
Dr. K.P. SOMAN
Professor and Head, Center for Excellence in Computational Engineering and Networking.

AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE
CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING

DECLARATION

I, ANAND KUMAR M (Reg. No. CB.EN.D*CEN08002), hereby declare that this thesis entitled MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE is the record of the original work done by me under the guidance of Dr. K.P. SOMAN, Professor and Head, Center for Excellence in Computational Engineering and Networking, Amrita School of Engineering, Coimbatore, and to the best of my knowledge this work has not formed the basis for the award of any degree/diploma/associateship/fellowship or a similar award to any candidate in any University.

Place: Coimbatore
Date:
Signature of the Student

COUNTERSIGNED
Thesis Advisor
Dr. K.P. SOMAN
Professor and Head
Center for Excellence in Computational Engineering and Networking

TABLE OF CONTENTS

ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
ABSTRACT

1 INTRODUCTION
  1.1 GENERAL
  1.2 OVERVIEW OF MACHINE TRANSLATION
  1.3 ROLE OF MACHINE TRANSLATION IN NLP
  1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM
  1.5 MOTIVATION OF THE THESIS
  1.6 OBJECTIVE OF THE THESIS
  1.7 RESEARCH METHODOLOGY
      Overall System Architecture
      Details of Preprocessing English Language Sentence
      Reordering English Language Sentence
      Factorization of English Language Sentence
      Compounding of English Language Sentence
      Details of Preprocessing Tamil Language Sentence
      Tamil Part-of-Speech Tagger
      Tamil Morphological Analyzer
      Factored SMT System for English to Tamil Language
      Postprocessing for English to Tamil SMT
      Tamil Morphological Generator
  1.8 RESEARCH CONTRIBUTIONS
  1.9 ORGANISATION OF THE THESIS

2 LITERATURE SURVEY
  2.1 PART OF SPEECH TAGGER
      2.1.1 Part Of Speech Tagger for Indian Languages
      Part Of Speech Tagger for Tamil Language
  2.2 MORPHOLOGICAL ANALYZER AND GENERATOR
      Morphological Analyzer and Generator for Indian Languages
      Morphological Analyzer and Generator for Tamil Language
  2.3 MACHINE TRANSLATION SYSTEMS
      Machine Translation Systems for Indian Languages
      Machine Translation Systems for Tamil Language
  2.4 ADDING LINGUISTIC INFORMATION FOR SMT SYSTEM
  2.5 RELATED NLP WORKS IN TAMIL
  2.6 SUMMARY

3 THEORETICAL BACKGROUND
  3.1 GENERAL
      Tamil Language
      Tamil Grammar
      Tamil Characters
      Morphological Richness of Tamil Language
      Challenges in Tamil NLP
      Ambiguity in Morpheme
      Ambiguity in Word Class
      Ambiguity in Word Sense
      Ambiguity in Sentence
  3.2 MORPHOLOGY
      Types of Morphology
      Lexemes
      Lemma and Stems
      Inflections and Word forms
      Morphemes and Types
      Allomorphs
      Morpho-Phonemics
      Morphotactics
  3.3 MACHINE LEARNING FOR NLP
      3.3.1 Machine Learning
      Support Vector Machines
      Geometrical Interpretation of SVM
      SVM Formulation
  3.4 VARIOUS APPROACHES FOR POS TAGGING
      Supervised POS Tagging
      Unsupervised POS Tagging
      Rule based POS Tagging
      Stochastic POS Tagging
      Other Techniques
  3.5 VARIOUS APPROACHES FOR MORPHOLOGICAL ANALYZER
      Two level Morphological Analysis
      Unsupervised Morphological Analyser
      Memory based Morphological Analysis
      Stemmer based Approach
      Suffix Stripping based Approach
  3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION
      Linguistic or Rule based Approaches
      Direct Approach
      Interlingua Approach
      Transfer Approach
      Non Linguistic Approaches
      Dictionary based Approach
      Empirical or Corpus based Approach
      Example based Approach
      Statistical Approach
      Hybrid Machine Translation System
  3.7 EVALUATING STATISTICAL MACHINE TRANSLATION
      Human Evaluation Techniques
      Automatic Evaluation Techniques
      BLEU Score
      NIST Metric
      Precision and Recall
      Edit Distance Measures
  3.8 SUMMARY

4 PREPROCESSING FOR ENGLISH LANGUAGE
  4.1 MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE
      POS and Lemma Information
      Syntactic Information
      Dependency Information
  4.2 DETAILS OF PREPROCESSING ENGLISH SENTENCES
      Reordering English Sentences
      Syntactic Comparison between English and Tamil
      Reordering Methodology
      Factoring English Sentence
      Compounding English Language Sentence
      Morphological Comparison between English and Tamil
      Compounding Methodology for English Sentence
      Integrating Reordering and Compounding
  4.3 SUMMARY

5 PART OF SPEECH TAGGER FOR TAMIL LANGUAGE
  5.1 GENERAL
      Part of Speech Tagging
      Tamil POS Tagging
  5.2 COMPLEXITY IN TAMIL POS TAGGING
      Root Ambiguity
      Noun Complexity
      Verb Complexity
      Adverb Complexity
      Postposition Complexity
  5.3 PART OF SPEECH TAGSET DEVELOPMENT
      Available POS Tagsets for Tamil
      AMRITA POS Tagset
  5.4 DEVELOPMENT OF TAMIL POS CORPORA FOR PREPROCESSING
      5.4.1 Untagged and Tagged Corpus
      Available Corpus for Tamil
      POS Tagged Corpus Development
      Applications of Tagged Corpus
      Details of POS Tagged corpus developed
  5.5 DEVELOPMENT OF POS TAGGER USING SVMTOOL
      SVMTool
      Features of SVMTool
      Components of SVMTool
      SVMTlearn
      SVMTagger
      SVMTeval
  5.6 RESULTS AND COMPARISON WITH OTHER TOOLS
  5.7 ERROR ANALYSIS
  5.8 SUMMARY

6 MORPHOLOGICAL ANALYZER FOR TAMIL
  6.1 GENERAL
      Morphology in Language
      Computational Morphology
      Morphological Analyzer
      Role of Morphological Analyzer in NLP
  6.2 TAMIL MORPHOLOGY
      Tamil Morphology and Language
      Syntax of Tamil Morphology
      Word Formation Rules (WFR) in Tamil
      Tamil Verb Morphology
      Tamil Noun Morphology
      Tamil Morphological Analyzer
      Challenges in Tamil Morphological Analyzer
  6.3 TAMIL MORPHOLOGICAL ANALYZER SYSTEM
  6.4 TAMIL MORPHOLOGICAL ANALYZER FOR NOUNS AND VERBS
      Morphological Analyzer using Machine Learning
      6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer
      Paradigm Classification
      Word forms
      Morphemes
      Data Creation for Noun/Verb Morphological Analyzer
      Issues in Data Creation
      Morphological Tagging Framework using SVMTool
      Support Vector Machine (SVM)
      SVMTool
      Implementation of Morphological Analyzer System
  6.5 MORPH ANALYZER FOR PRONOUN USING PATTERNS
  6.6 MORPH ANALYZER FOR PROPER NOUN USING SUFFIXES
  6.7 RESULTS AND EVALUATION
  6.8 PREPROCESSED ENGLISH AND TAMIL SENTENCE
  6.9 SUMMARY

7 FACTORED SMT SYSTEM FOR ENGLISH TO TAMIL
  7.1 STATISTICAL MACHINE TRANSLATION
  7.2 COMPONENTS OF SMT
      Translation Model
      Expectation Maximization
      Word based Translation Model
      Phrase based Translation Model
      Language Model
      N-gram Language Models
      Statistical Machine Translation Decoder
  7.3 INTEGRATING LINGUISTIC INFORMATION IN SMT
      Factored Translation Models
      Decomposition of Factored Translation
      Syntax based Translation Models
  7.4 TOOLS USED IN SMT SYSTEM
      MOSES
      GIZA++ & MKCLS
      7.4.3 SRILM
  7.5 DEVELOPMENT OF FACTORED CORPORA
      Parallel Corpora Collection
      Monolingual Corpora Collection
      Automatic Creation of Factored Corpora
  7.6 FACTORED SMT FOR ENGLISH TO TAMIL LANGUAGE
      Building Language Model
      Building Phrase based Translation Model
  7.7 SUMMARY

8 POSTPROCESSING FOR ENGLISH TO TAMIL SMT
  8.1 GENERAL
  8.2 MORPHOLOGICAL GENERATOR
      Challenges in Tamil Morphological Generator
      Simplified Part-of-Speech Categories
  8.3 MORPHOLOGICAL GENERATOR FOR TAMIL NOUN AND VERB
      Algorithm for Noun and Verb Morphological Generator
      Word-forms Handled in Morphological Generator
      Data Required for the Algorithm
      Morpho Lexical Information File
      Paradigm Classification Rules
      Suffix Table
      Stemming Rules
  8.4 MORPHOLOGICAL GENERATOR FOR TAMIL PRONOUNS
  8.5 SUMMARY

9 EXPERIMENTS AND RESULTS
  9.1 GENERAL
  9.2 EXPERIMENTAL SETUP AND RESULTS
  9.3 SUMMARY

10 CONCLUSION AND FUTURE WORK
  10.1 SUMMARY
  10.2 SUMMARY OF WORK DONE
  10.3 CONCLUSIONS
  10.4 FUTURE DIRECTIONS

APPENDIX-A
  A.1 TAMIL TRANSLITERATION
  A.2 DETAILS OF AMRITA POS TAGS

APPENDIX-B
  B.1 PENN TREE BANK POS TAGS
  B.2 DEPENDENCY TAGS
  B.3 TAMIL VERB MLI
  B.4 TAMIL NOUN WORD FORM
  B.5 TAMIL VERB WORD FORM
  B.6 MOSES INSTALLATION AND TRAINING
  B.7 COMPARISON WITH GOOGLE OUTPUT
  B.8 GRAPHICAL USER INTERFACES

REFERENCES
AUTHOR'S PUBLICATIONS

ACKNOWLEDGEMENT

I would never have been able to finish my dissertation without the guidance, support and encouragement of numerous people, including my mentors, friends and colleagues, and the support of my family and my wife. At the end of my thesis I would like to thank all those people who made this thesis possible and an unforgettable experience for me.

First and foremost, I feel deeply indebted to Her Holiness Most Revered Mata Amritanandamayi Devi (Amma) for her inspiration and guidance throughout my doctoral studies, both in unseen and unconcealed ways.

Wholeheartedly, I thank our respected Pro Chancellor, Swami Abhayamrita Chaitanya, for providing the necessary environment, infrastructure and encouragement for my research at Amrita Vishwa Vidyapeetham University. I thank Dr. P. Venkat Rangan, our respected Vice Chancellor, for his wholehearted encouragement and support throughout my doctoral studies.

I would like to express my sincere gratitude to my supervisor, Dr. K.P. Soman, Professor and Head, Centre for Excellence in Computational Engineering and Networking (CEN), for his excellent guidance and patience, and for providing an excellent atmosphere for doing research. His wide knowledge and logical way of thinking have been a great source of inspiration for me. I am really happy and proud to say that I am a student of Dr. K.P. Soman. He has always extended a helping hand in solving research problems. The in-depth discussions, scholarly supervision and constructive suggestions received from him have broadened my knowledge. I strongly believe that without his guidance, the present work could not have reached this stage.

I wish to thank my doctoral committee members, Dr. C.S Shunmuga Velayutham and Dr. V.P. Mohandass, for their encouraging words and support throughout this research. I express my heartfelt gratitude to Dr. N.S. Pandian, Dean, PG Programmes, Amrita Vishwa Vidyapeetham, Coimbatore, for the continuous support of my Ph.D study and research.

I wish to thank Dr. S. Rajendran for his supervision, advice and guidance from the very early stage of this research, as well as for giving me extraordinary experiences throughout the work. I express my deepest gratitude to Mrs. V. Dhanalakshmi, Head of the Department-Tamil, SRM University, Chennai. Whatever knowledge I have gained in linguistics is definitely because of her. I also wish to thank my school teacher Mr. B. Vaithiyanathan M.Sc M.Ed for supporting me from my school days. I would like to thank Mr. Arun Sankar K, who, as a good friend since my graduate days, is always willing to help and give his best suggestions.

I express my sincere gratitude to my beloved Director, Dr. K.A. Chinnaraju, and Principal, Dr. N. Nagarajan, CIET, for giving me all the moral support to complete the thesis successfully. I would like to express my gratitude to my Head of the Department, Dr. S. Gunasekaran, who is always inspiring me to complete this thesis work. I would also like to thank Mr. G. Ravi Kumar and Prof. Mrs. Janaki Kumar for their timely support and suggestions. I would like to thank my colleagues at the Department of Computer Science and Engineering, especially Mr. N. Ramkumar, Mr. N. Boopal, Mr. A. Suresh, Mr. M. Yogesh, Mr. C. Prabu and Mr. B. Saravanan, for sharing their enthusiasm and for supporting me from the beginning of my career at CIET.

I wish to express my warm and sincere thanks to Dr. Mrs. M.S Vijaya, HOD (MCA), GRD Krishnamal College for Women, and Dr. M. Sabarimalai Manikandan, SAMSUNG Electronics, for their kind support and direction, which have been of great value in this study. My sincere thanks also go to Mr. Sivaprathap, Mr. Rakesh Peter, Mr. Loganathan, Mr. Antony P J, Mr. Ajit, Mr. Saravanan, Mr. Kathir, Mr. Senthil, Mr. V Anand Kumar, Mrs. Latha Menon and Sampath Kumar of the CEN department for supporting me in every way. I also express my sense of gratitude to my friends Ms. Resmi N.G and Ms. Preeja for their encouragement and guidance.

My research would not have been possible without the help of my friends C. Murugesan, S. Ramakrishnan, S. Mohanraj and A. Baladhandapani. I would like to thank them for being with me in all circumstances.

I wish to give special thanks to my friends Mrs. Rekha Kishore, Mr. C. Arun Kumar, Mrs. Padmavathy and Mr. Tirumeni for supporting me in this research.

I would like to thank my Grandpa Mr. M. Narayanasamy and Mr. A. Peter, who left us too soon. I hope that this work will make them proud. I would like to thank my uncle Mr. P.M. Palraj and aunt Mrs. P. Rajeswari for their encouragement and motivation in my difficult moments during the long years of my education. I would also like to express my deepest gratitude to my Grandma Mrs. N. Valliyammal and my uncles Mr. N. Natesapandiyan and Mr. N. Pandiyan for supporting me from my school days.

I want to thank my parents, Mr. N. Madasamy and Mrs. M. Manohari, for their kind support and for the confidence and love they have shown me. You have been my greatest strength and I am blessed to be your son. I would also like to give special thanks to my beloved brother Mr. M. Vasanthkumar for his support in every way. I wish to thank my sister Mrs. S. Arthi and her husband Mr. K. Suresh for supporting me in every way. I would like to thank my father-in-law Mr. P. Velusamy and mother-in-law Mrs. V. Ponnuthai; without their encouragement and moral support it would have been impossible for me to finish this work.

Finally, I would like to give special thanks to my wife Mrs. Sharmiladevi V. She is always there to cheer me up at difficult times with great patience. Without her love and support it would have been impossible for me to finish this work.

-ANAND KUMAR M

LIST OF FIGURES

Figure 1.1 Morphology based Factored SMT for English to Tamil Language
Figure 1.2 Reordering of English Language
Figure 1.3 Mapping English Word Factors to Tamil Word Factors
Figure 1.4 Thesis Organizations
Figure 3.1 Maximum Margin and Support Vectors
Figure 3.2 Training Errors in Support Vector Machine
Figure 3.3 Non-linear Classifier
Figure 3.4 Classification of POS Tagging Models
Figure 3.5 Two Level Morphology
Figure 3.6 Block Diagram of Direct Approach to Machine Translation
Figure 3.7 The Vauquois Triangle
Figure 3.8 Block Diagram of Transfer Approach
Figure 3.9 Block Diagram of EBMT System
Figure 3.10 Block Diagram of SMT System
Figure 3.11 Rule based Translation System with Post-processing
Figure 3.12 Statistical Machine Translation System with Pre-processing
Figure 4.1 Example of English Syntactic Tree
Figure 4.2 Preprocessing Stages of English Sentence
Figure 4.3 Process of Reordering
Figure 4.4 English Syntactic Tree
Figure 4.5 English to Tamil Alignment
Figure 4.6 Block Diagram for Compounding
Figure 4.7 Integration Process
Figure 5.1 Example of Untagged Corpus
Figure 5.2 Example of Tagged Corpus
Figure 5.3 Untagged Corpus before Pre-editing
Figure 5.4 Untagged Corpus after Pre-editing
Figure 5.5 Training Data Format
Figure 5.6 Implementation of SVMTlearn
Figure 5.7 Example Input
Figure 5.8 Example Output
Figure 5.9 Implementation of SVMTagger
Figure 5.10 Implementation of SVMTeval
Figure 6.1 Role of Morphological Analyzer in NLP
Figure 6.3 General Framework for Morphological Analyzer System
Figure 6.4 Preprocessing Steps
Figure 6.5 Implementation of Noun/Verb Morph Analyzer
Figure 6.6 Structure of Pronoun Word form
Figure 6.7 Implementation of Pronoun Morph Analyzer
Figure 6.8 Implementation of Proper Noun Morph Analyzer
Figure 6.9 Training Data Vs Accuracy
Figure 7.1 The Noisy Channel Model to Machine Translation
Figure 7.2 Block Diagram for Factored Translation
Figure 7.3 Mapping English Factors to Tamil Factors
Figure 8.1 Tamil Sentence Generation
Figure 8.2 Algorithm for Morphological Generator
Figure 8.3 Architecture of Tamil Morphological Generator
Figure 8.4 Pseudo Code for Paradigm Classification
Figure 8.5 Structure of Pronoun Word form
Figure 8.6 Pronoun Morphological Generator
Figure 9.1 BLEU-1 Score for Various Models
Figure 9.2 BLEU-4 Score for Various Models
Figure 9.3 NIST Score for Various Models
Figure 9.4 Google Translation System

LIST OF TABLES

Table 1.1 Factored English Sentences
Table 1.2 Compounded English Sentences
Table 3.1 Tamil Grammar
Table 3.2 Tamil Vowels
Table 3.3 Tamil Compound Letters
Table 3.4 Ambiguity in Morpheme's Position
Table 3.5 An Example to Illustrate the Direct Approach
Table 3.6 An Example for Interlingua Representation
Table 3.7 An Example for Transfer Approach
Table 3.8 Example of English and Tamil Sentences
Table 3.9 Scales of Evaluation
Table 4.1 POS and Lemma of Words
Table 4.2 Reordering Rules
Table 4.3 Original and Reordered Sentences
Table 4.4 Description of Factors in English Word
Table 4.5 Example of English Word Factors
Table 4.6 Factored Representation of English Language Sentence
Table 4.7 Word forms of English
Table 4.8 Content Words of English
Table 4.9 Function Words of English
Table 4.10 English Word Forms based on Tenses
Table 4.11 Tamil Word Forms based on Tenses
Table 4.12 Compounding Rules for English Sentence
Table 4.13 Average Words per Sentence
Table 4.14 Factored English Sentence
Table 4.15 Compounded English Sentence
Table 4.16 Preprocessed English Sentences
Table 5.1 AMRITA POS Tagset
Table 5.2 Tag Count
Table 5.3 Corpus Statistics
Table 5.4 Example of Suitable POS Features for Model
Table 5.5 Example of Suitable POS Features for Model
Table 5.6 Example of Suitable POS Features for Model
Table 5.7 Comparison of Accuracies
Table 5.8 Trials and Error
Table 5.9 Confusion Matrix
Table 6.1 Compound Word-forms Formation
Table 6.2 Simple Verb Finite Forms
Table 6.3 Noun Case Markers
Table 6.4 Minimized POS Tagset
Table 6.5 Number of Paradigms and Inflections
Table 6.6 Noun Paradigms
Table 6.7 Verb Paradigms
Table 6.8 Noun Word Forms
Table 6.9 Verb Word Forms
Table 6.10 Noun Morphemes
Table 6.11 Verb Morphemes
Table 6.12 Verb/Noun Ambiguous Morphemes
Table 6.13 Sample Data Format
Table 6.14 Example of Proper Noun Inflections
Table 6.15 Tagged Vs Untagged Accuracies
Table 6.16 Number of Words and Characters and Level of Efficiencies
Table 6.17 Sentence Level Accuracies
Table 6.18 Preprocessed English and Tamil Sentence
Table 7.1 Factored Parallel Sentences
Table 8.1 Morpho-phonemic Changes
Table 8.2 Simplified POS Tagset
Table 8.3 Verb and Noun Word Forms
Table 8.4 MLI for Tamil Verb
Table 8.5 Look up Table for Paradigm Classification
Table 8.6 Paradigms and inflections
Table 8.7 Suffix Table
Table 8.8 Stemming End Characters
Table 9.1 Details of Baseline Parallel Corpora
Table 9.2 Details of Factored Parallel Corpora
Table 9.3 BLEU and NIST Scores
Table 10.1 Mapping of Major Research Outcome to Publications

LIST OF ABBREVIATIONS

ABBREVIATION   FULL FORM

1PL            First person Plural
1S             First person Singular
2PE            Second person Plural Epicene
2S             Second person Singular
2SE            Second person Singular Epicene
3PE            Third person Plural Epicene
3PN            Third person Plural Neutral
3SE            Third person Singular Epicene
3SF            Third person Singular Feminine
3SM            Third person Singular Masculine
3SN            Third person Singular Neutral
ACC            Accusative
AI             Artificial Intelligence
AU-KBC         Anna University K B Chandrasekhar
BL             Base line
BLEU           Bilingual Evaluation Understudy
CALTS          Centre for Applied Linguistics and Translation Studies
CIIL           Central Institute of Indian Languages
CLIR           Cross-Lingual Information Retrieval
CRF            Conditional Random Fields
CWF            Compressed Word Format
EBMT           Example based Machine Translation
EM             Expectation Maximization
EOS            End of Sentence
FSA            Finite State Automata
FSM            Finite State Machine
FSMT           Factored Statistical Machine Translation
FST            Finite State Transducer
HMM            Hidden Markov Model
IBM            International Business Machines
IE             Information Extraction
IIIT           International Institute of Information Technology
IR             Information Retrieval
KWIC           Key Word in Context
LDC            Linguistic Data Consortium
LSV            Letter Successor Varieties
ManTra         MAchiNe assisted TRAnslation
MBMA           Memory based Morphological Analysis
MEMM           Maximum Entropy Markov Models
MG             Morphological Generator
MIRA           Margin Infused Relaxed Algorithm
ML             Machine Learning
MLI            Morpho-Lexical Information
MT             Machine Translation
NIST           National Institute of Standards and Technology
NLI            Natural Language Interface
NLP            Natural Language Processing
NLU            Natural Language Understanding
PBSMT          Phrase based Statistical Machine Translation
PCFG           Probabilistic Context Free Grammar
PER            Position Independent Word Error Rate
PLIL           Pseudo Lingual for Indian Languages
PN             Proper Noun
PNG            Person-Number-Gender
POS            Part-of-Speech
POST           Part-of-Speech Tagging
QA             Question Answering
RBMT           Rule based Machine Translation
RCILTS         Resource Centre for Indian Language Technology Solutions
SMR            Statistical Machine Reordering
SMT            Statistical Machine Translation
SOV            Subject-Object-Verb
SRILM          SRI Language Modeling Toolkit
SVM            Support Vector Machine
SVO            Subject-Verb-Object
TBL            Transformation based Learning
TDIL           Technology Development for Indian Languages
TER            Translation Edit Rate
TnT            Trigrams'n'Tags
UCSG           Universal Clause Structure Grammar
UN             United Nations
VG             Verb Group
WER            Word Error Rate
WFR            Word Formation Rules
WSJ            Wall Street Journal
WWW            World Wide Web

ABSTRACT

Machine translation is the automatic translation of text from one natural language to another using a computer. In this thesis, a morphology-based Factored Statistical Machine Translation (F-SMT) system is proposed for translating sentences from English to Tamil. Tamil linguistic tools such as a Part-of-Speech Tagger, a Morphological Analyzer and a Morphological Generator are also developed as part of this research work.

Conventionally, rule-based approaches have been employed for developing Machine Translation systems. They use transfer rules between the source language and the target language to produce grammatical translations. The major drawback of this approach is that it always requires the help of a good linguist to improve the rules. For this reason, data-driven approaches such as example-based and statistical systems have recently been getting more attention from the research community. Currently, Statistical Machine Translation (SMT) systems play a major role in developing translation between languages. The main advantage of an SMT system is that it is language independent and disambiguates word senses automatically by using large quantities of parallel corpora. An SMT system treats translation as a machine learning problem: statistical learning methods perform translation based on large amounts of parallel training data. First, non-structural information and statistical parameters are derived from the bilingual corpora. These statistical parameters are then used for translation. A baseline SMT system considers only surface forms and does not use linguistic knowledge of the languages. Its performance is therefore better for similar language pairs than for dissimilar language pairs.

Translating English into morphologically rich languages is a challenging task. Because of the highly rich morphology of Tamil, a simple lexical mapping alone is not enough to retrieve and map all the morphological and syntactic information from English sentences. Tamil word formation is productive: a word form is written as a single unit, without spaces between its morphemes, so every inflected form of a Tamil word is a separate surface word. This leads to the problem of sparse data. It is very difficult to collect or create a parallel corpus that contains all the possible Tamil surface words, because a single Tamil root verb can be inflected into more than ten thousand different forms. Moreover, selecting the correct Tamil word or phrase during translation is a challenging job. The size and quality of the corpus determine the accuracy of the Machine Translation system. The limited availability of English-Tamil parallel corpora and the high inflectional variation of Tamil increase the data sparseness problem for a baseline phrase-based SMT system. When translating from English to Tamil, the baseline SMT system cannot generate Tamil word forms that are not present in the training corpora.

The proposed Machine Translation system is based on factored Statistical Machine Translation models. Words are factored into lemma and inflected forms based on their part of speech. This factorization reduces data sparseness in decoding. Factored translation models allow linguistic information to be integrated into a phrase-based translation model. These linguistic features are treated as separate tokens during the factored training process. A baseline SMT system uses untagged corpora for training, whereas factored SMT uses linguistically factored corpora. The preprocessing phase makes it possible to include language-specific knowledge in the parallel corpus indirectly. In preprocessing, the bilingual corpora are converted into factored bilingual corpora using linguistic tools and reordering rules. Similarly, Tamil sentences are preprocessed using the proposed linguistic tools, namely the POS tagger and the Morphological Analyzer. The factored corpora are then given to the Statistical Machine Translation models for training. Finally, the Tamil Morphological Generator is used to generate a surface word from the output factors.
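The factorization described above can be illustrated with a small sketch. It assumes a pipe-separated token layout of the form surface|lemma|POS, which is the convention used for factored corpora in the Moses toolkit. The English example words, the tag names and the helper functions `to_factored` and `parse_factored` are hypothetical and chosen only for illustration; they are not the corpus or tools of this thesis.

```python
# Toy illustration of factored tokens. The data and helper names are
# assumptions made for this sketch, not the thesis's corpus or tools.

def to_factored(surface, lemma, pos):
    """Join the factors of one word into a single pipe-separated token."""
    return "|".join((surface, lemma, pos))


def parse_factored(token):
    """Split a pipe-separated token back into its named factors."""
    surface, lemma, pos = token.split("|")
    return {"surface": surface, "lemma": lemma, "pos": pos}


# (surface form, lemma, part-of-speech tag) for a few inflected words.
analysed_words = [
    ("plays", "play", "VBZ"),
    ("played", "play", "VBD"),
    ("playing", "play", "VBG"),
    ("books", "book", "NNS"),
    ("book", "book", "NN"),
]

factored_sentence = " ".join(to_factored(*w) for w in analysed_words)
parsed = [parse_factored(tok) for tok in factored_sentence.split()]

# Distinct surface forms versus distinct lemmas in the toy data.
surface_vocab = {p["surface"] for p in parsed}
lemma_vocab = {p["lemma"] for p in parsed}

print(factored_sentence)
print(len(surface_vocab), "surface forms vs", len(lemma_vocab), "lemmas")
```

In this toy data five surface forms share two lemmas. A model that translates lemmas and grammatical factors separately has fewer distinct items to learn than one that treats every inflected form as an unrelated word, which is the intuition behind the reduction in data sparseness.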

CHAPTER 1
INTRODUCTION

1.1 GENERAL

Machine Translation is the automatic translation of text from one natural language to another using a computer. Initial attempts at Machine Translation in the 1950s did not meet with success. Today, internet users need fast automatic translation between languages. Several approaches, such as linguistic-based and interlingua-based systems, have been used to develop machine translation systems, but statistical methods currently dominate the field. The Statistical Machine Translation (SMT) approach draws on knowledge from automata theory, artificial intelligence, data structures and statistics. An SMT system treats translation as a machine learning problem: a learning algorithm is applied to a large parallel corpus, that is, a collection of sentences in one language together with their translations in another. The learning algorithm builds a model from the parallel sentences, and this model is then used to translate unseen sentences. If parallel corpora are available for a language pair, it is easy to build a bilingual SMT system. The accuracy of the system depends heavily on the quality and quantity of the parallel corpus and on the domain. Parallel corpora are the fundamental resource for an SMT system, and they are constantly growing. They are available from government bilingual text books, newspapers, websites and novels.

SMT models give good accuracy particularly for similar languages in specific domains and for languages with large bilingual corpora. If the sentences of a language pair are not structurally similar, the translation patterns are difficult to learn. Huge amounts of parallel data are required to learn such patterns, so statistical methods are difficult to apply to less-resourced languages. To improve translation performance for dissimilar language pairs and less-resourced languages, external preprocessing is required. This preprocessing is performed using linguistic tools.

In an SMT system, statistical methods are used to map source language phrases to target language phrases. The statistical model parameters are estimated from bilingual and monolingual corpora. There are two models in an SMT system: the translation model and the language model. The translation model takes parallel sentences and finds translation hypotheses between phrases. The language model is based on the statistical properties of n-grams and uses the monolingual corpora. Several translation models are available in SMT; the important ones are the phrase-based model, the syntax-based model and the factored model. Phrase Based Statistical Machine Translation (PBSMT) is limited to the mapping of small text chunks. The factored translation model is an extension of the phrase-based model that integrates linguistic information at the word level.

This thesis proposes a preprocessing method that uses linguistic tools in the development of an English to Tamil machine translation system. In this translation system, external linguistic tools are used to add linguistic information to the parallel corpora. The pre- and post-processing methodology proposed in this thesis is applicable to other language pairs too.

1.2 OVERVIEW OF MACHINE TRANSLATION

Machine translation is one of the oldest and most active areas in natural language processing. The word "translation" refers to the transformation of text or speech from one language into another. Machine translation can be defined as the application of computers to the task of translating texts from one natural language to another. It is a focused field of research that draws on the linguistic concepts of syntax, semantics, pragmatics and discourse. Today a number of systems are available for producing translations, though they are not perfect. Whether translation is carried out manually or automated by machines, the text in the target language must convey exactly the context of the text in the source language. Translation is not just word-level replacement. A translator, whether machine or human, must interpret and analyse all the elements in the text. The translator should also be familiar with the issues that arise during the translation process and must know how to handle them. This requires in-depth knowledge of grammar, sentence structure and meaning, as well as an understanding of each language's culture in order to handle idioms and phrases that originate in different cultures. Cross-cultural understanding is an important factor that affects the accuracy of the translation.
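The remark that translation is not just word-level replacement can be made concrete with a short sketch. It assumes a four-word toy lexicon with informally romanized Tamil words, hand-assigned coarse tags, and a crude "move the verb to the end" step standing in for Subject-Verb-Object to Subject-Object-Verb reordering. The names `TOY_LEXICON`, `word_for_word` and `verbs_last` are hypothetical; this is not the reordering method of this thesis.

```python
# Toy lexicon and tags are assumptions for illustration only; the Tamil
# words are informally romanized.
TOY_LEXICON = {
    "he": "avan",
    "read": "padiththaan",
    "a": "oru",
    "book": "puththagam",
}


def word_for_word(words, lexicon):
    """Replace each source word by its dictionary entry, keeping source order."""
    return [lexicon.get(w, w) for w in words]


def verbs_last(tagged_words):
    """Move words tagged 'V' to the end: a crude stand-in for SVO -> SOV."""
    non_verbs = [w for w, tag in tagged_words if tag != "V"]
    verbs = [w for w, tag in tagged_words if tag == "V"]
    return non_verbs + verbs


# English sentence with hand-assigned coarse tags.
tagged = [("he", "PRON"), ("read", "V"), ("a", "DET"), ("book", "N")]

# Plain substitution keeps the English (SVO) word order.
naive = " ".join(word_for_word([w for w, _ in tagged], TOY_LEXICON))

# Reordering the source first gives the verb-final order used in Tamil.
reordered = " ".join(word_for_word(verbs_last(tagged), TOY_LEXICON))

print(naive)      # avan padiththaan oru puththagam
print(reordered)  # avan oru puththagam padiththaan
```

Even in this tiny case, plain substitution leaves the verb in the English position, while the reordered source yields verb-final output. The sketch also hides a second problem: the single lexicon entry for "read" already encodes tense and agreement, which a real system has to choose from context.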

28 It will be a great challenge for humans to design automatic machine translation system. It is difficult for translating sentences by taking into consideration all the required information. Humans need several revisions to make the perfect translation. No two individual human translators can generate identical translations of the same text in the same language pair. Hence it will be a greater challenge for humans to design a fully automated machine translation system to produce high quality translations. 1.3 ROLE OF MACHINE TRANSLATION IN NLP Natural Language Processing (NLP) is the field of computer science devoted to the development of models and technologies enabling computers to use human languages both as input and output [1]. The ultimate goal of NLP is to build computational models that equal human performance in the task of reading, writing, learning, speaking and understanding. Computational models are useful to explore the nature of linguistic communication as well as for enabling effective human-machine interaction. Jurafsky and Martin (2005) [2] describe Natural Language Processing as computational techniques that process spoken and written human language as language. According to the Microsoft researchers, the goal of the Natural Language Processing (NLP) is to design and build software that will analyze, understand and generate languages that humans use naturally, so that eventually one will be able to address their computer like addressing another person. Machine Translation is used for translating texts for assimilation purpose which aids bilingual or cross-lingual communication and also for searching, accessing and understanding foreign language information from databases and web-pages [3]. In the field of information retrieval a lot of research is going on in Cross-Language Information Retrieval (CLIR), i.e. information retrieval systems capable of searching databases in many different languages [4]. 
Construction of robust systems for speech-to-speech translation to facilitate cross-lingual oral communication has been the dream of speech and natural language researchers for decades. Machine translation is an important module in speech translation systems. Currently, computer-assisted learning plays a major role in the academic environment. The use of Machine Translation in language learning has not yet received enough attention because of the poor quality of automatic translation output. Using a good automatic translation system, students can improve their translation and writing skills. Such a system can break the language barriers faced by students and language learners.

1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM

Traditionally, rule based approaches have been used to develop machine translation systems. A rule based approach feeds rules into the machine using appropriate representations, and feeding all linguistic knowledge into a machine is very hard. In this context, the statistical approach to Machine Translation has some attractive qualities that have made it the preferred approach in machine translation research over the past two decades. Statistical translation models learn translation patterns directly from data and generalize them to translate new text.

The SMT approach is largely language-independent, i.e. the models can be applied to any language pair. Systems based on statistical methods generally perform better than traditional rule-based systems. In SMT, implementation and development times are much shorter. SMT can be improved by coupling new models for reordering and decoding. It needs only parallel corpora to learn from in order to generate a translation system. In contrast, a rule based system needs transfer rules which only linguistic experts can write. These rules are entirely dependent on the language pair involved, and defining general transfer rules is not an easy task, especially for languages with different structures [5].

An SMT system can be developed rapidly if an appropriate corpus is available. A Rule Based Machine Translation (RBMT) system incurs a lot of development and customization cost before it reaches the desired quality threshold. Packaged RBMT systems have already been developed, and it is extremely difficult to reprogram their models and equivalences. Above all, RBMT has a much longer development process involving more human resources. An RBMT system is retrained by adding new rules and vocabulary, among other things [5].
Statistical Machine Translation works well for translations in a specific domain when the engine is trained with a bilingual corpus in that domain. An SMT system requires more computing resources in terms of hardware to train the models. Billions of calculations need to take place during the training of the engine, and the computing knowledge required for it is highly specialized. However, training time can be reduced nowadays thanks to the wider availability of more powerful computers. RBMT requires a longer deployment and compilation time by experts so that, in principle, building costs are also higher. SMT generates statistical patterns automatically, including a good learning of exceptions to rules. The rules governing transfer in RBMT systems can certainly be seen as special cases of statistical patterns. Nevertheless, they generalize too much and cannot handle exceptions. Finally, SMT systems can be upgraded with syntactic and even semantic information, like RBMT systems. An SMT engine can generate improved translations if it is retrained or adapted again. In contrast, an RBMT system generates very similar translations after retraining [5].

SMT systems, in general, have trouble handling morphology on the source or the target side, especially for morphologically rich languages. Errors in morphology can have severe consequences for the meaning of the sentence. They change the grammatical function of words, or the interpretation of the sentence through a wrong verb tense. Factored translation models try to solve this issue by explicitly handling morphology on the generation side.

Another advantage of a Statistical Machine Translation system is that it generates a translation that is more natural or closer to the literal translation of the input sentence. Symbolic approaches to machine translation take great human effort in language engineering. In knowledge based machine translation, for example, designers must first find out what kinds of linguistic, general common-sense and domain-specific knowledge are important for a task. Then they have to design an Interlingua representation for the knowledge and write grammars to parse input sentences. Output sentences are generated using the Interlingua representation. All of this requires expertise in language technologies and involves tedious and laborious work.
The major advantage of a Statistical Machine Translation system is its learnability. Once a model is set up, it can learn automatically using well-studied algorithms for parameter estimation. The parallel corpus therefore replaces human expertise for the task. Coverage of the grammar is also one of the serious problems in rule based systems. A Statistical Machine Translation system is a good candidate for addressing these problems: it can learn to have good coverage as long as the training data is representative enough. It can also statistically model the noise in spoken language, so it does not have to make a binary keep/abandon decision and is therefore more robust to noisy data [5].

1.5 MOTIVATION OF THE THESIS

Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. Even though machine translation was envisioned as a computer application in the 1950s, it is still considered to be an open problem [3]. The demand for machine translation is growing rapidly. As multilingualism is considered to be a part of democracy, the European Union funds EuroMatrixPlus [6], a project to build machine translation systems for all European language pairs, in order to automatically translate documents into its 23 official languages, which were previously translated manually. Also, as the United Nations (UN) translates a large number of documents into several languages, it has created bilingual corpora for some language pairs, such as Chinese-English and Arabic-English, which are among the largest bilingual corpora distributed through the Linguistic Data Consortium (LDC). On the World Wide Web, around 20% of web pages and other resources are available in national languages. Machine Translation can be used to translate these web pages and resources into the required language so that their content can be understood, thereby reducing the effect of language as a barrier to communication [7].

In a linguistically diverse country like India, machine translation is an essential technology. Human translation has been widely prevalent in India since ancient times, as is evident from the various works of philosophy, arts, mythology, religion and science which have been translated among ancient and modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century. At present, human translation in India finds application mainly in administration, media and education, and to a lesser extent in business, arts, and science and technology [8].
India has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of India. English is the language most widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages.

In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is done manually, and the use of automation is largely restricted to word processing. Two specific examples of high-volume manual translation are the translation of news from English into local languages, and the translation of annual reports of government departments and public sector units among English, Hindi and the local language. Many resources in English, such as news, weather reports and books, are manually translated into Indian languages. Of these, news and weather reports from all around the world are the ones most often translated from English into Indian languages by human translators. Human translation is slower and costlier than machine translation. It is clear from this that there is a large market for machine translation from English into Indian languages, because machine translation is faster and cheaper than human translation.

Tamil, a Dravidian language, is spoken by around 72 million people and has official status in the state of Tamilnadu and the Indian union territory of Puducherry. Tamil is also an official language of Sri Lanka and Singapore, and is spoken by significant minorities in Malaysia and Mauritius as well as by emigrant communities around the world. It is one of the 22 scheduled languages of India and was declared a classical language by the Government of India in 2004 [9].

In this thesis, a methodology for English to Tamil Statistical Machine Translation is proposed, along with a pre-processing technique. This pre-processing method is used to handle the morphological variance between English and Tamil. Linguistic tools are developed to generate linguistically motivated data for the English-Tamil factored translation model.
1.6 OBJECTIVE OF THE THESIS

The main aim of this research is to develop a morphology based prototype Statistical Machine Translation system for English to Tamil by integrating different linguistic tools. This research also addresses the issue of how a morphologically correct sentence is generated when translating from a morphologically simple language into a morphologically rich language. The objectives of the research are detailed as follows:

Develop a pre-processing module (reordering, factorization and compounding) for English sentences that transforms their structure to be more similar to that of Tamil. The pre-processing module for the source language includes three stages: reordering, factorization and compounding. In the reordering stage, the source language sentence is syntactically reordered according to Tamil syntax. After reordering, the English words are factored into lemma and other morphological features. This is followed by the compounding process, in which the various function words are removed from the reordered sentence and attached as a morphological factor to the corresponding content word.

Develop a Tamil Part-of-Speech (POS) tagger to label the Tamil words in a sentence. The Tamil POS tagger is to be developed using a Support Vector Machine (SVM) based machine learning tool. A POS annotated corpus will be created for training the automatic tagger.

Develop a Morphological Analyser to segment Tamil surface words into linguistic factors. The morphological analyser is to be developed using a machine learning approach. The POS tagger and the morphological analyser are to be used for pre-processing Tamil sentences. Linguistic information from these tools is to be attached to the surface words before SMT training.

Build a morphology based prototype Factored Statistical Machine Translation (F-SMT) system for English to Tamil. After pre-processing, the bilingual sentences are to be created and transformed into factored bilingual sentences. Monolingual corpora for Tamil are collected and factored using the Tamil POS tagger and morphological analyser. These sentences will be used for training the factored statistical machine translation model.

Develop a Tamil Morphological Generator to generate Tamil surface word forms. The morphological generator transforms the translation output into a grammatically correct target language sentence. It is used in the post-processing module of the English to Tamil machine translation system.

1.7 RESEARCH METHODOLOGY

Overall System Architecture

Tamil is a morphologically rich language with a relatively free word order that follows the Subject-Object-Verb (SOV) pattern. English is morphologically simple, with a fixed word order that follows the Subject-Verb-Object (SVO) pattern. A baseline SMT system does not perform well for languages with different word order and disparate morphological structure. To resolve this, factored models were introduced into SMT. The factored model, which is a subtype of SMT, allows multiple levels of representation of a word, from the most specific level to more general levels of analysis such as lemma, part-of-speech and morphological features [10].

Figure 1.1 shows the overall architecture of the proposed English to Tamil SMT system. The pre-processing module is externally attached to the factored SMT system. This module converts bilingual corpora into factored bilingual corpora using morphology based linguistic tools and reordering rules. After pre-processing, the representation of the source language sentence closely follows the sentence structure of the target language. This transformation decreases the complexity of alignment, which is one of the key problems in a baseline SMT system. Parallel corpora are used to train the statistical translation models. The parallel corpora are created and converted into factored parallel corpora using pre-processing. English sentences are factored using the Stanford Parser tool, and Tamil sentences are factored using the Tamil POS tagger and morphological analyser. A monolingual corpus is collected from various newspapers and factored using the Tamil linguistic tools.
This monolingual corpus is used to train the language model. Finally, in post-processing, the Tamil morphological generator is used to generate a surface word from the output factors.
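The factored representation described above can be illustrated with a minimal sketch. It assumes the pipe-separated token layout commonly used by factored phrase-based toolkits (surface|lemma|POS|morphology); the text here does not specify the exact format, and the class name, function name and factor values below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoredToken:
    """One word with its linguistic factors (illustrative factor set)."""
    surface: str
    lemma: str
    pos: str
    morph: str

    def to_factored(self, sep: str = "|") -> str:
        # Join the factors into a single token, e.g. "boys|boy|NNS|plural".
        return sep.join((self.surface, self.lemma, self.pos, self.morph))


def factor_sentence(tokens):
    """Render a list of FactoredToken objects as one factored corpus line."""
    return " ".join(token.to_factored() for token in tokens)


# Hypothetical analysis of a two-word English fragment.
sentence = [
    FactoredToken("boys", "boy", "NNS", "plural"),
    FactoredToken("played", "play", "VBD", "past"),
]
print(factor_sentence(sentence))
```

In a real pipeline the factor values would come from the parser, POS tagger and morphological analyser rather than being written by hand.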

Figure 1.1 Morphology based Factored SMT for English to Tamil language

Details of Pre-processing English Language Sentence

A machine translation system for a language pair with disparate morphological structure needs appropriate pre-processing or modeling before translation. Pre-processing can be performed on the raw source language sentence to make it more suitable for translation into the target language. The pre-processing module for English sentences consists of reordering, factorization and compounding.

Reordering English Language Sentence

Reordering means rearranging the word order of the source language sentence into a word order that is closer to that of the target language sentence. It is an important process for languages which differ in their syntactic structure. The English and Tamil language pair has disparate syntactic structure: English word order is Subject-Verb-Object (SVO), whereas Tamil word order is Subject-Object-Verb (SOV). For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between the subject and the object [11]. English syntactic relations are retrieved from the Stanford Parser tool, and the source language sentence is reordered based on reordering rules.
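As a rough illustration of the SVO to SOV reordering idea, the sketch below moves verb chunks to the end of a clause. It is a toy under stated assumptions and not the thesis's rule set: the chunks are labelled by hand with hypothetical role tags ("S", "V", "O") instead of being derived from parser output, and only a single rule is applied.

```python
def reorder_svo_to_sov(chunks):
    """Move verb chunks to the end of the clause, keeping all other
    chunks in their original relative order.

    chunks: list of (role, words) pairs; the role labels "S", "V" and "O"
    are hypothetical tags assumed to be supplied by an earlier analysis step.
    """
    verbs = [chunk for chunk in chunks if chunk[0] == "V"]
    others = [chunk for chunk in chunks if chunk[0] != "V"]
    return others + verbs


# Hand-labelled English clause in Subject-Verb-Object order.
clause = [("S", "Rama"), ("V", "ate"), ("O", "an apple")]
reordered = reorder_svo_to_sov(clause)
print(" ".join(words for _, words in reordered))
```

A practical system would need many more rules (for example for auxiliaries, prepositions and subordinate clauses) and would operate on the syntactic relations produced by the parser.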

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System Thiruumeni P G, Anand Kumar M Computational Engineering & Networking, Amrita Vishwa Vidyapeetham, Coimbatore,

More information

Improving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing

Improving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing Proc. of Int. Conf. on Advances in Computer Science, AETACS Improving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing Anand Kumar M 1, Dhanalakshmi

More information

BILINGUAL TRANSLATION SYSTEM

BILINGUAL TRANSLATION SYSTEM BILINGUAL TRANSLATION SYSTEM (FOR ENGLISH AND TAMIL) Dr. S. Saraswathi Associate Professor M. Anusiya P. Kanivadhana S. Sathiya Abstract--- The project aims in developing Bilingual Translation System for

More information

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged

More information

Customizing an English-Korean Machine Translation System for Patent Translation *

Customizing an English-Korean Machine Translation System for Patent Translation * Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language

More information

TALENT MANAGEMENT PRACTICES AND ITS IMPACT ON ORGANIZATIONAL PRODUCTIVITY: A STUDY WITH REFERENCE TO IT SECTOR IN BENGALURU

TALENT MANAGEMENT PRACTICES AND ITS IMPACT ON ORGANIZATIONAL PRODUCTIVITY: A STUDY WITH REFERENCE TO IT SECTOR IN BENGALURU TALENT MANAGEMENT PRACTICES AND ITS IMPACT ON ORGANIZATIONAL PRODUCTIVITY: A STUDY WITH REFERENCE TO IT SECTOR IN BENGALURU Thesis submitted to BHARATHIAR UNIVERSITY In partial fulfillment of the requirements

More information

The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems

The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems Dr. Ananthi Sheshasaayee 1, Angela Deepa. V.R 2 1 Research Supervisior, Department of Computer Science & Application,

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

PoS-tagging Italian texts with CORISTagger

PoS-tagging Italian texts with CORISTagger PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy fabio.tamburini@unibo.it Abstract. This paper presents an evolution of CORISTagger [1], an high-performance

More information

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features , pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of

More information

Leveraging ASEAN Economic Community through Language Translation Services

Leveraging ASEAN Economic Community through Language Translation Services Leveraging ASEAN Economic Community through Language Translation Services Hammam Riza Center for Information and Communication Technology Agency for the Assessment and Application of Technology (BPPT)

More information

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words , pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan

More information

Chinese-Japanese Machine Translation Exploiting Chinese Characters

Chinese-Japanese Machine Translation Exploiting Chinese Characters Chinese-Japanese Machine Translation Exploiting Chinese Characters CHENHUI CHU, TOSHIAKI NAKAZAWA, DAISUKE KAWAHARA, and SADAO KUROHASHI, Kyoto University The Chinese and Japanese languages share Chinese

More information

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE A.J.P.M.P. Jayaweera #1, N.G.J. Dias *2 # Virtusa Pvt. Ltd. No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka * Department of Statistics

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

A Machine Translation System Between a Pair of Closely Related Languages

A Machine Translation System Between a Pair of Closely Related Languages A Machine Translation System Between a Pair of Closely Related Languages Kemal Altintas 1,3 1 Dept. of Computer Engineering Bilkent University Ankara, Turkey email:kemal@ics.uci.edu Abstract Machine translation

More information

Hybrid Machine Translation Guided by a Rule Based System

Hybrid Machine Translation Guided by a Rule Based System Hybrid Machine Translation Guided by a Rule Based System Cristina España-Bonet, Gorka Labaka, Arantza Díaz de Ilarraza, Lluís Màrquez Kepa Sarasola Universitat Politècnica de Catalunya University of the

More information

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD. Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.

More information

TRANSLATION OF TELUGU-MARATHI AND VICE- VERSA USING RULE BASED MACHINE TRANSLATION

TRANSLATION OF TELUGU-MARATHI AND VICE- VERSA USING RULE BASED MACHINE TRANSLATION TRANSLATION OF TELUGU-MARATHI AND VICE- VERSA USING RULE BASED MACHINE TRANSLATION Dr. Siddhartha Ghosh 1, Sujata Thamke 2 and Kalyani U.R.S 3 1 Head of the Department of Computer Science & Engineering,

More information

Overview of MT techniques. Malek Boualem (FT)

Overview of MT techniques. Malek Boualem (FT) Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告 SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:

More information

Language and Computation

Language and Computation Language and Computation week 13, Thursday, April 24 Tamás Biró Yale University tamas.biro@yale.edu http://www.birot.hu/courses/2014-lc/ Tamás Biró, Yale U., Language and Computation p. 1 Practical matters

More information

Special Topics in Computer Science

Special Topics in Computer Science Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS

More information

MOVING MACHINE TRANSLATION SYSTEM TO WEB

MOVING MACHINE TRANSLATION SYSTEM TO WEB MOVING MACHINE TRANSLATION SYSTEM TO WEB Abstract GURPREET SINGH JOSAN Dept of IT, RBIEBT, Mohali. Punjab ZIP140104,India josangurpreet@rediffmail.com The paper presents an overview of an online system

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

Hybrid Strategies. for better products and shorter time-to-market

Hybrid Strategies. for better products and shorter time-to-market Hybrid Strategies for better products and shorter time-to-market Background Manufacturer of language technology software & services Spin-off of the research center of Germany/Heidelberg Founded in 1999,

More information

English Grammar Checker

English Grammar Checker International l Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 English Grammar Checker Pratik Ghosalkar 1*, Sarvesh Malagi 2, Vatsal Nagda 3,

More information

Probabilistic topic models for sentiment analysis on the Web

Probabilistic topic models for sentiment analysis on the Web University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the the University of Exeter as

More information

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives Ramona Enache and Adam Slaski Department of Computer Science and Engineering Chalmers University of Technology and

More information

a Chinese-to-Spanish rule-based machine translation

a Chinese-to-Spanish rule-based machine translation Chinese-to-Spanish rule-based machine translation system Jordi Centelles 1 and Marta R. Costa-jussà 2 1 Centre de Tecnologies i Aplicacions del llenguatge i la Parla (TALP), Universitat Politècnica de

More information

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics

More information

Regulation On Attainment of Doctor of Sciences Degree at SEEU (PhD)

Regulation On Attainment of Doctor of Sciences Degree at SEEU (PhD) According to article 118 of the Law on Higher Education of Republic of Macedonia; articles 60, 68 and 69 of SEEU statute ; based on decision of Council of Teaching and Science of SEEU of date April 12th

More information

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR Arati K. Deshpande 1 and Prakash. R. Devale 2 1 Student and 2 Professor & Head, Department of Information Technology, Bharati

More information

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

More information

MODERN WRITTEN ARABIC. Volume I. Hosted for free on livelingua.com

MODERN WRITTEN ARABIC. Volume I. Hosted for free on livelingua.com MODERN WRITTEN ARABIC Volume I Hosted for free on livelingua.com TABLE OF CcmmTs PREFACE. Page iii INTRODUCTICN vi Lesson 1 1 2.6 3 14 4 5 6 22 30.45 7 55 8 61 9 69 10 11 12 13 96 107 118 14 134 15 16

More information
