MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE




MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE

A Thesis Submitted for the Degree of Doctor of Philosophy in the School of Engineering

by

ANAND KUMAR M

CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING
AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM
COIMBATORE-641 112, TAMILNADU, INDIA

April, 2013

AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE-641 112

BONAFIDE CERTIFICATE

This is to certify that the thesis entitled MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE, submitted by Mr. ANAND KUMAR M, Reg. No. CB.EN.D*CEN08002, for the award of the Degree of Doctor of Philosophy in the School of Engineering, is a bonafide record of the work carried out by him under my guidance and supervision at Amrita School of Engineering, Coimbatore.

Thesis Advisor
Dr. K.P. SOMAN
Professor and Head, Center for Excellence in Computational Engineering and Networking

AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE-641 112
CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING

DECLARATION

I, ANAND KUMAR M (Reg. No. CB.EN.D*CEN08002), hereby declare that this thesis entitled MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE is the record of the original work done by me under the guidance of Dr. K.P. SOMAN, Professor and Head, Center for Excellence in Computational Engineering and Networking, Amrita School of Engineering, Coimbatore, and that, to the best of my knowledge, this work has not formed the basis for the award of any degree/diploma/associateship/fellowship or a similar award to any candidate in any University.

Place: Coimbatore
Date:
Signature of the Student

COUNTERSIGNED
Thesis Advisor
Dr. K.P. SOMAN
Professor and Head, Center for Excellence in Computational Engineering and Networking

TABLE OF CONTENTS

ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
ABSTRACT

1 INTRODUCTION
1.1 GENERAL
1.2 OVERVIEW OF MACHINE TRANSLATION
1.3 ROLE OF MACHINE TRANSLATION IN NLP
1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM
1.5 MOTIVATION OF THE THESIS
1.6 OBJECTIVE OF THE THESIS
1.7 RESEARCH METHODOLOGY
1.7.1 Overall System Architecture
1.7.2 Details of Preprocessing English Language Sentence
1.7.2.1 Reordering English Language Sentence
1.7.2.2 Factorization of English Language Sentence
1.7.2.3 Compounding of English Language Sentence
1.7.3 Details of Preprocessing Tamil Language Sentence
1.7.3.1 Tamil Part-of-Speech Tagger
1.7.3.2 Tamil Morphological Analyzer
1.7.4 Factored SMT System for English to Tamil Language
1.7.5 Postprocessing for English to Tamil SMT
1.7.5.1 Tamil Morphological Generator
1.8 RESEARCH CONTRIBUTIONS
1.9 ORGANISATION OF THE THESIS

2 LITERATURE SURVEY
2.1 PART OF SPEECH TAGGER
2.1.1 Part Of Speech Tagger for Indian Languages
2.1.2 Part Of Speech Tagger for Tamil Language
2.2 MORPHOLOGICAL ANALYZER AND GENERATOR
2.2.1 Morphological Analyzer and Generator for Indian Languages
2.2.2 Morphological Analyzer and Generator for Tamil Language
2.3 MACHINE TRANSLATION SYSTEMS
2.3.1 Machine Translation Systems for Indian Languages
2.3.2 Machine Translation Systems for Tamil Language
2.4 ADDING LINGUISTIC INFORMATION FOR SMT SYSTEM
2.5 RELATED NLP WORKS IN TAMIL
2.6 SUMMARY

3 THEORETICAL BACKGROUND
3.1 GENERAL
3.1.1 Tamil Language
3.1.2 Tamil Grammar
3.1.3 Tamil Characters
3.1.4 Morphological Richness of Tamil Language
3.1.5 Challenges in Tamil NLP
3.1.5.1 Ambiguity in Morpheme
3.1.5.2 Ambiguity in Word Class
3.1.5.3 Ambiguity in Word Sense
3.1.5.4 Ambiguity in Sentence
3.2 MORPHOLOGY
3.2.1 Types of Morphology
3.2.2 Lexemes
3.2.3 Lemma and Stems
3.2.4 Inflections and Word forms
3.2.5 Morphemes and Types
3.2.6 Allomorphs
3.2.7 Morpho-Phonemics
3.2.8 Morphotactics
3.3 MACHINE LEARNING FOR NLP
3.3.1 Machine Learning
3.3.2 Support Vector Machines
3.3.3 Geometrical Interpretation of SVM
3.3.4 SVM Formulation
3.4 VARIOUS APPROACHES FOR POS TAGGING
3.4.1 Supervised POS Tagging
3.4.2 Unsupervised POS Tagging
3.4.3 Rule based POS Tagging
3.4.4 Stochastic POS Tagging
3.4.5 Other Techniques
3.5 VARIOUS APPROACHES FOR MORPHOLOGICAL ANALYZER
3.5.1 Two level Morphological Analysis
3.5.2 Unsupervised Morphological Analyser
3.5.3 Memory based Morphological Analysis
3.5.4 Stemmer based Approach
3.5.5 Suffix Stripping based Approach
3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION
3.6.1 Linguistic or Rule based Approaches
3.6.1.1 Direct Approach
3.6.1.2 Interlingua Approach
3.6.1.3 Transfer Approach
3.6.2 Non Linguistic Approaches
3.6.2.1 Dictionary based Approach
3.6.2.2 Empirical or Corpus based Approach
3.6.2.3 Example based Approach
3.6.2.4 Statistical Approach
3.6.3 Hybrid Machine Translation System
3.7 EVALUATING STATISTICAL MACHINE TRANSLATION
3.7.1 Human Evaluation Techniques
3.7.2 Automatic Evaluation Techniques
3.7.2.1 BLEU Score
3.7.2.2 NIST Metric
3.7.2.3 Precision and Recall
3.7.2.4 Edit Distance Measures
3.8 SUMMARY

4 PREPROCESSING FOR ENGLISH LANGUAGE
4.1 MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE
4.1.1 POS and Lemma Information
4.1.2 Syntactic Information
4.1.3 Dependency Information
4.2 DETAILS OF PREPROCESSING ENGLISH SENTENCES
4.2.1 Reordering English Sentences
4.2.1.1 Syntactic Comparison between English and Tamil
4.2.1.2 Reordering Methodology
4.2.2 Factoring English Sentence
4.2.3 Compounding English Language Sentence
4.2.3.1 Morphological Comparison between English and Tamil
4.2.3.2 Compounding Methodology for English Sentence
4.2.4 Integrating Reordering and Compounding
4.3 SUMMARY

5 PART OF SPEECH TAGGER FOR TAMIL LANGUAGE
5.1 GENERAL
5.1.1 Part of Speech Tagging
5.1.2 Tamil POS Tagging
5.2 COMPLEXITY IN TAMIL POS TAGGING
5.2.1 Root Ambiguity
5.2.2 Noun Complexity
5.2.3 Verb Complexity
5.2.4 Adverb Complexity
5.2.5 Postposition Complexity
5.3 PART OF SPEECH TAGSET DEVELOPMENT
5.3.1 Available POS Tagsets for Tamil
5.3.2 AMRITA POS Tagset
5.4 DEVELOPMENT OF TAMIL POS CORPORA FOR PREPROCESSING
5.4.1 Untagged and Tagged Corpus
5.4.2 Available Corpus for Tamil
5.4.3 POS Tagged Corpus Development
5.4.4 Applications of Tagged Corpus
5.4.5 Details of POS Tagged corpus developed
5.5 DEVELOPMENT OF POS TAGGER USING SVMTOOL
5.5.1 SVMTool
5.5.2 Features of SVMTool
5.5.3 Components of SVMTool
5.5.3.1 SVMTlearn
5.5.3.2 SVMTagger
5.5.3.3 SVMTeval
5.6 RESULTS AND COMPARISON WITH OTHER TOOLS
5.7 ERROR ANALYSIS
5.8 SUMMARY

6 MORPHOLOGICAL ANALYZER FOR TAMIL
6.1 GENERAL
6.1.1 Morphology in Language
6.1.2 Computational Morphology
6.1.3 Morphological Analyzer
6.1.4 Role of Morphological Analyzer in NLP
6.2 TAMIL MORPHOLOGY
6.2.1 Tamil Morphology and Language
6.2.2 Syntax of Tamil Morphology
6.2.3 Word Formation Rules (WFR) in Tamil
6.2.4 Tamil Verb Morphology
6.2.5 Tamil Noun Morphology
6.2.6 Tamil Morphological Analyzer
6.2.7 Challenges in Tamil Morphological Analyzer
6.3 TAMIL MORPHOLOGICAL ANALYZER SYSTEM
6.4 TAMIL MORPHOLOGICAL ANALYZER FOR NOUNS AND VERBS
6.4.1 Morphological Analyzer using Machine Learning
6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer
6.4.2.1 Paradigm Classification
6.4.2.2 Word forms
6.4.2.3 Morphemes
6.4.2.4 Data Creation for Noun/Verb Morphological Analyzer
6.4.2.5 Issues in Data Creation
6.4.3 Morphological Tagging Framework using SVMTool
6.4.3.1 Support Vector Machine (SVM)
6.4.3.2 SVMTool
6.4.3.3 Implementation of Morphological Analyzer System
6.5 MORPH ANALYZER FOR PRONOUN USING PATTERNS
6.6 MORPH ANALYZER FOR PROPER NOUN USING SUFFIXES
6.7 RESULTS AND EVALUATION
6.8 PREPROCESSED ENGLISH AND TAMIL SENTENCE
6.9 SUMMARY

7 FACTORED SMT SYSTEM FOR ENGLISH TO TAMIL
7.1 STATISTICAL MACHINE TRANSLATION
7.2 COMPONENTS OF SMT
7.2.1 Translation Model
7.2.1.1 Expectation Maximization
7.2.1.2 Word based Translation Model
7.2.1.3 Phrase based Translation Model
7.2.2 Language Model
7.2.2.1 N-gram Language Models
7.2.3 Statistical Machine Translation Decoder
7.3 INTEGRATING LINGUISTIC INFORMATION IN SMT
7.3.1 Factored Translation Models
7.3.1.1 Decomposition of Factored Translation
7.3.2 Syntax based Translation Models
7.4 TOOLS USED IN SMT SYSTEM
7.4.1 MOSES
7.4.2 GIZA++ & MKCLS
7.4.3 SRILM
7.5 DEVELOPMENT OF FACTORED CORPORA
7.5.1 Parallel Corpora Collection
7.5.2 Monolingual Corpora Collection
7.5.3 Automatic Creation of Factored Corpora
7.6 FACTORED SMT FOR ENGLISH TO TAMIL LANGUAGE
7.6.1 Building Language Model
7.6.2 Building Phrase based Translation Model
7.7 SUMMARY

8 POSTPROCESSING FOR ENGLISH TO TAMIL SMT
8.1 GENERAL
8.2 MORPHOLOGICAL GENERATOR
8.2.1 Challenges in Tamil Morphological Generator
8.2.2 Simplified Part-of-Speech Categories
8.3 MORPHOLOGICAL GENERATOR FOR TAMIL NOUN AND VERB
8.3.1 Algorithm for Noun and Verb Morphological Generator
8.3.2 Word-forms Handled in Morphological Generator
8.3.3 Data Required for the Algorithm
8.3.3.1 Morpho Lexical Information File
8.3.3.2 Paradigm Classification Rules
8.3.3.3 Suffix Table
8.3.3.4 Stemming Rules
8.4 MORPHOLOGICAL GENERATOR FOR TAMIL PRONOUNS
8.5 SUMMARY

9 EXPERIMENTS AND RESULTS
9.1 GENERAL
9.2 EXPERIMENTAL SETUP AND RESULTS
9.3 SUMMARY

10 CONCLUSION AND FUTURE WORK
10.1 SUMMARY
10.2 SUMMARY OF WORK DONE
10.3 CONCLUSIONS
10.4 FUTURE DIRECTIONS

APPENDIX-A
A.1 TAMIL TRANSLITERATION
A.2 DETAILS OF AMRITA POS TAGS

APPENDIX-B
B.1 PENN TREE BANK POS TAGS
B.2 DEPENDENCY TAGS
B.3 TAMIL VERB MLI
B.4 TAMIL NOUN WORD FORM
B.5 TAMIL VERB WORD FORM
B.6 MOSES INSTALLATION AND TRAINING
B.7 COMPARISON WITH GOOGLE OUTPUT
B.8 GRAPHICAL USER INTERFACES

REFERENCES
AUTHOR'S PUBLICATIONS

ACKNOWLEDGEMENT

I would never have been able to finish my dissertation without the guidance, support and encouragement of numerous people, including my mentors, friends, colleagues, family and wife. At the end of my thesis, I would like to thank all those who made this thesis possible and an unforgettable experience for me.

First and foremost, I feel deeply indebted to Her Holiness Most Revered Mata Amritanandamayi Devi (Amma) for her inspiration and guidance throughout my doctoral studies, both in unseen and unconcealed ways. I wholeheartedly thank our respected Pro Chancellor, Swami Abhayamrita Chaitanya, for providing the necessary environment, infrastructure and encouragement for my research at Amrita Vishwa Vidyapeetham University. I thank Dr. P. Venkat Rangan, our respected Vice Chancellor, for his wholehearted encouragement and support throughout my doctoral studies.

I would like to express my sincere gratitude to my supervisor, Dr. K.P. Soman, Professor and Head, Centre for Excellence in Computational Engineering and Networking (CEN), for his excellent guidance and patience, and for providing an excellent atmosphere for doing research. His wide knowledge and logical way of thinking have been a great source of inspiration for me. I am happy and proud to say that I am a student of Dr. K.P. Soman. He has always extended a helping hand in solving research problems. The in-depth discussions, scholarly supervision and constructive suggestions received from him have broadened my knowledge. I strongly believe that without his guidance, the present work could not have reached this stage.

I wish to thank my doctoral committee members, Dr. C.S. Shunmuga Velayutham and Dr. V.P. Mohandass, for their encouraging words and support throughout this research. I express my heartfelt gratitude to Dr. N.S. Pandian, Dean, PG Programmes, Amrita Vishwa Vidyapeetham, Coimbatore, for his continuous support of my Ph.D. study and research.

I wish to thank Dr. S. Rajendran for his supervision, advice and guidance from the very early stage of this research, as well as for giving me extraordinary experiences throughout the work. I express my deepest gratitude to Mrs. V. Dhanalakshmi, Head of the Department of Tamil, SRM University, Chennai. Whatever knowledge I have gained in linguistics is definitely because of her. I also wish to thank my school teacher, Mr. B. Vaithiyanathan, M.Sc., M.Ed., for supporting me since my school days. I would like to thank Mr. Arun Sankar K, a good friend since my graduate days, who is always willing to help and give his best suggestions.

I express my sincere gratitude to my beloved Director, Dr. K.A. Chinnaraju, and Principal, Dr. N. Nagarajan, CIET, for giving me all the moral support to complete the thesis successfully. I would like to express my gratitude to my Head of the Department, Dr. S. Gunasekaran, who always inspires me to complete this thesis work. I would also like to thank Mr. G. Ravi Kumar and Prof. Mrs. Janaki Kumar for their timely support and suggestions. I would like to thank my colleagues at the Department of Computer Science and Engineering, especially Mr. N. Ramkumar, Mr. N. Boopal, Mr. A. Suresh, Mr. M. Yogesh, Mr. C. Prabu and Mr. B. Saravanan, for sharing their enthusiasm and for supporting me from the beginning of my career at CIET.

I wish to express my warm and sincere thanks to Dr. Mrs. M.S. Vijaya, HOD (MCA), GRD Krishnamal College for Women, and Dr. M. Sabarimalai Manikandan, SAMSUNG Electronics, for their kind support and direction, which have been of great value in this study. My sincere thanks also go to Mr. Sivaprathap, Mr. Rakesh Peter, Mr. Loganathan, Mr. Antony P J, Mr. Ajit, Mr. Saravanan, Mr. Kathir, Mr. Senthil, Mr. V. Anand Kumar, Mrs. Latha Menon and Sampath Kumar of the CEN department for supporting me in all ways. I also express my gratitude to my friends Ms. Resmi N.G and Ms. Preeja for their encouragement and guidance.

My research would not have been possible without the help of my friends C. Murugesan, S. Ramakrishnan, S. Mohanraj and A. Baladhandapani. I would like to thank them for being with me in all circumstances. I wish to give special thanks to my friends Mrs. Rekha Kishore, Mr. C. Arun Kumar, Mrs. Padmavathy and Mr. Tirumeni for supporting me in this research.

I would like to thank my Grandpa Mr. M. Narayanasamy and Mr. A. Peter, who left us too soon. I hope that this work will make them proud. I would like to thank my uncle Mr. P.M. Palraj and aunt Mrs. P. Rajeswari for their encouragement and motivation during the difficult moments in the long years of my education. I would also like to express my deepest gratitude to my Grandma Mrs. N. Valliyammal and my uncles Mr. N. Natesapandiyan and Mr. N. Pandiyan for supporting me since my school days.

I want to thank my parents, Mr. N. Madasamy and Mrs. M. Manohari, for their kind support and for the confidence and love they have shown me. You have been my greatest strength and I am blessed to be your son. I would also like to give special thanks to my beloved brother, Mr. M. Vasanthkumar, for his support in all ways. I wish to thank my sister, Mrs. S. Arthi, and her husband, Mr. K. Suresh, for supporting me in all ways. I would like to thank my father-in-law, Mr. P. Velusamy, and mother-in-law, Mrs. V. Ponnuthai, without whose encouragement and moral support it would have been impossible for me to finish this work.

Finally, I would like to give special thanks to my wife, Mrs. Sharmiladevi V. She is always there to cheer me up in difficult times with great patience. Without her love and support it would have been impossible for me to finish this work.

-ANAND KUMAR M

LIST OF FIGURES

Figure 1.1 Morphology based Factored SMT for English to Tamil Language
Figure 1.2 Reordering of English Language
Figure 1.3 Mapping English Word Factors to Tamil Word Factors
Figure 1.4 Thesis Organizations
Figure 3.1 Maximum Margin and Support Vectors
Figure 3.2 Training Errors in Support Vector Machine
Figure 3.3 Non-linear Classifier
Figure 3.4 Classification of POS Tagging Models
Figure 3.5 Two Level Morphology
Figure 3.6 Block Diagram of Direct Approach to Machine Translation
Figure 3.7 The Vauquois Triangle
Figure 3.8 Block Diagram of Transfer Approach
Figure 3.9 Block Diagram of EBMT System
Figure 3.10 Block Diagram of SMT System
Figure 3.11 Rule based Translation System with Post-processing
Figure 3.12 Statistical Machine Translation System with Pre-processing
Figure 4.1 Example of English Syntactic Tree
Figure 4.2 Preprocessing Stages of English Sentence
Figure 4.3 Process of Reordering
Figure 4.4 English Syntactic Tree
Figure 4.5 English to Tamil Alignment
Figure 4.6 Block Diagram for Compounding
Figure 4.7 Integration Process
Figure 5.1 Example of Untagged Corpus
Figure 5.2 Example of Tagged Corpus
Figure 5.3 Untagged Corpus before Pre-editing
Figure 5.4 Untagged Corpus after Pre-editing
Figure 5.5 Training Data Format
Figure 5.6 Implementation of SVMTlearn
Figure 5.7 Example Input
Figure 5.8 Example Output
Figure 5.9 Implementation of SVMTagger
Figure 5.10 Implementation of SVMTeval
Figure 6.1 Role of Morphological Analyzer in NLP
Figure 6.3 General Framework for Morphological Analyzer System
Figure 6.4 Preprocessing Steps
Figure 6.5 Implementation of Noun/Verb Morph Analyzer
Figure 6.6 Structure of Pronoun Word form
Figure 6.7 Implementation of Pronoun Morph Analyzer
Figure 6.8 Implementation of Proper Noun Morph Analyzer
Figure 6.9 Training Data Vs Accuracy
Figure 7.1 The Noisy Channel Model to Machine Translation
Figure 7.2 Block Diagram for Factored Translation
Figure 7.3 Mapping English Factors to Tamil Factors
Figure 8.1 Tamil Sentence Generation
Figure 8.2 Algorithm for Morphological Generator
Figure 8.3 Architecture of Tamil Morphological Generator
Figure 8.4 Pseudo Code for Paradigm Classification
Figure 8.5 Structure of Pronoun Word form
Figure 8.6 Pronoun Morphological Generator
Figure 9.1 BLEU-1 Score for Various Models
Figure 9.2 BLEU-4 Score for Various Models
Figure 9.3 NIST Score for Various Models
Figure 9.4 Google Translation System

LIST OF TABLES

Table 1.1 Factored English Sentences
Table 1.2 Compounded English Sentences
Table 3.1 Tamil Grammar
Table 3.2 Tamil Vowels
Table 3.3 Tamil Compound Letters
Table 3.4 Ambiguity in Morpheme's Position
Table 3.5 An Example to Illustrate the Direct Approach
Table 3.6 An Example for Interlingua Representation
Table 3.7 An Example for Transfer Approach
Table 3.8 Example of English and Tamil Sentences
Table 3.9 Scales of Evaluation
Table 4.1 POS and Lemma of Words
Table 4.2 Reordering Rules
Table 4.3 Original and Reordered Sentences
Table 4.4 Description of Factors in English Word
Table 4.5 Example of English Word Factors
Table 4.6 Factored Representation of English Language Sentence
Table 4.7 Word forms of English
Table 4.8 Content Words of English
Table 4.9 Function Words of English
Table 4.10 English Word Forms based on Tenses
Table 4.11 Tamil Word Forms based on Tenses
Table 4.12 Compounding Rules for English Sentence
Table 4.13 Average Words per Sentence
Table 4.14 Factored English Sentence
Table 4.15 Compounded English Sentence
Table 4.16 Preprocessed English Sentences
Table 5.1 AMRITA POS Tagset
Table 5.2 Tag Count
Table 5.3 Corpus Statistics
Table 5.4 Example of Suitable POS Features for Model 0
Table 5.5 Example of Suitable POS Features for Model 1
Table 5.6 Example of Suitable POS Features for Model 2
Table 5.7 Comparison of Accuracies
Table 5.8 Trials and Error
Table 5.9 Confusion Matrix
Table 6.1 Compound Word-forms Formation
Table 6.2 Simple Verb Finite Forms
Table 6.3 Noun Case Markers
Table 6.4 Minimized POS Tagset
Table 6.5 Number of Paradigms and Inflections
Table 6.6 Noun Paradigms
Table 6.7 Verb Paradigms
Table 6.8 Noun Word Forms
Table 6.9 Verb Word Forms
Table 6.10 Noun Morphemes
Table 6.11 Verb Morphemes
Table 6.12 Verb/Noun Ambiguous Morphemes
Table 6.13 Sample Data Format
Table 6.14 Example of Proper Noun Inflections
Table 6.15 Tagged Vs Untagged Accuracies
Table 6.16 Number of Words and Characters and Level of Efficiencies
Table 6.17 Sentence Level Accuracies
Table 6.18 Preprocessed English and Tamil Sentence
Table 7.1 Factored Parallel Sentences
Table 8.1 Morpho-phonemic Changes
Table 8.2 Simplified POS Tagset
Table 8.3 Verb and Noun Word Forms
Table 8.4 MLI for Tamil Verb
Table 8.5 Look up Table for Paradigm Classification
Table 8.6 Paradigms and inflections
Table 8.7 Suffix Table
Table 8.8 Stemming End Characters
Table 9.1 Details of Baseline Parallel Corpora
Table 9.2 Details of Factored Parallel Corpora
Table 9.3 BLEU and NIST Scores
Table 10.1 Mapping of Major Research Outcome to Publications

LIST OF ABBREVIATIONS

ABBREVIATION  FULL FORM
1PL           First person Plural
1S            First person Singular
2PE           Second person Plural Epicene
2S            Second person Singular
2SE           Second person Singular Epicene
3PE           Third person Plural Epicene
3PN           Third person Plural Neutral
3SE           Third person Singular Epicene
3SF           Third person Singular Feminine
3SM           Third person Singular Masculine
3SN           Third person Singular Neutral
ACC           Accusative
AI            Artificial Intelligence
AU-KBC        Anna University K B Chandrasekhar
BL            Baseline
BLEU          Bilingual Evaluation Understudy
CALTS         Centre for Applied Linguistics and Translation Studies
CIIL          Central Institute of Indian Languages
CLIR          Cross Lingual Information Retrieval
CRF           Conditional Random Fields
CWF           Compressed Word Format
EBMT          Example based Machine Translation
EM            Expectation Maximization
EOS           End of Sentence
FSA           Finite State Automata
FSM           Finite State Machine
FSMT          Factored Statistical Machine Translation
FST           Finite State Transducer
HMM           Hidden Markov Model
IBM           International Business Machines
IE            Information Extraction
IIIT          International Institute of Information Technology
IR            Information Retrieval
KWIC          Key Word in Context
LDC           Linguistic Data Consortium
LSV           Letter Successor Varieties
ManTra        MAchiNe assisted TRAnslation
MBMA          Memory based Morphological Analysis
MEMM          Maximum Entropy Markov Models
MG            Morphological Generator
MIRA          Margin Infused Relaxed Algorithm
ML            Machine Learning
MLI           Morpho-Lexical Information
MT            Machine Translation
NIST          National Institute of Standards and Technology
NLI           Natural Language Interface
NLP           Natural Language Processing
NLU           Natural Language Understanding
PBSMT         Phrase based Statistical Machine Translation
PCFG          Probabilistic Context Free Grammar
PER           Position Independent Word Error Rate
PLIL          Pseudo Lingual for Indian Languages
PN            Proper Noun
PNG           Person-Number-Gender
POS           Part-of-Speech
POST          Part-of-Speech Tagging
QA            Question Answering
RBMT          Rule based Machine Translation
RCILTS        Resource Centre for Indian Language Technology Solutions
SMR           Statistical Machine Reordering
SMT           Statistical Machine Translation
SOV           Subject-Object-Verb
SRILM         Stanford Research Institute for Language Modeling
SVM           Support Vector Machine
SVO           Subject-Verb-Object
TBL           Transformation based Learning
TDIL          Technology Development for Indian Languages
TER           Translation Edit Rate
TnT           Trigrams'n'Tags
UCSG          Universal Clause Structure Grammar
UN            United Nations
VG            Verb Group
WER           Word Error Rate
WFR           Word Formation Rules
WSJ           Wall Street Journal
WWW           World Wide Web

ABSTRACT

Machine translation is the automatic translation of text from one natural language to another using a computer. In this thesis, a morphology based Factored Statistical Machine Translation (F-SMT) system is proposed for translating sentences from English to Tamil. Tamil linguistic tools, namely a Part-of-Speech Tagger, a Morphological Analyzer and a Morphological Generator, are also developed as part of this research work.

Conventionally, rule-based approaches are employed for developing Machine Translation systems. They use transfer rules between the source language and the target language to produce grammatical translations. The major drawback of this approach is that it always requires the help of a good linguist to improve the rules. Hence, data-driven approaches such as example-based and statistical systems have recently been receiving more attention from the research community. Currently, Statistical Machine Translation (SMT) systems play a major role in developing translation between languages. The main advantage of an SMT system is that it is language independent and disambiguates word senses automatically with the use of large quantities of parallel corpora.

An SMT system treats the translation problem as a machine learning problem. Statistical learning methods perform translation based on large amounts of parallel training data. At first, non-structural information and statistical parameters are derived from the bi-lingual corpora. These statistical parameters are then used for translation. The baseline SMT system considers only surface forms and does not use linguistic knowledge of the languages. Therefore its performance is better for similar language pairs than for dissimilar language pairs.

Translating English into morphologically rich languages is a challenging task. Because of the rich morphology of Tamil, a simple lexical mapping alone does not suffice for retrieving and mapping all the morphological and syntactic information from English sentences. Tamil word formation is productive: suffixes are attached to the root without spaces, so every inflected form is a separate surface word. This leads to the problem of sparse data. It is very difficult to collect or create a parallel corpus that contains all possible Tamil surface words, because a single Tamil root verb can be inflected into more than ten thousand different forms. Moreover, selecting the correct Tamil word or phrase during translation is a challenging job. The size and quality of the corpus determine the accuracy of the Machine Translation system. The limited availability of English-Tamil parallel corpora and the high inflectional variation increase the data sparseness problem for the baseline phrase-based SMT system. When translating from English to Tamil, the baseline SMT system cannot generate Tamil word forms that are not present in the training corpora.

The proposed Machine Translation system is based on factored Statistical Machine Translation models. Words are factored into lemma and inflectional features based on their part of speech. This factorization reduces data sparseness in decoding. Factored translation models allow linguistic information to be integrated into a phrase-based translation model. These linguistic features are treated as separate tokens during the factored training process. The baseline SMT system uses untagged corpora for training, whereas factored SMT uses linguistically factored corpora.

The pre-processing phase allows language specific knowledge to be included in the parallel corpus indirectly. In pre-processing, bi-lingual corpora are converted into factored bi-lingual corpora using linguistic tools and reordering rules. Similarly, Tamil sentences are pre-processed using the proposed linguistic tools, the POS tagger and the Morphological Analyzer. These factored corpora are then given to the Statistical Machine Translation models for training. Finally, the Tamil Morphological Generator is used for generating a surface word from the output factors.

CHAPTER 1

INTRODUCTION

1.1 GENERAL

Machine Translation is the automatic translation of text from one natural language to another using a computer. Initial attempts at Machine Translation, made in the 1950s, did not meet with success. Today, internet users need fast automatic translation between languages. Several approaches, such as linguistic based and Interlingua based systems, have been used to develop machine translation systems, but statistical methods currently dominate the field. The Statistical Machine Translation (SMT) approach draws knowledge from automata theory, artificial intelligence, data structures and statistics. An SMT system treats translation as a machine learning problem: a learning algorithm is applied to a large amount of parallel corpora. Parallel corpora are sentences in one language together with their translations in another. Learning algorithms create a model from the parallel sentences, and unseen sentences are translated using this model. If parallel corpora are available for a language pair, it is easy to build a bilingual SMT system. The accuracy of the system is highly dependent on the quality and quantity of the parallel corpus and on the domain. Parallel corpora are the fundamental resource for an SMT system, and they are constantly growing. They are available from government bi-lingual text books, newspapers, websites and novels.

SMT models give good accuracy for some language pairs, particularly for similar languages in specific domains or for languages with large amounts of bi-lingual corpora. If the sentences of a language pair are not structurally similar, the translation patterns are difficult to learn. Huge amounts of parallel corpora are required for learning such patterns, so statistical methods are difficult to use for less resourced languages. To enhance the translation performance for dissimilar language pairs and less resourced languages, external pre-processing is required. This pre-processing is performed using linguistic tools.

In an SMT system, statistical methods are used for mapping source language phrases into target language phrases. Statistical model parameters are estimated from bi-lingual and mono-lingual corpora. There are two models in an SMT system: the translation model and the language model. The translation model takes parallel sentences and finds translation hypotheses between the phrases. The language model is based on the statistical properties of n-grams and is trained on mono-lingual corpora. Several translation models are available for SMT. Important ones are the phrase based model, the syntax based model and the factored model. Phrase Based Statistical Machine Translation (PBSMT) is limited to the mapping of small text chunks. The factored translation model is an extension of the phrase based model that integrates linguistic information at the word level.

This thesis proposes a pre-processing method that uses linguistic tools for the development of an English to Tamil machine translation system. In this translation system, external linguistic tools are used to add linguistic information to the parallel corpora. The pre-processing and post-processing methodology proposed in this thesis is applicable to other language pairs as well.

1.2 OVERVIEW OF MACHINE TRANSLATION

Machine translation is one of the oldest and most active areas in natural language processing. The word translation refers to the transformation of text or speech from one language into another. Machine translation can be defined as the application of computers to the task of translating texts from one natural language to another. It is a focussed field of research involving the linguistic concepts of syntax, semantics, pragmatics and discourse. Today a number of systems are available for producing translations, though they are not perfect. In the process of translation, whether carried out manually or automated through machines, the context of the text in the source language must be conveyed exactly in the target language. Translation is not just word level replacement. A translator, whether machine or human, must interpret and analyse all the elements in the text. The translator should also be familiar with all the issues that arise during the translation process and must know how to handle them. This requires in-depth knowledge of grammar, sentence structure, meanings and so on, as well as an understanding of each language's culture in order to handle idioms and phrases that originate from different cultures. Cross-cultural understanding is an important issue that affects the accuracy of the translation.

Designing an automatic machine translation system is a great challenge. It is difficult to translate sentences while taking all the required information into consideration. Humans need several revisions to produce a perfect translation, and no two human translators generate identical translations of the same text in the same language pair. Hence it is an even greater challenge to design a fully automated machine translation system that produces high quality translations.

1.3 ROLE OF MACHINE TRANSLATION IN NLP

Natural Language Processing (NLP) is the field of computer science devoted to the development of models and technologies enabling computers to use human languages both as input and output [1]. The ultimate goal of NLP is to build computational models that equal human performance in the tasks of reading, writing, learning, speaking and understanding. Computational models are useful for exploring the nature of linguistic communication as well as for enabling effective human-machine interaction. Jurafsky and Martin (2005) [2] describe Natural Language Processing as computational techniques that process spoken and written human language as language. According to Microsoft researchers, the goal of NLP is to design and build software that will analyze, understand and generate the languages that humans use naturally, so that eventually one will be able to address a computer as one addresses another person.

Machine Translation is used for translating texts for assimilation purposes, which aids bilingual or cross-lingual communication, and also for searching, accessing and understanding foreign language information from databases and web pages [3]. In the field of information retrieval, a lot of research is going on in Cross-Language Information Retrieval (CLIR), i.e. information retrieval systems capable of searching databases in many different languages [4].

The construction of robust systems for speech-to-speech translation to facilitate cross-lingual oral communication has been the dream of speech and natural language researchers for decades. Machine translation is an important module in speech translation systems. Currently, computer assisted learning plays a major role in the academic environment. The use of Machine Translation in language learning has not yet received enough attention because of the poor quality of automatic translation output. Using a good automatic translation system, students can improve their translation and writing skills. Such a system can break the language barriers of students and language learners.

1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM

Traditionally, rule based approaches have been used to develop machine translation systems. A rule based approach feeds the rules into the machine using appropriate representations. Feeding all linguistic knowledge into a machine would be very hard. In this context, the statistical approach to Machine Translation has some attractive qualities that have made it the preferred approach in machine translation research over the past two decades. Statistical translation models learn translation patterns directly from data and generalize them to translate new text. The SMT approach is largely language independent, i.e. the models can be applied to any language pair. Systems based on statistical methods generally perform better than traditional rule-based systems. In SMT, implementation and development times are much shorter. An SMT system can be improved by coupling in new models for reordering and decoding. It needs only parallel corpora from which to learn a translation system. In contrast, a rule based system needs transfer rules, which only linguistic experts can write. These rules are entirely dependent on the language pair involved, and defining general transfer rules is not an easy task, especially for languages with different structures [5].

An SMT system can be developed rapidly if an appropriate corpus is available. A Rule Based Machine Translation (RBMT) system incurs a lot of development and customization costs before it reaches the desired quality threshold. Packaged RBMT systems have already been developed, and it is extremely difficult to reprogram their models and equivalences. Above all, RBMT has a much longer development process involving more human resources. An RBMT system is retrained by adding new rules and vocabulary, among other things [5].

Statistical Machine Translation works well for translation in a specific domain when the engine is trained with a bilingual corpus in that domain. An SMT system requires more computing resources, in terms of hardware, to train the models. Billions of calculations need to take place during the training of the engine, and the computing knowledge required for it is highly specialized. However, training time can be reduced nowadays thanks to the wider availability of more powerful computers. RBMT requires longer deployment and compilation time by experts so that, in principle, building costs are also higher. SMT generates statistical patterns automatically, including a good learning of exceptions to rules. The transfer rules of RBMT systems can be seen as special cases of statistical patterns; however, they generalize too much and cannot handle exceptions. Finally, SMT systems can be upgraded with syntactic and even semantic information, like RBMT. An SMT engine can generate improved translations if it is retrained or adapted. In contrast, RBMT generates very similar translations after retraining [5].

SMT systems, in general, have trouble handling morphology on the source or the target side, especially for morphologically rich languages. Errors in morphology can have severe consequences for the meaning of the sentence. They change the grammatical function of words, or the interpretation of the sentence through a wrong verb tense. Factored translation models try to solve this issue by explicitly handling morphology on the generation side. Another advantage of a Statistical Machine Translation system is that it generates translations that read more naturally or stay closer to the literal meaning of the input sentence.

Symbolic approaches to machine translation take great human effort in language engineering. In knowledge based machine translation, for example, designers must first find out what kinds of linguistic, general common-sense and domain-specific knowledge are important for a task. Then they have to design an Interlingua representation for the knowledge and write grammars to parse input sentences. Output sentences are generated using the Interlingua representation. All of this requires expertise in language technologies and involves tedious and laborious work.

The major advantage of a Statistical Machine Translation system is its learnability. Once a model is set up, it can learn automatically with well-studied algorithms for parameter estimation. The parallel corpus therefore replaces human expertise for the task. The coverage of the grammar is also one of the serious problems in rule based systems. A Statistical Machine Translation system is a good candidate for addressing it: it can learn to have good coverage as long as the training data is representative enough. It can statistically model the noise in spoken language, so it does not have to make a binary keep/abandon decision and is therefore more robust to noisy data [5].
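To make the idea of learning parameters from data concrete, the following minimal Python sketch estimates bigram language model probabilities by relative frequency. The toy English corpus, the function name and the sentence boundary markers are illustrative assumptions and are not taken from the thesis system. Practical language models also apply smoothing, which is omitted here.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(w2 | w1) by relative frequency from tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        # Hypothetical boundary markers so that sentence starts and ends are modelled.
        padded = ["<s>"] + tokens + ["</s>"]
        for w1, w2 in zip(padded, padded[1:]):
            history_counts[w1] += 1        # how often w1 occurs as a history
            bigram_counts[(w1, w2)] += 1   # how often w2 follows w1
    return {
        bigram: count / history_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }


# Toy monolingual corpus (assumption, for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram_model(corpus)
print(model[("the", "cat")])  # 0.5: "cat" follows "the" in one of two cases
```

No rule is written by hand in this sketch; adding more sentences to the corpus changes the estimates without any change to the code.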

1.5 MOTIVATION OF THE THESIS

Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. Even though machine translation was envisioned as a computer application in the 1950s, it is still considered to be an open problem [3].

The demand for machine translation is growing rapidly. As multilingualism is considered to be a part of democracy, the European Union funds EuroMatrixPlus [6], a project to build machine translation systems for all European language pairs, so that documents that were previously translated manually can be translated automatically into its 23 official languages. Also, as the United Nations (UN) translates a large number of documents into several languages, the UN has created bilingual corpora for some language pairs, such as Chinese-English and Arabic-English, which are among the largest bilingual corpora distributed through the Linguistic Data Consortium (LDC). On the World Wide Web, around 20% of web pages and other resources are available in national languages. Machine Translation can be used to translate these web pages and resources into the required language in order to understand their content, thereby decreasing the effect of language as a barrier to communication [7].

In a linguistically diverse country like India, machine translation is an essential technology. Human translation has been widely practised in India since ancient times, as is evident from the various works of philosophy, arts, mythology, religion and science that have been translated between ancient and modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century. As of now, human translation in India finds application mainly in administration, media and education, and to a lesser extent in business, arts, and science and technology [8].

India has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of India. English is the language most widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages.

In such a situation, there is a big market for translation between English and the various Indian languages. Currently, the translation is done manually, and the use of automation is largely restricted to word processing. Two specific examples of high volume manual translation are the translation of news from English into local languages and the translation of annual reports of government departments and public sector units among English, Hindi and the local language.

Many resources in English, such as news, weather reports and books, are being manually translated into Indian languages. Of these, news and weather reports from all around the world are the ones most often translated from English into Indian languages by human translators. Human translation is slower and more costly than machine translation. It is clear from this that there is a large market for machine translation from English into Indian languages, since machine translation is faster and cheaper than human translation.

Tamil, a Dravidian language, is spoken by around 72 million people and has official status in the state of Tamilnadu and the Indian union territory of Puducherry. Tamil is also an official language of Sri Lanka and Singapore, and is spoken by significant minorities in Malaysia and Mauritius as well as by emigrant communities around the world. It is one of the 22 scheduled languages of India and was declared a classical language by the Government of India in 2004 [9].

In this thesis, a methodology for English to Tamil Statistical Machine Translation is proposed, along with a pre-processing technique. This pre-processing method is used to handle the morphological variance between English and Tamil. Linguistic tools are developed to generate linguistically motivated data for the English-Tamil factored translation model.

1.6 OBJECTIVE OF THE THESIS

The main aim of this research is to develop a morphology based prototype Statistical Machine Translation system for English to Tamil by integrating different linguistic tools. This research also addresses the issue of how a morphologically correct sentence is generated when translating from a morphologically simple language into a morphologically rich language. The objectives of the research are detailed as follows:

Develop a pre-processing module (Reordering, Factorization and Compounding) for English sentences that transforms their structure to be more similar to that of Tamil. The pre-processing module for the source language includes three stages: reordering, factorization and compounding. In the reordering stage, the source language sentence is syntactically reordered according to Tamil syntax. After reordering, the English words are factored into lemma and other morphological features. This is followed by the compounding process, in which the various function words are removed from the reordered sentence and attached as morphological factors to the corresponding content words.

Develop a Tamil Part-of-Speech (POS) tagger to label the Tamil words in a sentence. The Tamil POS tagger is to be developed using a Support Vector Machine (SVM) based machine learning tool. A POS annotated corpus will be created for training the automatic tagger.

Develop a Morphological Analyser to segment Tamil surface words into linguistic factors. The morphological analyser is to be developed using a machine learning approach. The POS tagger and the morphological analyser are to be used for pre-processing Tamil sentences. Linguistic information from these tools is to be attached to the surface words before SMT training.

Build a morphology based prototype Factored Statistical Machine Translation (F-SMT) system for English to Tamil. After pre-processing, the bi-lingual sentences are to be transformed into factored bi-lingual sentences. Monolingual corpora for Tamil are collected and factored using the Tamil POS tagger and morphological analyser. These sentences will be used for training the factored Statistical Machine Translation model.

Develop a Tamil Morphological Generator to generate Tamil surface word forms. The morphological generator transforms the translation output into a grammatically correct target language sentence. It is used in the post-processing module of the English to Tamil machine translation system.

1.7 RESEARCH METHODOLOGY

1.7.1 Overall System Architecture

Tamil is a morphologically rich language with a relatively free word order that follows the Subject-Object-Verb (SOV) pattern. English is morphologically simple, with a fixed word order that follows the Subject-Verb-Object (SVO) pattern. The baseline SMT system does not perform well for languages with different word orders and disparate morphological structures. To resolve this, factored models are introduced into the SMT system. The factored model, which is a subtype of SMT system, allows multiple levels of representation of a word, from the most specific level to more general levels of analysis such as lemma, part-of-speech and morphological features [10].

Figure 1.1 shows the overall architecture of the proposed English to Tamil SMT system. The pre-processing module is externally attached to the factored SMT system. This module converts bi-lingual corpora into factored bi-lingual corpora using morphology based linguistic tools and reordering rules. After pre-processing, the syntax of the source language sentence closely follows the sentence structure of the target language. This transformation decreases the complexity of alignment, which is one of the key problems in the baseline SMT system.

Parallel corpora are used to train the statistical translation models. Parallel corpora are created and converted into factored parallel corpora by pre-processing. English sentences are factored using the Stanford Parser tool, and Tamil sentences are factored using the Tamil POS tagger and morphological analyser. A monolingual corpus is collected from various newspapers and factored using the Tamil linguistic tools. This mono-lingual corpus is used to build the language model. Finally, in post-processing, the Tamil morphological generator is used for generating a surface word from the output factors.
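The factored representation described above can be pictured with a small Python sketch. It assumes four factors per word (surface form, lemma, part-of-speech and a morphological feature) joined by a `|` separator, a layout used by some factored SMT toolkits. The class name, the factor inventory and the English example tokens are hypothetical and are not the exact format used in this thesis.

```python
from dataclasses import dataclass


@dataclass
class FactoredToken:
    """One word with its linguistic factors (hypothetical factor set)."""
    surface: str
    lemma: str
    pos: str
    morph: str

    def to_factored_string(self, sep="|"):
        # Join the factors from most specific (surface) to more general levels.
        return sep.join([self.surface, self.lemma, self.pos, self.morph])


def factor_sentence(tokens):
    """Render a sentence as space-separated factored tokens."""
    return " ".join(token.to_factored_string() for token in tokens)


# Hand-annotated example; in the proposed system the factors come from linguistic tools.
sentence = [
    FactoredToken("boys", "boy", "NNS", "plural"),
    FactoredToken("played", "play", "VBD", "past"),
]
print(factor_sentence(sentence))
# boys|boy|NNS|plural played|play|VBD|past
```

Because the lemma and the morphological feature are separate factors, a model can learn a translation for "play" once and combine it with different tense or number features, which is the intuition behind reducing data sparseness.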

Figure 1.1 Morphology based Factored SMT for English to Tamil language

1.7.2 Details of Pre-processing English Language Sentence

A Machine Translation system for a language pair with disparate morphological structure needs appropriate pre-processing or modeling before translation. The pre-processing is performed on the raw source language sentence to make it more appropriate for translation into the target language. The pre-processing module for English sentences consists of reordering, factorization and compounding.

1.7.2.1 Reordering English Language Sentence

Reordering means rearranging the word order of the source language sentence into a word order that is closer to that of the target language sentence. It is an important process for languages that differ in their syntactic structure. English and Tamil have disparate syntactic structures. English word order is Subject-Verb-Object (SVO), whereas Tamil word order is Subject-Object-Verb (SOV). For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between the subject and the object [11]. English syntactic relations are retrieved using the Stanford Parser tool. The source language sentence is then reordered based on the reordering rules.
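As a rough illustration of SVO to SOV reordering, the Python sketch below applies a single hand-written rule: verb chunks are moved after all other chunks. It assumes that each chunk has already been labelled with a role (S, V or O); in the proposed system such information comes from the parser output, and the actual reordering rules are not shown in this section. The function name and the example sentence are hypothetical.

```python
def reorder_svo_to_sov(chunks):
    """Move verb chunks to the end, keeping the relative order of the rest.

    `chunks` is a list of (text, role) pairs, where role is "S", "V" or "O".
    This is a toy rule, not the rule set of the thesis.
    """
    non_verbs = [chunk for chunk in chunks if chunk[1] != "V"]
    verbs = [chunk for chunk in chunks if chunk[1] == "V"]
    return non_verbs + verbs


# Hand-labelled English sentence in SVO order (assumption, for illustration only).
english = [("Rama", "S"), ("ate", "V"), ("an apple", "O")]
reordered = reorder_svo_to_sov(english)
print(" ".join(text for text, _ in reordered))
# Rama an apple ate
```

The reordered English sentence now has the verb in final position, so its word order is closer to that of the corresponding Tamil sentence, which is expected to simplify word alignment.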