MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE


A Thesis Submitted for the Degree of Doctor of Philosophy in the School of Engineering
by
ANAND KUMAR M
CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING
AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM
COIMBATORE, TAMILNADU, INDIA
April, 2013

AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE

BONAFIDE CERTIFICATE

This is to certify that the thesis entitled "MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE" submitted by Mr. ANAND KUMAR M, Reg. No. CB.EN.D*CEN08002, for the award of the Degree of Doctor of Philosophy in the School of Engineering is a bonafide record of the work carried out by him under my guidance and supervision at Amrita School of Engineering, Coimbatore.

Thesis Advisor
Dr. K.P. SOMAN
Professor and Head, Center for Excellence in Computational Engineering and Networking.

AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE
CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING

DECLARATION

I, ANAND KUMAR M (Reg. No. CB.EN.D*CEN08002), hereby declare that this thesis entitled "MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE" is the record of the original work done by me under the guidance of Dr. K.P. SOMAN, Professor and Head, Center for Excellence in Computational Engineering and Networking, Amrita School of Engineering, Coimbatore, and to the best of my knowledge this work has not formed the basis for the award of any degree/diploma/associateship/fellowship or a similar award to any candidate in any University.

Place: Coimbatore
Date:
Signature of the Student

COUNTERSIGNED
Thesis Advisor
Dr. K.P. SOMAN
Professor and Head
Center for Excellence in Computational Engineering and Networking

TABLE OF CONTENTS
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
ABSTRACT
1 INTRODUCTION GENERAL OVERVIEW OF MACHINE TRANSLATION ROLE OF MACHINE TRANSLATION IN NLP FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM MOTIVATION OF THE THESIS OBJECTIVE OF THE THESIS RESEARCH METHODOLOGY Overall System Architecture Details of Preprocessing English Language Sentence Reordering English Language Sentence Factorization of English Language Sentence Compounding of English Language Sentence Details of Preprocessing Tamil Language Sentence Tamil Part-of-Speech Tagger Tamil Morphological Analyzer Factored SMT System for English to Tamil Language Postprocessing for English to Tamil SMT Tamil Morphological Generator RESEARCH CONTRIBUTIONS ORGANISATION OF THE THESIS LITERATURE SURVEY PART OF SPEECH TAGGER

2.1.1 Part Of Speech Tagger for Indian Languages Part Of Speech Tagger for Tamil Language MORPHOLOGICAL ANALYZER AND GENERATOR Morphological Analyzer and Generator for Indian Languages Morphological Analyzer and Generator for Tamil Language MACHINE TRANSLATION SYSTEMS Machine Translation Systems for Indian Languages Machine Translation Systems for Tamil Language ADDING LINGUISTIC INFORMATION FOR SMT SYSTEM RELATED NLP WORKS IN TAMIL SUMMARY THEORETICAL BACKGROUND GENERAL Tamil Language Tamil Grammar Tamil Characters Morphological Richness of Tamil Language Challenges in Tamil NLP Ambiguity in Morpheme Ambiguity in Word Class Ambiguity in Word Sense Ambiguity in Sentence MORPHOLOGY Types of Morphology Lexemes Lemma and Stems Inflections and Word forms Morphemes and Types Allomorphs Morpho-Phonemics Morphotactics MACHINE LEARNING FOR NLP

3.3.1 Machine Learning Support Vector Machines Geometrical Interpretation of SVM SVM Formulation VARIOUS APPROACHES FOR POS TAGGING Supervised POS Tagging Unsupervised POS Tagging Rule based POS Tagging Stochastic POS Tagging Other Techniques VARIOUS APPROACHES FOR MORPHOLOGICAL ANALYZER Two level Morphological Analysis Unsupervised Morphological Analyser Memory based Morphological Analysis Stemmer based Approach Suffix Stripping based Approach VARIOUS APPROACHES IN MACHINE TRANSLATION Linguistic or Rule based Approaches Direct Approach Interlingua Approach Transfer Approach Non Linguistic Approaches Dictionary based Approach Empirical or Corpus based Approach Example based Approach Statistical Approach Hybrid Machine Translation System EVALUATING STATISTICAL MACHINE TRANSLATION Human Evaluation Techniques Automatic Evaluation Techniques BLEU Score NIST Metric Precision and Recall

Edit Distance Measures SUMMARY PREPROCESSING FOR ENGLISH LANGUAGE MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE POS and Lemma Information Syntactic Information Dependency Information DETAILS OF PREPROCESSING ENGLISH SENTENCES Reordering English Sentences Syntactic Comparison between English and Tamil Reordering Methodology Factoring English Sentence Compounding English Language Sentence Morphological Comparison between English and Tamil Compounding Methodology for English Sentence Integrating Reordering and Compounding SUMMARY PART OF SPEECH TAGGER FOR TAMIL LANGUAGE GENERAL Part of Speech Tagging Tamil POS Tagging COMPLEXITY IN TAMIL POS TAGGING Root Ambiguity Noun Complexity Verb Complexity Adverb Complexity Postposition Complexity PART OF SPEECH TAGSET DEVELOPMENT Available POS Tagsets for Tamil AMRITA POS Tagset DEVELOPMENT OF TAMIL POS CORPORA FOR PREPROCESSING

5.4.1 Untagged and Tagged Corpus Available Corpus for Tamil POS Tagged Corpus Development Applications of Tagged Corpus Details of POS Tagged corpus developed DEVELOPMENT OF POS TAGGER USING SVMTOOL SVMTool Features of SVMTool Components of SVMTool SVMTlearn SVMTagger SVMTeval RESULTS AND COMPARISON WITH OTHER TOOLS ERROR ANALYSIS SUMMARY MORPHOLOGICAL ANALYZER FOR TAMIL GENERAL Morphology in Language Computational Morphology Morphological Analyzer Role of Morphological Analyzer in NLP TAMIL MORPHOLOGY Tamil Morphology and Language Syntax of Tamil Morphology Word Formation Rules (WFR) in Tamil Tamil Verb Morphology Tamil Noun Morphology Tamil Morphological Analyzer Challenges in Tamil Morphological Analyzer TAMIL MORPHOLOGICAL ANALYZER SYSTEM TAMIL MORPHOLOGICAL ANALYZER FOR NOUNS AND VERBS Morphological Analyzer using Machine Learning

6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer Paradigm Classification Word forms Morphemes Data Creation for Noun/Verb Morphological Analyzer Issues in Data Creation Morphological Tagging Framework using SVMTool Support Vector Machine (SVM) SVMTool Implementation of Morphological Analyzer System MORPH ANALYZER FOR PRONOUN USING PATTERNS MORPH ANALYZER FOR PROPER NOUN USING SUFFIXES RESULTS AND EVALUATION PREPROCESSED ENGLISH AND TAMIL SENTENCE SUMMARY FACTORED SMT SYSTEM FOR ENGLISH TO TAMIL STATISTICAL MACHINE TRANSLATION COMPONENTS OF SMT Translation Model Expectation Maximization Word based Translation Model Phrase based Translation Model Language Model N-gram Language Models Statistical Machine Translation Decoder INTEGRATING LINGUISTIC INFORMATION IN SMT Factored Translation Models Decomposition of Factored Translation Syntax based Translation Models TOOLS USED IN SMT SYSTEM MOSES GIZA++ & MKCLS

7.4.3 SRILM DEVELOPMENT OF FACTORED CORPORA Parallel Corpora Collection Monolingual Corpora Collection Automatic Creation of Factored Corpora FACTORED SMT FOR ENGLISH TO TAMIL LANGUAGE Building Language Model Building Phrase based Translation Model SUMMARY POSTPROCESSING FOR ENGLISH TO TAMIL SMT GENERAL MORPHOLOGICAL GENERATOR Challenges in Tamil Morphological Generator Simplified Part-of-Speech Categories MORPHOLOGICAL GENERATOR FOR TAMIL NOUN AND VERB Algorithm for Noun and Verb Morphological Generator Word-forms Handled in Morphological Generator Data Required for the Algorithm Morpho Lexical Information File Paradigm Classification Rules Suffix Table Stemming Rules MORPHOLOGICAL GENERATOR FOR TAMIL PRONOUNS SUMMARY EXPERIMENTS AND RESULTS GENERAL EXPERIMENTAL SETUP AND RESULTS SUMMARY CONCLUSION AND FUTURE WORK SUMMARY

10.2 SUMMARY OF WORK DONE CONCLUSIONS FUTURE DIRECTIONS APPENDIX-A A.1 TAMIL TRANSLITERATION A.2 DETAILS OF AMRITA POS TAGS APPENDIX-B B.1 PENN TREE BANK POS TAGS B.2 DEPENDENCY TAGS B.3 TAMIL VERB MLI B.4 TAMIL NOUN WORD FORM B.5 TAMIL VERB WORD FORM B.6 MOSES INSTALLATION AND TRAINING B.7 COMPARISON WITH GOOGLE OUTPUT B.8 GRAPHICAL USER INTERFACES REFERENCES AUTHOR'S PUBLICATIONS

ACKNOWLEDGEMENT

I would never have been able to finish my dissertation without the guidance, support and encouragement of numerous people, including my mentors, friends and colleagues, and the support of my family and my wife. At the end of my thesis, I would like to thank all those who made this thesis possible and an unforgettable experience for me.

First and foremost, I feel deeply indebted to Her Holiness Most Revered Mata Amritanandamayi Devi (Amma) for her inspiration and guidance throughout my doctoral studies, in both unseen and unconcealed ways. Wholeheartedly, I thank our respected Pro Chancellor, Swami Abhayamrita Chaitanya, for providing the necessary environment, infrastructure and encouragement for my research at Amrita Vishwa Vidyapeetham University. I thank Dr. P. Venkat Rangan, our respected Vice Chancellor, for his wholehearted encouragement and support throughout my doctoral studies.

I would like to express my sincere gratitude to my supervisor, Dr. K.P. Soman, Professor and Head, Centre for Excellence in Computational Engineering and Networking (CEN), for his excellent guidance and patience, and for providing an excellent atmosphere for doing research. His wide knowledge and logical way of thinking have been a great source of inspiration for me. I am happy and proud to say that I am a student of Dr. K.P. Soman. He has always extended a helping hand in solving research problems. The in-depth discussions, scholarly supervision and constructive suggestions received from him have broadened my knowledge. I strongly believe that without his guidance, the present work could not have reached this stage.

I wish to thank my doctoral committee members, Dr. C.S. Shunmuga Velayutham and Dr. V.P. Mohandass, for their encouraging words and support throughout this research. I express my heartfelt gratitude to Dr. N.S. Pandian, Dean, PG Programmes, Amrita Vishwa Vidyapeetham, Coimbatore, for his continuous support of my Ph.D. study and research.

I wish to thank Dr. S. Rajendran for his supervision, advice and guidance from the very early stage of this research, as well as for giving me extraordinary experiences throughout the work. I express my deepest gratitude to Mrs. V. Dhanalakshmi, Head of the Department-Tamil, SRM University, Chennai. Whatever knowledge I have gained in linguistics is definitely because of her. I also wish to thank my school teacher, Mr. B. Vaithiyanathan M.Sc M.Ed, for supporting me from my school days. I would like to thank Mr. Arun Sankar K, who, as a good friend since my graduate days, has always been willing to help and give his best suggestions.

I express my sincere gratitude to my beloved Director, Dr. K.A. Chinnaraju, and Principal, Dr. N. Nagarajan, CIET, for giving me all the moral support to complete the thesis successfully. I would like to express my gratitude to my Head of the Department, Dr. S. Gunasekaran, who has always inspired me to complete this thesis work. I would also like to thank Mr. G. Ravi Kumar and Prof. Mrs. Janaki Kumar for their timely support and suggestions. I would like to thank my colleagues at the Department of Computer Science and Engineering, especially Mr. N. Ramkumar, Mr. N. Boopal, Mr. A. Suresh, Mr. M. Yogesh, Mr. C. Prabu and Mr. B. Saravanan, for sharing their enthusiasm and for supporting me from the beginning of my career at CIET.

I wish to express my warm and sincere thanks to Dr. Mrs. M.S. Vijaya, HOD (MCA), GRD Krishnamal College for Women, and Dr. M. Sabarimalai Manikandan, SAMSUNG Electronics, for their kind support and direction, which have been of great value in this study. My sincere thanks also go to Mr. Sivaprathap, Mr. Rakesh Peter, Mr. Loganathan, Mr. Antony P J, Mr. Ajit, Mr. Saravanan, Mr. Kathir, Mr. Senthil, Mr. V. Anand Kumar, Mrs. Latha Menon and Sampath Kumar of the CEN department for supporting me in every way. I also express my gratitude to my friends Ms. Resmi N.G and Ms. Preeja for their encouragement and guidance.

My research would not have been possible without the help of my friends C. Murugesan, S. Ramakrishnan, S. Mohanraj and A. Baladhandapani. I would like to thank them for being with me in all circumstances.

I wish to give special thanks to my friends Mrs. Rekha Kishore, Mr. C. Arun Kumar, Mrs. Padmavathy and Mr. Tirumeni for supporting me in this research.

I would like to thank my Grandpa Mr. M. Narayanasamy and Mr. A. Peter, who left us too soon. I hope that this work will make them proud. I would like to thank my uncle Mr. P.M. Palraj and aunt Mrs. P. Rajeswari for their encouragement and motivation during the difficult moments of the long years of my education. I would also like to express my deepest gratitude to my Grandma Mrs. N. Valliyammal and my uncles Mr. N. Natesapandiyan and Mr. N. Pandiyan for supporting me from my school days.

I want to thank my parents, Mr. N. Madasamy and Mrs. M. Manohari, for their kind support, and for the confidence and love they have shown me. You have been my greatest strength and I am blessed to be your son. I would also like to give special thanks to my beloved brother, Mr. M. Vasanthkumar, for his support in every way. I wish to thank my sister, Mrs. S. Arthi, and her husband, Mr. K. Suresh, for supporting me in every way. I would like to thank my father-in-law, Mr. P. Velusamy, and mother-in-law, Mrs. V. Ponnuthai; without their encouragement and moral support it would have been impossible for me to finish this work.

Finally, I would like to give special thanks to my wife, Mrs. Sharmiladevi V. She is always there to cheer me up at difficult times with great patience. Without her love and support it would have been impossible for me to finish this work.

-ANAND KUMAR M

LIST OF FIGURES

Figure 1.1 Morphology based Factored SMT for English to Tamil Language
Figure 1.2 Reordering of English Language
Figure 1.3 Mapping English Word Factors to Tamil Word Factors
Figure 1.4 Thesis Organization
Figure 3.1 Maximum Margin and Support Vectors
Figure 3.2 Training Errors in Support Vector Machine
Figure 3.3 Non-linear Classifier
Figure 3.4 Classification of POS Tagging Models
Figure 3.5 Two Level Morphology
Figure 3.6 Block Diagram of Direct Approach to Machine Translation
Figure 3.7 The Vauquois Triangle
Figure 3.8 Block Diagram of Transfer Approach
Figure 3.9 Block Diagram of EBMT System
Figure 3.10 Block Diagram of SMT System
Figure 3.11 Rule based Translation System with Post-processing
Figure 3.12 Statistical Machine Translation System with Pre-processing
Figure 4.1 Example of English Syntactic Tree
Figure 4.2 Preprocessing Stages of English Sentence
Figure 4.3 Process of Reordering
Figure 4.4 English Syntactic Tree
Figure 4.5 English to Tamil Alignment
Figure 4.6 Block Diagram for Compounding
Figure 4.7 Integration Process
Figure 5.1 Example of Untagged Corpus
Figure 5.2 Example of Tagged Corpus
Figure 5.3 Untagged Corpus before Pre-editing
Figure 5.4 Untagged Corpus after Pre-editing
Figure 5.5 Training Data Format
Figure 5.6 Implementation of SVMTlearn
Figure 5.7 Example Input
Figure 5.8 Example Output
Figure 5.9 Implementation of SVMTagger
Figure 5.10 Implementation of SVMTeval
Figure 6.1 Role of Morphological Analyzer in NLP
Figure 6.3 General Framework for Morphological Analyzer System
Figure 6.4 Preprocessing Steps
Figure 6.5 Implementation of Noun/Verb Morph Analyzer
Figure 6.6 Structure of Pronoun Word form
Figure 6.7 Implementation of Pronoun Morph Analyzer
Figure 6.8 Implementation of Proper Noun Morph Analyzer
Figure 6.9 Training Data Vs Accuracy
Figure 7.1 The Noisy Channel Model to Machine Translation
Figure 7.2 Block Diagram for Factored Translation
Figure 7.3 Mapping English Factors to Tamil Factors
Figure 8.1 Tamil Sentence Generation
Figure 8.2 Algorithm for Morphological Generator
Figure 8.3 Architecture of Tamil Morphological Generator
Figure 8.4 Pseudo Code for Paradigm Classification
Figure 8.5 Structure of Pronoun Word form
Figure 8.6 Pronoun Morphological Generator
Figure 9.1 BLEU-1 Score for Various Models
Figure 9.2 BLEU-4 Score for Various Models
Figure 9.3 NIST Score for Various Models
Figure 9.4 Google Translation System

LIST OF TABLES

Table 1.1 Factored English Sentences
Table 1.2 Compounded English Sentences
Table 3.1 Tamil Grammar
Table 3.2 Tamil Vowels
Table 3.3 Tamil Compound Letters
Table 3.4 Ambiguity in Morpheme's Position
Table 3.5 An Example to Illustrate the Direct Approach
Table 3.6 An Example for Interlingua Representation
Table 3.7 An Example for Transfer Approach
Table 3.8 Example of English and Tamil Sentences
Table 3.9 Scales of Evaluation
Table 4.1 POS and Lemma of Words
Table 4.2 Reordering Rules
Table 4.3 Original and Reordered Sentences
Table 4.4 Description of Factors in English Word
Table 4.5 Example of English Word Factors
Table 4.6 Factored Representation of English Language Sentence
Table 4.7 Word forms of English
Table 4.8 Content Words of English
Table 4.9 Function Words of English
Table 4.10 English Word Forms based on Tenses
Table 4.11 Tamil Word Forms based on Tenses
Table 4.12 Compounding Rules for English Sentence
Table 4.13 Average Words per Sentence
Table 4.14 Factored English Sentence
Table 4.15 Compounded English Sentence
Table 4.16 Preprocessed English Sentences
Table 5.1 AMRITA POS Tagset
Table 5.2 Tag Count
Table 5.3 Corpus Statistics
Table 5.4 Example of Suitable POS Features for Model
Table 5.5 Example of Suitable POS Features for Model
Table 5.6 Example of Suitable POS Features for Model
Table 5.7 Comparison of Accuracies
Table 5.8 Trials and Error
Table 5.9 Confusion Matrix
Table 6.1 Compound Word-forms Formation
Table 6.2 Simple Verb Finite Forms
Table 6.3 Noun Case Markers
Table 6.4 Minimized POS Tagset
Table 6.5 Number of Paradigms and Inflections
Table 6.6 Noun Paradigms
Table 6.7 Verb Paradigms
Table 6.8 Noun Word Forms
Table 6.9 Verb Word Forms
Table 6.10 Noun Morphemes
Table 6.11 Verb Morphemes
Table 6.12 Verb/Noun Ambiguous Morphemes
Table 6.13 Sample Data Format
Table 6.14 Example of Proper Noun Inflections
Table 6.15 Tagged Vs Untagged Accuracies
Table 6.16 Number of Words and Characters and Level of Efficiencies
Table 6.17 Sentence Level Accuracies
Table 6.18 Preprocessed English and Tamil Sentence
Table 7.1 Factored Parallel Sentences
Table 8.1 Morpho-phonemic Changes
Table 8.2 Simplified POS Tagset
Table 8.3 Verb and Noun Word Forms
Table 8.4 MLI for Tamil Verb
Table 8.5 Look up Table for Paradigm Classification
Table 8.6 Paradigms and inflections
Table 8.7 Suffix Table
Table 8.8 Stemming End Characters
Table 9.1 Details of Baseline Parallel Corpora
Table 9.2 Details of Factored Parallel Corpora
Table 9.3 BLEU and NIST Scores
Table 10.1 Mapping of Major Research Outcome to Publications

LIST OF ABBREVIATIONS

ABBREVIATION  FULL FORM
1PL           First person Plural
1S            First person Singular
2PE           Second person Plural Epicene
2S            Second person Singular
2SE           Second person Singular Epicene
3PE           Third person Plural Epicene
3PN           Third person Plural Neutral
3SE           Third person Singular Epicene
3SF           Third person Singular Feminine
3SM           Third person Singular Masculine
3SN           Third person Singular Neutral
ACC           Accusative
AI            Artificial Intelligence
AU-KBC        Anna University K B Chandrasekhar
BL            Base line
BLEU          Bilingual Evaluation Understudy
CALTS         Centre for Applied Linguistics and Translation Studies
CIIL          Central Institute of Indian Languages
CLIR          Cross Lingual Information Retrieval
CRF           Conditional Random Fields
CWF           Compressed Word Format
EBMT          Example based Machine Translation
EM            Expectation Maximization
EOS           End of Sentence
FSA           Finite State Automata
FSM           Finite State Machine
FSMT          Factored Statistical Machine Translation
FST           Finite State Transducer
HMM           Hidden Markov Model
IBM           International Business Machines
IE            Information Extraction
IIIT          International Institute of Information Technology
IR            Information Retrieval
KWIC          Key Word in Context
LDC           Linguistic Data Consortium
LSV           Letter Successor Varieties
ManTra        MAchiNe assisted TRAnslation
MBMA          Memory based Morphological Analysis
MEMM          Maximum Entropy Markov Models
MG            Morphological Generator
MIRA          Margin Infused Relaxed Algorithm
ML            Machine Learning
MLI           Morpho-Lexical Information
MT            Machine Translation
NIST          National Institute of Standards and Technology
NLI           Natural Language Interface
NLP           Natural Language Processing
NLU           Natural Language Understanding
PBSMT         Phrase based Statistical Machine Translation
PCFG          Probabilistic Context Free Grammar
PER           Position Independent Word Error Rate
PLIL          Pseudo Lingual for Indian Languages
PN            Proper Noun
PNG           Person-Number-Gender
POS           Part-of-Speech
POST          Part-of-Speech Tagging
QA            Question Answering
RBMT          Rule based Machine Translation
RCILTS        Resource Centre for Indian Language Technology Solutions
SMR           Statistical Machine Reordering
SMT           Statistical Machine Translation
SOV           Subject-Object-Verb
SRILM         SRI Language Modeling Toolkit
SVM           Support Vector Machine
SVO           Subject-Verb-Object
TBL           Transformation based Learning
TDIL          Technology Development for Indian Languages
TER           Translation Edit Rate
TnT           Trigrams'n'Tags
UCSG          Universal Clause Structure Grammar
UN            United Nations
VG            Verb Group
WER           Word Error Rate
WFR           Word Formation Rules
WSJ           Wall Street Journal
WWW           World Wide Web

ABSTRACT

Machine translation is the automatic translation of text from one natural language to another using a computer. In this thesis, a morphology based Factored Statistical Machine Translation (F-SMT) system is proposed for translating sentences from English to Tamil. Tamil linguistic tools such as a Part-of-Speech Tagger, a Morphological Analyzer and a Morphological Generator are also developed as part of this research work.

Conventionally, rule-based approaches are employed for developing Machine Translation systems. They use transfer rules between the source language and the target language to produce grammatical translations. The major drawback of this approach is that it always requires the help of a good linguist to improve the rules. For this reason, data-driven approaches such as example-based and statistical systems have recently been getting more attention from the research community. Currently, Statistical Machine Translation (SMT) systems play a major role in developing translation between languages. The main advantage of an SMT system is that it is language independent and disambiguates word senses automatically by using large quantities of parallel corpora.

An SMT system treats the translation problem as a machine learning problem. Statistical learning methods perform translation based on large amounts of parallel training data. First, non-structural information and statistical parameters are derived from the bi-lingual corpora. These statistical parameters are then used for translation. A baseline SMT system considers only surface forms and does not use linguistic knowledge of the languages. Its performance is therefore better for similar language pairs than for dissimilar ones.

Translating English into morphologically rich languages is a challenging task. Because of the highly rich morphological nature of Tamil, a simple lexical mapping alone cannot retrieve and map all the morphological and syntactic information from English sentences. Tamil word formation is productive: suffixes are attached to a stem without spaces, so every inflected form is a separate surface word. This leads to the problem of sparse data. It is very difficult to collect or create a parallel corpus that contains all the possible Tamil surface words, because a single Tamil root verb is inflected into more than ten thousand different forms. Moreover, selecting the correct Tamil word or phrase during translation is a challenging job. The size and quality of the corpus decide the accuracy of the Machine Translation system. The limited availability of English-Tamil parallel corpora and the high inflectional variation increase the data sparseness problem for a baseline phrase-based SMT system. While translating from English to Tamil, the baseline SMT system cannot generate Tamil word forms that are not present in the training corpora.

The proposed Machine Translation system is based on factored Statistical Machine Translation models. The words are factored into lemma and inflected forms based on their part of speech. This factorization reduces data sparseness in decoding. Factored translation models allow linguistic information to be integrated into a phrase-based translation model. These linguistic features are treated as separate tokens during the factored training process. A baseline SMT system uses untagged corpora for training, whereas factored SMT uses linguistically factored corpora. The pre-processing phase allows language specific knowledge to be included in the parallel corpus indirectly. In pre-processing, the English side of the bi-lingual corpora is converted into factored form using linguistic tools and reordering rules. Similarly, Tamil sentences are pre-processed using the proposed linguistic tools, namely the POS tagger and the Morphological Analyzer. These factored corpora are then given to the Statistical Machine Translation models for training. Finally, the Tamil Morphological Generator is used to generate a surface word from the output factors.
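The factorization described in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: the `FactoredToken` class, the choice and order of factors (surface, lemma, POS, morphology), the morphology labels and the hand-annotated English toy data are all hypothetical, and only the pipe-separated layout follows the general convention used for factored corpora in Moses. The sketch also shows why factoring helps with sparse data: three surface forms collapse to a single lemma.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoredToken:
    surface: str  # word as it appears in the text
    lemma: str    # dictionary form
    pos: str      # part-of-speech tag
    morph: str    # morphological label (hypothetical tag names)

    def to_factored(self) -> str:
        # Factors of one word are joined with "|".
        return "|".join((self.surface, self.lemma, self.pos, self.morph))


def factor_sentence(tokens):
    """Render a list of tokens as one line of a factored corpus."""
    return " ".join(t.to_factored() for t in tokens)


def vocabulary_sizes(tokens):
    """Return (distinct surface forms, distinct lemmas)."""
    return len({t.surface for t in tokens}), len({t.lemma for t in tokens})


# Hand-annotated toy data; a real pipeline would obtain these factors
# from a POS tagger and a morphological analyzer.
TOKENS = [
    FactoredToken("walks", "walk", "VBZ", "PRES"),
    FactoredToken("walked", "walk", "VBD", "PAST"),
    FactoredToken("walking", "walk", "VBG", "PROG"),
]

if __name__ == "__main__":
    print(factor_sentence(TOKENS))
    print(vocabulary_sizes(TOKENS))
```

Counting lemmas instead of surface forms is a crude stand-in for the effect on the translation model: the fewer distinct units the model has to observe, the less parallel data is needed to see each of them.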

CHAPTER 1
INTRODUCTION

1.1 GENERAL

Machine Translation is the automatic translation of text from one natural language to another using a computer. Initial attempts at Machine Translation, made in the 1950s, didn't meet with success. Today, internet users need fast automatic translation between languages. Several approaches, such as linguistic based and interlingua based systems, have been used to develop machine translation systems, but currently statistical methods dominate the field. The Statistical Machine Translation (SMT) approach draws knowledge from automata theory, artificial intelligence, data structures and statistics. An SMT system treats translation as a machine learning problem: a learning algorithm is applied to a large amount of parallel corpora. Parallel corpora are sentences in one language along with their translations in another. The learning algorithm creates a model from the parallel sentences, and this model is then used to translate unseen sentences. If parallel corpora are available for a language pair, it is easy to build a bilingual SMT system. The accuracy of the system depends heavily on the quality and quantity of the parallel corpus and on the domain. Parallel corpora, which are constantly growing, are the fundamental resource for an SMT system. They are available from the government's bi-lingual text books, newspapers, websites and novels.

SMT models give good accuracy particularly for similar languages in specific domains and for languages that have large bi-lingual corpora. If the sentences of a language pair are not structurally similar, the translation patterns are difficult to learn. Huge amounts of parallel corpora are required for learning the patterns, so statistical methods are difficult to use for less resourced languages. To enhance the translation performance for dissimilar language pairs and less resourced languages, external preprocessing is required.
This preprocessing is performed using linguistic tools.

In an SMT system, statistical methods are used to map source language phrases into target language phrases. The statistical model parameters are estimated from bi-lingual and mono-lingual corpora. There are two models in an SMT system: the translation model and the language model. The translation model takes parallel sentences and finds translation hypotheses between phrases. The language model is based on the statistical properties of n-grams and uses monolingual corpora. Several translation models are available for SMT; some important ones are the phrase based model, the syntax based model and the factored model. Phrase Based Statistical Machine Translation (PBSMT) is limited to the mapping of small text chunks. The factored translation model is an extension of the phrase based model that integrates linguistic information at the word level.

This thesis proposes a pre-processing method that uses linguistic tools for the development of an English to Tamil machine translation system. In this translation system, external linguistic tools are used to add linguistic information to the parallel corpora. The pre- and post-processing methodology proposed in this thesis is applicable to other language pairs too.

1.2 OVERVIEW OF MACHINE TRANSLATION

Machine translation is one of the oldest and most active areas of natural language processing. The word translation refers to the transformation of text or speech from one language into another. Machine translation can be defined as the application of computers to the task of translating texts from one natural language to another. It is a focussed field of research that involves the linguistic concepts of syntax, semantics, pragmatics and discourse. Today a number of systems are available for producing translations, though they are not perfect. Whether translation is carried out manually or automated through machines, the translated text must convey in the target language exactly the context of the text in the source language. Translation is not just word level replacement. A translator, either a machine or a human, must interpret and analyse all the elements in the text.
The human or machine translator should also be familiar with all the issues that arise during the translation process and must know how to handle them. This requires in-depth knowledge of grammar, sentence structure, meanings, etc., and also an understanding of each language's culture in order to handle idioms and phrases originating from different cultures. Cross-cultural understanding is an important issue that affects the accuracy of the translation.

Designing an automatic machine translation system is a great challenge. It is difficult to translate sentences while taking all the required information into consideration. Humans need several revisions to arrive at a perfect translation, and no two human translators generate identical translations of the same text for the same language pair. Hence it is an even greater challenge to design a fully automated machine translation system that produces high quality translations.

1.3 ROLE OF MACHINE TRANSLATION IN NLP

Natural Language Processing (NLP) is the field of computer science devoted to the development of models and technologies enabling computers to use human languages both as input and output [1]. The ultimate goal of NLP is to build computational models that equal human performance in the tasks of reading, writing, learning, speaking and understanding. Computational models are useful for exploring the nature of linguistic communication as well as for enabling effective human-machine interaction. Jurafsky and Martin (2005) [2] describe Natural Language Processing as computational techniques that process spoken and written human language as language. According to Microsoft researchers, the goal of NLP is to design and build software that will analyze, understand and generate the languages that humans use naturally, so that eventually one will be able to address a computer as one would address another person.

Machine Translation is used for translating texts for assimilation purposes, which aids bilingual or cross-lingual communication, and also for searching, accessing and understanding foreign language information from databases and web pages [3]. In the field of information retrieval, a lot of research is going on in Cross-Language Information Retrieval (CLIR), i.e. information retrieval systems capable of searching databases in many different languages [4].
Construction of robust systems for speech-to-speech translation to facilitate cross-lingual oral communication has been the dream of speech and natural language researchers for decades. Machine translation is an important module in speech translation systems. Currently, computer assisted learning plays a major role in the academic environment. The use of Machine Translation in language learning has not yet received enough attention because of the poor quality of automatic translation output. Using a good automatic translation system, students can improve their translation and writing skills. Such a system can break the language barriers faced by students and language learners.

1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM

Traditionally, rule based approaches have been used to develop machine translation systems. A rule based approach feeds rules into the machine using appropriate representations. Feeding all linguistic knowledge into a machine would be very hard. In this context, the statistical approach to Machine Translation has some attractive qualities that have made it the preferred approach in machine translation research over the past two decades. Statistical translation models learn translation patterns directly from data and generalize them to translate new text. The SMT approach is largely language-independent, i.e. the models can be applied to any language pair. Systems based on statistical methods are much better than traditional rule-based systems. In SMT, implementation and development times are much shorter. SMT can be improved by coupling in new models for reordering and decoding. It needs only parallel corpora from which to learn a translation system. In contrast, a rule based system needs transfer rules which only linguistic experts can write. These rules are entirely dependent on the language pair involved, and defining general transfer rules is not an easy task, especially for languages with different structures [5].

An SMT system can be developed rapidly if an appropriate corpus is available. A Rule Based Machine Translation (RBMT) system incurs a lot of development and customization cost before it reaches the desired quality threshold. Packaged RBMT systems have already been developed, and it is extremely difficult to reprogram their models and equivalences. Above all, RBMT has a much longer development process involving more human resources. An RBMT system is retrained by adding new rules and vocabulary, among other things [5].
Statistical Machine Translation works well for translations in a specific domain when the engine is trained with a bilingual corpus in that domain. An SMT system requires more computing resources in terms of hardware to train the models. Billions of calculations need to take place during the training of the engine, and the computing knowledge required for it is highly specialized. However, training time can be reduced nowadays thanks to the wider availability of more powerful computers. RBMT requires a longer deployment and compilation time by experts, so that, in principle, building costs are also higher. SMT generates statistical patterns automatically, including good learning of exceptions to rules. The transfer rules of RBMT systems can certainly be seen as special cases of statistical patterns. Nevertheless, they generalize too much and cannot handle exceptions. Finally, SMT systems can be upgraded with syntactic and even semantic information, like RBMT systems. An SMT engine can generate improved translations if it is retrained or adapted again. In contrast, an RBMT system generates very similar translations after retraining [5].

SMT systems, in general, have trouble handling morphology on the source or the target side, especially for morphologically rich languages. Errors in morphology can have severe consequences for the meaning of the sentence. They change the grammatical function of words, or the interpretation of the sentence through a wrong verb tense. Factored translation models try to solve this issue by explicitly handling morphology on the generation side.

Another advantage of a Statistical Machine Translation system is that it generates a translation that is more natural, or closer to a literal translation of the input sentence. Symbolic approaches to machine translation take great human effort in language engineering. In knowledge based machine translation, for example, designers must first find out what kinds of linguistic, general common-sense and domain-specific knowledge are important for a task. Then they have to design an Interlingua representation for the knowledge and write grammars to parse input sentences. Output sentences are generated from the Interlingua representation. All of this requires expertise in language technologies and involves tedious and laborious work.
The major advantage of a Statistical Machine Translation system is its learnability. Once a model is set up, it can learn automatically using well-studied algorithms for parameter estimation. The parallel corpus therefore replaces human expertise for the task. Coverage of the grammar is also one of the serious problems in a rule based system. A Statistical Machine Translation system is a good candidate that meets these criteria. It can learn to have good coverage as long as the training data is representative enough. It can statistically model the noise in spoken language, so it does not have to make a binary keep/abandon decision and is therefore more robust to noisy data [5].

1.5 MOTIVATION OF THE THESIS

Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. Even though machine translation was envisioned as a computer application in the 1950s, it is still considered to be an open problem [3]. The demand for machine translation is growing rapidly. As multilingualism is considered to be a part of democracy, the European Union funds EuroMatrixPlus [6], a project to build machine translation systems for all European language pairs, in order to automatically translate documents into its 23 official languages, which were previously being translated manually. Also, as the United Nations (UN) translates a large number of documents into several languages, it has created bilingual corpora for some language pairs such as Chinese-English and Arabic-English, which are among the largest bilingual corpora distributed through the Linguistic Data Consortium (LDC). On the World Wide Web, around 20% of web pages and other resources are available in national languages, and machine translation can be used to translate these pages and resources into the required language so that their content can be understood, thereby decreasing the effect of language as a barrier to communication [7].

In a linguistically diverse country like India, machine translation is an essential technology. Human translation has been widely prevalent in India since ancient times, as is evident from the various works of philosophy, arts, mythology, religion and science that have been translated among ancient and modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century. At present, human translation in India finds application mainly in administration, media and education, and to a lesser extent in business, arts, and science and technology [8].
India has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of India. English is the language most widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages.

In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is done manually, and the use of automation is largely restricted to word processing. Two specific examples of high volume manual translation are the translation of news from English into local languages and the translation of annual reports of government departments and public sector units among English, Hindi and the local language. Many resources in English, such as news, weather reports and books, are being manually translated into Indian languages. Of these, news and weather reports from all around the world are the ones most often translated from English into Indian languages by human translators. Human translation is slower and costlier than machine translation. It is clear from this that there is a large market for machine translation from English into Indian languages, and speed and cost are the reasons for choosing machine translation over human translation.

Tamil, a Dravidian language, is spoken by around 72 million people and has official status in the state of Tamilnadu and the Indian union territory of Puducherry. Tamil is also an official language of Sri Lanka and Singapore, and is spoken by significant minorities in Malaysia and Mauritius as well as by emigrant communities around the world. It is one of the 22 scheduled languages of India and was declared a classical language by the Government of India in 2004 [9].

In this thesis, a methodology for English to Tamil Statistical Machine Translation is proposed, along with a pre-processing technique. This pre-processing method is used to handle the morphological variance between English and Tamil. Linguistic tools are developed to generate linguistically motivated data for the factored translation model for English-Tamil.
1.6 OBJECTIVE OF THE THESIS

The main aim of this research is to develop a morphology based prototype Statistical Machine Translation system for English to Tamil by integrating different linguistic tools. This research also addresses the issue of how a morphologically correct sentence is generated when translating from a morphologically simple language into a morphologically rich language. The objectives of the research are detailed as follows:

Develop a pre-processing module (Reordering, Compounding and Factorization) for English language sentences that transforms their structure to be more similar to that of Tamil. The pre-processing module for the source language includes three stages: reordering, factorization and compounding. In the reordering stage, the source language sentence is syntactically reordered according to Tamil syntax. After reordering, the English words are factored into lemma and other morphological features. This is followed by the compounding process, in which the various function words are removed from the reordered sentence and attached as a morphological factor to the corresponding content word.

Develop a Tamil Part-of-Speech (POS) tagger to label the Tamil words in a sentence. The Tamil POS tagger is to be developed using a Support Vector Machine (SVM) based machine learning tool. A POS annotated corpus will be created for training the automatic tagger.

Develop a Morphological Analyser to segment a Tamil surface word into linguistic factors. The morphological analyser is to be developed using a machine learning approach. The POS tagger and morphological analyser are to be used for pre-processing the Tamil language sentence. Linguistic information from these tools is to be attached to the surface words before SMT training.

Build a Morphology based prototype Factored Statistical Machine Translation (F-SMT) system for English to Tamil. After pre-processing, the bilingual sentences are to be created and transformed into factored bilingual sentences. Monolingual corpora for Tamil are collected and factored using the Tamil POS tagger and morphological analyser. These sentences will be used for training the factored statistical machine translation model.

Develop a Tamil Morphological Generator to generate Tamil surface word forms. The morphological generator transforms the translation output into a grammatically correct target language sentence. It is used in the post-processing module of the English to Tamil machine translation system.

1.7 RESEARCH METHODOLOGY

Overall System Architecture

Tamil is a morphologically rich language with a free word order based on the Subject-Object-Verb (SOV) pattern. English is morphologically simple, with a fixed word order following the Subject-Verb-Object (SVO) pattern. A baseline SMT system does not perform well for languages with different word orders and disparate morphological structures. To resolve this, factored models were introduced in SMT. The factored model, which is a subtype of SMT system, allows multiple levels of representation of a word, from the most specific level to more general levels of analysis such as lemma, part-of-speech and morphological features [10]. Figure 1.1 shows the overall architecture of the proposed English to Tamil SMT system. The pre-processing module is externally attached to the factored SMT system. This module converts bilingual corpora into factored bilingual corpora using morphology based linguistic tools and reordering rules. After pre-processing, the syntax of the source language sentence closely follows the sentence structure of the target language. This transformation decreases the complexity of alignment, which is one of the key problems in a baseline SMT system. Parallel corpora are used to train the statistical translation models. Parallel corpora are created and converted into factored parallel corpora using pre-processing. English sentences are factored using the Stanford Parser tool, and Tamil sentences are factored using the Tamil POS tagger and morphological analyser. A monolingual corpus is collected from various newspapers and factored using the Tamil linguistic tools.
This monolingual corpus is used to build the language model. Finally, in post-processing, the Tamil morphological generator is used to generate a surface word from the output factors.
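The idea of representing each word at several levels can be illustrated with a small sketch. It assumes a pipe-separated token layout (surface|lemma|POS|morphology), which is the convention used by factored models in the Moses toolkit; the choice of four factors, the class and function names, and the English tags shown are illustrative assumptions rather than the tagset or tools used in this thesis.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FactoredToken:
    """One word with its factors (hypothetical four-factor layout)."""
    surface: str
    lemma: str
    pos: str
    morph: str

    def to_factored(self) -> str:
        # Join the factors with '|' so each token stays a single field.
        return "|".join((self.surface, self.lemma, self.pos, self.morph))


def parse_factored(token: str) -> FactoredToken:
    """Split a 'surface|lemma|pos|morph' string back into its factors."""
    parts = token.split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 factors, got {len(parts)}: {token!r}")
    return FactoredToken(*parts)


def factor_sentence(tokens: Iterable[FactoredToken]) -> str:
    """Render a sentence as space-separated factored tokens."""
    return " ".join(t.to_factored() for t in tokens)


# Illustrative English example; the factor values are assumptions.
sentence = [
    FactoredToken("boys", "boy", "NNS", "plural"),
    FactoredToken("played", "play", "VBD", "past"),
]
line = factor_sentence(sentence)
print(line)  # boys|boy|NNS|plural played|play|VBD|past
```

Keeping the lemma and morphological features as separate factors lets a translation model generalize over the lemma even when a particular inflected surface form is rare in the training data.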

Figure 1.1 Morphology based Factored SMT for English to Tamil language

Details of Pre-processing English Language Sentence

A machine translation system for a language pair with disparate morphological structure needs appropriate pre-processing or modeling before translation. Pre-processing is performed on the raw source language sentence to make it more appropriate for translation into the target language. The pre-processing module for English language sentences consists of reordering, factorization and compounding.

Reordering English Language Sentence

Reordering means rearranging the word order of the source language sentence into a word order that is closer to that of the target language sentence. It is an important process for languages which differ in their syntactic structure. The English and Tamil language pair has disparate syntactic structure: English word order is Subject-Verb-Object (SVO), whereas Tamil word order is Subject-Object-Verb (SOV). For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between the subject and the object [11]. English syntactic relations are retrieved from the Stanford Parser tool, and the source language sentence is reordered based on the reordering rules.
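A toy sketch of the SVO to SOV idea is given below. It is only an illustration under strong assumptions: the constituents are labelled by hand with hypothetical role names (SUBJ, VERB, OBJ) instead of being derived from Stanford Parser output, and a single rule (move the verb to the end of the clause) stands in for the full set of reordering rules.

```python
from typing import List, Tuple

# A chunk is (role, text). The role labels are hypothetical stand-ins
# for the syntactic relations a parser would provide.
Chunk = Tuple[str, str]


def reorder_svo_to_sov(chunks: List[Chunk]) -> List[Chunk]:
    """Move verb chunks to the end of the clause.

    The relative order of all other chunks is preserved, so an
    SVO clause becomes SOV.
    """
    non_verbs = [c for c in chunks if c[0] != "VERB"]
    verbs = [c for c in chunks if c[0] == "VERB"]
    return non_verbs + verbs


clause = [("SUBJ", "Rama"), ("VERB", "ate"), ("OBJ", "an apple")]
reordered = reorder_svo_to_sov(clause)
print(" ".join(text for _, text in reordered))  # Rama an apple ate
```

A clause that is already verb-final is returned unchanged, so the rule can be applied to every clause without first checking its order.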