Shiva and Shakti English-Hindi Machine Translation Systems

Similar documents
Overview of MT techniques. Malek Boualem (FT)

The history of machine translation in a nutshell

BILINGUAL TRANSLATION SYSTEM

Learning Translation Rules from Bilingual English Filipino Corpus

Comprendium Translator System Overview

The history of machine translation in a nutshell

A Machine Translation System Between a Pair of Closely Related Languages

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

TRANSLATION OF TELUGU-MARATHI AND VICE- VERSA USING RULE BASED MACHINE TRANSLATION

Application of Natural Language Interface to a Machine Translation Problem

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System

Introduction. Philipp Koehn. 28 January 2016

NATURAL LANGUAGE TO SQL CONVERSION SYSTEM

Context Grammar and POS Tagging

Semantic analysis of text and speech

4.1 Multilingual versus bilingual systems

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

What s in a Lexicon. The Lexicon. Lexicon vs. Dictionary. What kind of Information should a Lexicon contain?

Correlation: ELLIS. English language Learning and Instruction System. and the TOEFL. Test Of English as a Foreign Language

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Semantic annotation of requirements for automatic UML class diagram generation

Maskinöversättning F2 Översättningssvårigheter + Översättningsstrategier

Natural Language to Relational Query by Using Parsing Compiler

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*

Hybrid Strategies. for better products and shorter time-to-market

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

Statistical Machine Translation

Customizing an English-Korean Machine Translation System for Patent Translation *

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Processing: current projects and research at the IXA Group

Special Topics in Computer Science

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Language and Computation

NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR

Natural Language Database Interface for the Community Based Monitoring System *

Identifying Focus, Techniques and Domain of Scientific Papers

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Machine Translation Approaches and Survey for Indian Languages

ONLINE ENGLISH LANGUAGE RESOURCES

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Building a Question Classifier for a TREC-Style Question Answering System

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

Hybrid Machine Translation Guided by a Rule Based System

AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS

Paraphrasing controlled English texts

An English-to-Arabic Prototype Machine Translator for Statistical Sentences

A Mixed Trigrams Approach for Context Sensitive Spell Checking

CALICO Journal, Volume 9 Number 1 9

Improving statistical POS tagging using Linguistic feature for Hindi and Telugu

LINGUISTIC PROCESSING IN THE ATLAS PROJECT

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang

The Role of Sentence Structure in Recognizing Textual Entailment

ISA OR NOT ISA: THE INTERLINGUAL DILEMMA FOR MACHINE TRANSLATION

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS

Extracting translation relations for humanreadable dictionaries from bilingual text

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

31 Case Studies: Java Natural Language Tools Available on the Web

Opentrad: bringing to the market open source based Machine Translators

Collecting Polish German Parallel Corpora in the Internet

Classification of Natural Language Interfaces to Databases based on the Architectures

Master of Arts in Linguistics Syllabus

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University

Automation of Translation: Past, Presence, and Future Karl Heinz Freigang, Universität des Saarlandes, Saarbrücken

Presented to The Federal Big Data Working Group Meetup On 07 June 2014 By Chuck Rehberg, CTO Semantic Insights a Division of Trigent Software

TRANSLATING POLISH TEXTS INTO SIGN LANGUAGE IN THE TGT SYSTEM

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

How to make Ontologies self-building from Wiki-Texts

Interlingua-based Machine Translation Systems: UNL versus Other Interlinguas

Annotation Guidelines for Dutch-English Word Alignment

PROMT Technologies for Translation and Big Data

Interactive Dynamic Information Extraction

Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU

The Oxford Learner s Dictionary of Academic English

Oxford Dictionary of Current Idiomatic English

Comparing Ontology-based and Corpusbased Domain Annotations in WordNet.

GMAT.cz GMAT.cz KET (Key English Test) Preparating Course Syllabus

Prototype Machine Translation System From Text-To-Indian Sign Language

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Albert Pye and Ravensmere Schools Grammar Curriculum

A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students

Example-based Translation without Parallel Corpora: First experiments on a prototype

An Overview of Applied Linguistics

National Quali cations SPECIMEN ONLY

The Prolog Interface to the Unstructured Information Management Architecture

Word Completion and Prediction in Hebrew

Machine Translation as a translator's tool. Oleg Vigodsky Argonaut Ltd. (Translation Agency)

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

MACHINE TRANSLATION FROM ENGLISH TO ARABIC

English Appendix 2: Vocabulary, grammar and punctuation

Computer Standards & Interfaces

Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3

An Iterative approach to extract dictionaries from Wikipedia for under-resourced languages

Transcription:

Shiva and Shakti English-Hindi Machine Translation Systems T.Papi Reddy papi reddy@students.iiit.net International Institute of Information Technology-Hyderabad,AP conference name p.1/30

Introduction A Brief History Different kinds of approaches Shiva Shakti Problems in MT Lexical Resources needed for MT Conclusion conference name p.2/30

A Brief History Beginning (1950 s) Dark period (1964-1979) Re-Birth (1980) A new paradigm (1990) MT in India conference name p.3/30

Beginning (1950 s) Very optimistic attitude towards MT Very high expectations Mainly from English to Russian and vice-versa Successfull demonstration of an MT system with restricted vocabulary and grammar by Georgetown university and IBM in 1956 Attracted huge funding inspired many other MT projects Scaling up of restricted vocabulary and grammar did not produce satisfactory results (infact deteriorated the system s performance) conference name p.4/30

Dark period (1964-1979) Report of Automatic Language Processing Advisory Committe (ALPAC) MT not possible and practical in near future! :( Funding for MT stopped in US However research continued in canada and a very few other places. conference name p.5/30

Re-Birth (1979) Installation of Taum-Meteo system for translation of weather bulleins from English to French in Canada in 1979 Limited subject domain produced good translation :) Japanese MT projects Mu project at Kyoto Univ (1982) Eurotra MT project in Europe for many European languages Demand came from different sources conference name p.6/30

New Paradigm (1990) Candide: A purely statistical MT system from IBM Usage of corpus based approaches in Japanese MT systems Focus changed from purely research to practical applications By the end of 90 s, huge growth in usage of MT systems for different purposes conference name p.7/30

MT in India MT was taken seriously only in 90 s Anusaaraka (language accessor) among Indian languages built at IITK and University of Hyderabad English to Hindi MT systems NCST CDAC IITK SHIVA - CMU, IISC-B, LTRC IIIT SHAKTI - LTRC IIIT conference name p.8/30

Different approaches Rule-based : A lingust or a language expert gives necessary rules to generate translation of source language into a target language Corpus-based : Machine tries to learn the rules necessary for translation from a parallel corpus. Statistical Example-based Ex: Shiva Hybrid approach A combination of both the above approaches Ex: Shakti conference name p.9/30

Shiva Example-based Requires huge amounts of parallel corpora Learns by associating word sequences in English with word sequences in Hindi English-Hindi parallel corpora not readily available! :( Controlled corpora phrasal and sentence level translations conference name p.10/30

Shakti Hybrid system Uses both the rule-based and corpus-based approaches In addition to the rules provided by the language expert, it uses corpus based techniques in translation conference name p.11/30

Architecture Source language(english) dependent modules English sentence English sentence Analysis parsed output WSD senses marked conference name p.12/30

Architecture Bilingual modules parsed output Transfer Grammar reordered output English-Hindi Dictornary lookup Hindi words substituted TAM lookup Hindi TAM substituted conference name p.13/30

Architecture Source Language (Hindi) sentence Synthesis representation in Hindi vibhakti Insertion vibhaktis inserted Agreement Word form generation approriate word forms generated Hindi sentence conference name p.14/30

English Sentence Analysis Morphological analyzer root and other features of each word more than one analsysis possible went (root : go, lex_cat : v) saw (root : see, lex_cat : v) or saw (root : saw, lex_cat : v) Parts of speach (POS) tagger single parts of speech tag for each word uses statistical language model conference name p.15/30

English Sentence Analysis Chunker verb groups noun phrases Ex: Godhra_FW (( is_vbz simmering_vbg )) with_in [[ anger_nn ]] Parser sentence analysis intermediate representation conference name p.16/30

Bilingual Analysis Transfer Rules Result is a re-ordered intermediate representation for Hindi. Word substitution from Bilingual dictionary. sense taken into account TAM substitution from Bilingual TAM dictionary. conference name p.17/30

Transfer grammar Reordering rules (Ram) 1 (saw) 2 (the girl) 3 (in the garden) 4 (Ram) 1 (the girl) 3 (the garden in) 4 (saw) 2 Mirror principle Verb Frames (TransLexGram) verb : meet verb frame English : A meets B Hindi : A B sao milaa Machine learned rules very specific rules less coverage conference name p.18/30

Tense Aspect Modality(TAM) Tense Aspect Modality(TAM) Ram is going home TAM : is_ing Hindi TAM : 0_ rha ho conference name p.19/30

Engineering Principle Non-lexical grammar: 6 rules mirror principle Category based grammar : 60 verb classes Lexicalized grammar : 6000 verbs TransLexGram Fine grained grammar : 60,000 rules learn from a corpus conference name p.20/30

Hindi Sentence Synthesis Vibhakti Insertion rule based if verb = transitive and tense = past then subj + nao corpus based Agreement with in a chunk across the chunk conference name p.21/30

Hindi Sentence Synthesis Agreement of verb subject if no vibhakti else object if no vibhakti else neutral form (masculine, singular, third person) Word form generator takes the root word and its features gives appropriate word form jaa features : v f s 3 0_ raha ho jaa rhi ho conference name p.22/30

Problems in MT Multiple Word Senses Syntactic and Semantic ambiguities Multiple analysis of a single sentence Preposition disambiguation Different sentence structures Idioms conference name p.23/30

Multiple Word Senses bank money bank? river bank? Ex: I deposited hundred rupees in the bank mao nao baomk maom saý $Payao jamaa ikyao conference name p.24/30

Mutliple Word Senses Ex: She ate lunch in the bank ]sanao baomk maom daopahr ka #aanaa #aayaa Ex: She ate lunch on the bank ]sanao iknaara Par daopahr ka #aanaa #aayaa Uses WASP for word sense disambiguation conference name p.25/30

Mutliple Word Senses had Ex: She had lunch in the restaurant ]sanao jalapaana-garh maom daopahr ka #aanaa #aayaa Ex: She had tea in the restaurant ]sanao jalapaana-garh maom caaya ipayaa conference name p.26/30

Different Setence Structures Questions What, Where, When... Yes/No relative clauses conference name p.27/30

Lexical Resources Bilingual dictionary TAM dictionary TransLexGram Transfer lexicon (Bilingual dictionary) Transfer grammar verb frames Example parallel sentences Phrazal dictionary conference name p.28/30

Lexical Resources Idiom dicionary Wordnet (indian languages) Word form generator Annotated corpora POS tagging tree banks Transfer rules conference name p.29/30

Conclusion It is necessary for the linguists and language experts to work hand in hand. Only then we can build practical systems conference name p.30/30