Mohammed Mahmoud Abu Shquier. Khaled M. Al-Howiti



Similar documents
An English-to-Arabic Prototype Machine Translator for Statistical Sentences

MACHINE TRANSLATION FROM ENGLISH TO ARABIC

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH

Overview of MT techniques. Malek Boualem (FT)

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Natural Language Database Interface for the Community Based Monitoring System *

Turker-Assisted Paraphrasing for English-Arabic Machine Translation

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Comprendium Translator System Overview

Natural Language to Relational Query by Using Parsing Compiler

Word Completion and Prediction in Hebrew

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

Learning Translation Rules from Bilingual English Filipino Corpus

Paraphrasing controlled English texts

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*

Brill s rule-based PoS tagger

Customizing an English-Korean Machine Translation System for Patent Translation *

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006

A Machine Translation System Between a Pair of Closely Related Languages

PTE Academic Preparation Course Outline

Interpreting areading Scaled Scores for Instruction

TRANSLATION OF TELUGU-MARATHI AND VICE- VERSA USING RULE BASED MACHINE TRANSLATION

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System

Building a Question Classifier for a TREC-Style Question Answering System

Ling 201 Syntax 1. Jirka Hana April 10, 2006

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students

BILINGUAL TRANSLATION SYSTEM

Correlation: ELLIS. English language Learning and Instruction System. and the TOEFL. Test Of English as a Foreign Language

FUNCTIONAL SKILLS ENGLISH - WRITING LEVEL 2

Index. 344 Grammar and Language Workbook, Grade 8

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

10th Grade Language. Goal ISAT% Objective Description (with content limits) Vocabulary Words

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

LANGUAGE! 4 th Edition, Levels A C, correlated to the South Carolina College and Career Readiness Standards, Grades 3 5

SYNTAX: THE ANALYSIS OF SENTENCE STRUCTURE

Third Grade Language Arts Learning Targets - Common Core

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

Developing LMF-XML Bilingual Dictionaries for Colloquial Arabic Dialects

CHARTES D'ANGLAIS SOMMAIRE. CHARTE NIVEAU A1 Pages 2-4. CHARTE NIVEAU A2 Pages 5-7. CHARTE NIVEAU B1 Pages CHARTE NIVEAU B2 Pages 11-14

Albert Pye and Ravensmere Schools Grammar Curriculum

Hybrid Machine Translation Guided by a Rule Based System

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Statistical Machine Translation

The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems

Introduction. Philipp Koehn. 28 January 2016

stress, intonation and pauses and pronounce English sounds correctly. (b) To speak accurately to the listener(s) about one s thoughts and feelings,

NATURAL LANGUAGE TO SQL CONVERSION SYSTEM

31 Case Studies: Java Natural Language Tools Available on the Web

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

English Appendix 2: Vocabulary, grammar and punctuation

Semantic annotation of requirements for automatic UML class diagram generation

PoS-tagging Italian texts with CORISTagger

Overview of the TACITUS Project

ONLINE ENGLISH LANGUAGE RESOURCES

Interactive Dynamic Information Extraction

POS Tagsets and POS Tagging. Definition. Tokenization. Tagset Design. Automatic POS Tagging Bigram tagging. Maximum Likelihood Estimation 1 / 23

Collecting Polish German Parallel Corpora in the Internet

The Lexicon. The Lexicon. The Lexicon. The Significance of the Lexicon. Australian? The Significance of the Lexicon 澳 大 利 亚 奥 地 利

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.

Section 8 Foreign Languages. Article 1 OVERALL OBJECTIVE

Text-To-Speech Technologies for Mobile Telephony Services

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

Writing Common Core KEY WORDS

Application of Natural Language Interface to a Machine Translation Problem

Context Grammar and POS Tagging

Syntactic Theory on Swedish

Year 1 reading expectations (New Curriculum) Year 1 writing expectations (New Curriculum)

Focus: Reading Unit of Study: Research & Media Literary; Informational Text; Biographies and Autobiographies

This image cannot currently be displayed. Course Catalog. Language Arts Glynlyon, Inc.

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Pupil SPAG Card 1. Terminology for pupils. I Can Date Word

Maskinöversättning F2 Översättningssvårigheter + Översättningsstrategier

KINDGERGARTEN. Listen to a story for a particular reason

UNIVERSITÀ DEGLI STUDI DELL AQUILA CENTRO LINGUISTICO DI ATENEO

Turkish Radiology Dictation System

Comparative Analysis on the Armenian and Korean Languages

Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)

ESL 005 Advanced Grammar and Paragraph Writing

CALICO Journal, Volume 9 Number 1 9

National Quali cations SPECIMEN ONLY

Data Pre-Processing in Spam Detection

Language Arts Literacy Areas of Focus: Grade 6

A Knowledge-based System for Translating FOL Formulas into NL Sentences

From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

Transcription:

Fully Automated Arabic to English Machine Translation System: Transfer-based approach of AE-TBMT. Mohammed Mahmoud Abu Shquier Department of Information Technology, University of Tabuk, Tabuk, KSA Fax: +96644248363 E-mail:mabushquier@ut.edu.sa Khaled M. Al-Howiti Department of Information Science, University of Tabuk, Tabuk, KSA Fax: +96644248363 E-mail: alhowity@hotmail.com Abstract: Arabic Machine Translation (MT) has been widely studied recently. Any Arabic to English Machine Translation (MT) system should be capable of dealing with word order and agreement requirements, agreement rules are crucial for the generation of sentences in the target language. They also serve as rules for the ordering of sentence constituents. Transfer-based technique is currently one of the most widely used methods of machine translation. The idea behind this method is to have an intermediate representation that captures the meaning of the original sentence in order to generate the correct translation. In this paper we have explored several features of Arabic pertinent to MT. The hypothesis under investigation and main aims of this paper are to build a robust lexical Machine Translation (MT) system that will accept Arabic source sentences (SL) and generate English sentences as a target language (TL), and to examine how the challenges imposed by this particular language pair are tackled. The paper represents as well a starting point for the future implementation of a successful Arabic MT engine. The conducted experiment proves that our system (AE-TBMT) has scored the highest percentage by 96.6 percent, this means that only three percent of the entire test examples have not been handled correctly, and this result is considered fair if not good, as the other three systems score below that mark. Keywords: Machine Translation; Transfer-based approach; Arabic Natural Language Processing, AE-TBMT. Biographical notes: Dr. Mohammed M. Abu Shquier is currently acting as an Assistant professor at the University of Tabuk, KSA. Before that he worked as a head of Computer Science Dept. at the University of Jerash, Jordan. He also worked for two years, as a senior lecturer at the American Degree Program (ADP) at Taylors University, Malaysia. Abu Shquier received his Masters degrees from the school of Computer Science at The University of Science Malaysia (USM), and he received his Ph.D from the National University of Malaysia (UKM). Abu Shquier research focuses on developing novel Arabic Machine Translation, He also interested in conducting research in the areas of Computational Linguistics, Informaiton Retrieval and Arabic Natural Language Processing in general. In addition to that he had conducted good research in the area of Arabic Morphology, Syntactic and Symantic. He has published a number of papers in excellent international journals and conference. Abu Shquier has more than 8 years of experience in teaching under-graduate and post-graduate students with different majors 1 Introduction Arabic is a Semitic language spoken by more than 330 million people as a native language extending from the Arabian Gulf in the East to the Atlantic Ocean in the West. Moreover, it is the language in which 1.4 billion Muslims around the world perform their daily prayers (12). At the level of morphology, Arabic is a templatic, inflectional and derivational language (Al-Amoudi et al., 2013; Albared et al., 2010; 2011a; Mohammed and Aziz, 2011) (4) (19) (20) (5). At the level of syntax, Arabic is considered as a subject prodrop language That is, every inflection in a verb paradigm is specified uniquely and need not to use independent pronouns to differentiate the person, number, and gender of the verb. Arabic words are often ambiguous in their morphological analysis (Al-Sughaiyer and Al-Kharashi 2004) (8). As a natural language, Arabic is rich in morphological and syntactic structures. Arabic is also challenging in that it is a derivational or constructional language rather than a concatenative one. Arabic has relatively free world order, mainly, nominal Sentence Subject-Verb-Object (SVO) and Verbal Sentence Verb-Subject-Object (VSO). However, the Copyright c 2014 Inderscience Enterprises Ltd.

2 Mohammed M. Abu Shquier. et al. default sentence structure is (SVO). The version of Arabic we consider in this paper is Modern Standard Arabic (MSA). According to Birch et al. (2009) (2), Arabic Language has several distingishing features that help in the translation process, the list below shows some of these features: 1. Arabic is written from right to left in a horizontal form. 2. Arabic writing sits on the line. 3. There are no capital letters in Arabic. 4. Punctuation is similar to English except for comas which sit on the line instead of under the line. 5. Arabic uses gender for all known nouns, no neutral ones. 6. Space is left between words in a sentence. 7. Some letters change shape depending on whether they are at the beginning, in the middle or at the end of the word. 8. There are 29 letters in Arabic with 3 letter sounds which do not even exist in the English language. 9. Arabic does not distinguish between vowels and consonants; the use of diacratics (a small sign on the top or under the letter) indicates the pronunciation. According to Habash et al. (2009) (23), The Arabic-English language pair is known to behave more monotone than other language pairs, e.g. (Urdu-English or Chinese-English). In Arabic, all nouns are categorized into either feminine or masculine, hence, there is no neutral, and the gender can be either grammatical or natural. The gender of inanimate objects is grammatical, Animate objects have a natural gender, and this gender can be either non-productive or productive. The non-productive gender is the case of nouns where the feminine and the masculine have different lexical entries, i.e., the feminine is not derived from the masculine. By contrast, in the productive gender, the feminine is derived from the masculine, usually by adding a special suffix ta marbuta ( ) to the end of the masculine form (27). To successfully conduct the process of translation, human translators need to have three types of knowledge. The first knowledge of the source language (lexicon, morphology, syntax and semantics) in order to understand the meaning of the source text. Second type is the knowledge of the target language (lexicon, morphology, syntax and semantics) in order to produce a comprehensible, acceptable and wellformed text. The third type is the knowledge of the subject matter. This enables the translator to understand the specific and contextual between source and target language so as to be able to transfer lexical items and syntactic structures of the source language to the best matches in the target language (3). 2 Related Work During the last three decades, Several approaches have been proposed for traslatig Arabic to and from other spoken languages as Arabics morphology poses both a challenge and an opportunity to MT researchers, some of these approaches using rules and grammars, other approaches relied on statistical methods. Nazlia Omar et. al., (2010,2012) (24) (28) developed an Arabic to English Machine Translation for both noun and verb phrases using transfer-based approaches, for the noun phrases MT they have managed to perform the syntactic reordering for this language pair, they achieved reasonable improvements in translation quality over related approach, Their method was tested on 88 thesis titles and journals from the computer science domain. The accuracy of their result was 94.6%. while in the verb phrase MT system their study was to introduce Verbal Sentence rule based Machine Translation, Their system was trained on 45 verbal sentences from different Arabic scientific text and tested on 30 new verbal sentences from different domains. they tested their system against two other machine translation systems namely Systran and Google. The accuracy of the result was 93%. Salem et al. (2008) (26) developed an Interlingual rulebased approach to translate from Arabic to English called UniArab, which is based on the Role and Reference Grammar Linguistic Model (RRG), they used the representation and the logical structure of an Arabic sentence. Their aim was to explore how the characteristics of the Arabic language will effect the development of a Machine Translation (MT) tool from Arabic to English. 3 Challenges of Arabic to English MT Arabic is a highly agglutinative language with a rich set of suffixes. Its inflectional and derivational productions introduce a big growth in the number of possible word forms (9). In Arabic, articles, prepositions, pronouns, etc. can be affixed to adjectives, nouns, verbs and particles to which they are related. The richness in morphology introduces many challenges to the translation problem to and from Arabic. (Khemakhem et. al., 2010) (9) mentioned that the divergence of Arabic and English language pair puts a rocky barrier in building a prosperous machine translation system. Morphological and syntactic preprocessing is important in order to converge this language pair. Arabic words can often be ambiguous due to the tri-literal root system. This system allows the language to evolve and cover a wide range of meanings. In some derivations, one or more of the root letters is dropped, resulting in possible ambiguity. Arabic has a large set of morphological features (Al-Sughaiyer and Al-Kharashi, 2004) (8). These features are in the form of prefixes, suffixes and also infixes that can entirely change the

Fully Automated Arabic to English Machine Translation System: Transfer-based appraoch of AE-TBMT. 3 meaning of the word. Moreover, Arabic has a relatively free word order, this poses another significant challenge to MT due to the vast possibilities to express the same sentence in Arabic. 4 System Design and Architecture In his interesting paper, (Shaalan, 2010) (12), stated that the translation process with transfer approach is decomposed into three steps: analysis, transfer, and generation. In the analysis step, the input sentence is analyzed syntactically (and in some cases semantically) to produce an abstract representation of the source sentence. In the transfer step, this representation is transferred into a corresponding representation in the target language; In the generation step, the target-language output is produced. The (morphological and syntactic) generator is responsible for polishing and producing the surface structure of the target sentence. (English), However. the system involves the following steps. 5 Analysis Module: Analysis module is concerned with the representation of the source language (MSA) by detecting constituent structures and resolving lexical and syntactic ambiguities, however, The analysis is done through five main phases: lexical database, normalization, tokenization, morphology and syntactic analysis phases, a brief on each of these phases is shown below: 5.1 Lexical databases lexical resources are basically defined as the information associated with individual words. The field of computational lexicography is concerned with creating and maintaining computerised dictionaries (10). practically, rule-based MT systems can have different dictionaries, some containing the core entries, while others containing specialised vocabulary. However, beside the developed set of rule for grammar, derivation, stemming, determinant, and others to be used in the translation process; our transfer-based system will maintain a database for: 5.2 Normalization Normalization is a process that aims to ensure that the Arabic text is steady and predictable for tokenization (Shirko et al., 2010 ) (24). In this module, the following processes are performed 1. Removal of diacritics, redundant and misspelled space 2. Retrieving deleted characters. In Arabic, some characters forming nouns or verbs are deleted due to their position in a sentence or when they are preceded by a special prepositions 3. Resolution of the orthographic ambiguity and in Arabic Removing the stretching character 5.3 Tokenisation Tokenisation: The term token refers to an abstraction for the smallest unit in a text that is considered when describing the syntax of a language. A process of tokenization can be used to split the sentence into word tokens. The token can be a word, a part of a word (or clitic), a multiword expression, or a punctuation mark (Attia, 2007) (18). However, we can reformulate Arabic tokens as the following expression; Arabic token = proclitic(s) (prepositions, conjunctions or determiners) + affix(es) (tense, genus or number marks) + root + enclitic(s) (pronouns or possessives). The tokenization in our system extract clitics, the prefixes and the suffixes of each word in the input sentence. Hence, we will decompose the word into prefix-stem, stem, suffix, prefix(es)-stem-suffix(es), or with no changes. The flow of the tokenization process is shown in figure 1 and the output of this step is a list of Arabic words Arabic words list, we then assigns the total number of words in the sentence to the variable name Arabic words list lentgh. We keep the order of words as shown in figure 1 below. A lexicon of all original words/phrases and their derivations in the source and target languages, this lexicon includes the words meaning and their features such as number, gender, person, case, humanity, and alive. A lexicon of the foreign words/phrases in the languages with their features. A lexicon of the irregular words/phrases in the languages with their features.(16) Figure 1 Tokenization Flowchart

4 Mohammed M. Abu Shquier. et al. 5.4 Morphological Analysis Morphological Analysis: in this phase the analyser provides morpho-syntactic information and understanding the relationship among the different forms which a one word can take, the morphological analyser analyzes each word of the MSA sentence morphologically and applies certain rules before implementing the derivation rules (Habash, 2008) (22). Morphological Analyser select proper derivation/inflection rules based on the subject/noun features as well as the verb/adjective category of the input word i.e., (gender, number, person). All of these features should be taken into consideration so as to get the correct derivation rules (Abu Shquier, 2013) (17). According to (apineni et al., 2002) (13), the analysis of words in a machine translation system is needed to determine their syntactic and semantic properties. However, the morphological generator produces the inflected English words in their correct forms. 5.5 Syntactic Analysis Syntactic Analysis: Syntactic analysis, or parsing, is a major component in a rule-based MT system. It is the process by which a sentence is analyzed into constituent parts, to determine grammatical structure. The syntactic analysis process utilises the Arabic dictionary and grammar rules to check the MSA input text in terms of spelling and grammar, then this information is used to produce the analysis of the text structure as an output (Parsing process). The parser divides the sentence into smaller sets depending on their syntactic functions in the sentence (12). There are four types of phrases i.e. Verb Phrase (VP), Noun Phrase (NP), Adjective/Adverbial Phrase (AP), and Prepositional Phrase (PP). The syntactic analysis tries to handle a large difference of sentence constructions, however, once the tokeniser finishes executing, the parser accepts Arabic words list that builds a sentence and output a list of POS as shown in figure 2. We have used Stanford parser for this purpose. This particular process starts by assigning all possible POS i.e., Arabic POS list for each word Arabic words list in MSA entry sentence. After that it uses the rules to choose the POS which is suitable for combining all of the sentence words correctly. The next process is converting the MSA input sentence into a certain data structure representation. After obtaining the Arabic POS list, some semantic features have been applied for every word in Arabic words list, in which it deals with the agreement features between categories such as Subject and Object. It reduces the ambiguity of choosing the meaning of words. Moreover, the syntactic and generation processes analyze the phrasal structure and categories the Arabic sentence to generate the correct English structure sentence. the analysis and generation. However, the module consists of two main processes, they are: 6.1 Lexical Transfer: This step is mainly designed for dictionary translation. according to Hutchins and Somers (1992) (25), the replacement of a source lexical item by a target lexical item. In our system the lexicon is responsible for inferring morphological and classifying verbs, nouns, adverb and adjectives when needed. The task of this step is using the Arabic-English Bi-lingual dictionary to look up the English meaning for each word in the MSA phrase. This process is done word by word maintaining the same order as the MSA source phrase. The output of this step is a list of MSA words and their equivalent English meanings. 6.2 Structural Transfer: This stage deals with the structure and patterns of the target sentences. The task of this step is to queue the words of target sentence up based on the English grammar rules. The transformation is done through two phases: Building a Bilingual dictionary and the transformation between Arabic and English languages; A Bilingual dictionary is an Arabicto-English dictionary that contains the words in Arabic language and their corresponding equivilant meaning in English, however, the part of speech for the (SL) words is also added to the dictionary beside some other features such as humanity, alive, gender, tense, and numbers. The transformation starts after receiving the Arabic words list and Arabic POS list to generate the English words list. The system looks up in the bilingual dictionary for the translation of Arabic words and obtains the corresponding equivilant English words meaning according to the transformer flow chart as shown in Figure 2. 7 Generation Module Generation is concerned with rendering the output of the target language (English) in a grammatically acceptable form in terms of its grammar structure and meaning translation. 6 Transformation Module Transformation module is used to translate Arabic sentence structures and words, Transfer is the interface or link between Figure 2 Transformation Flowchart

Fully Automated Arabic to English Machine Translation System: Transfer-based appraoch of AE-TBMT. 5 There are two steps to be accomplished in the generation module which are: morphological generation and syntactic generation. The morphological generator utilises English grammar rules to construct the correct forms of the inflected English words (6). However, the task of the syntactic generation is to generate the English sentence in its final structure version. The syntactic generation process accepts the English words list to generate a sentence in a target language (English). It is the second final phase that reordering translated words according to various English rules as shown in figure 3. The target language is generated from source language sentences according to some of the following rules (7): iii. Apply semantic anaysis for every word in Arabic words list[i]. iv. Produce an English sentence structure ( /VBD /DTNN /DTNN /DTJJ). v. The output is an array containing the parts of speech like noun, verb, auxiliaries, adjective, preposition etc. 1. Arabic verb phrase sentences: 1.1. The verb in English sentence preceeds the subject. 1.2. The subject in English sentence preceeds the object. 2. Noun phrase in both Arabic and English sentence have the same order. let us take an example on how the system handles the SL-TL word ordering based on the rules mentioned earlier, for the (SL) where XAL denotes, the corresponding English sentence matches the rule VBD/3;DT/1;XAL/1;NNS/2;DT/4;XAL/4;NN/6; XAL/4;JJ/5; then the reordering database matches the rule DT/1;NNS/2;VBD/3;DT/4;JJ/5;NN/6 based on the sequence the/dt students/nns solved/vbd the/dt difficult/jj problem/nn. The flow of the reordering process is shown in figure 3. Figure 3 Word Ordering Flowchart 8 Implementation and Design this sections manifests the proposed prototype and the entire translation process. full example with processes is also shown in figure 4, the designed prototype utilized the module developed by Hamdy N. Agiza (2012) (7). 1. Analysis Module (Arabic text) 1.1. Input an Arabic sentence (SL) 1.2. Tokenizer i. Divide the Arabic SL into tokens. ii. The result obtained is an n-sized array (n= the number of words). 1.3. Arabic (SL) Parsing (the parsing flow is shown in figure 1.) i. Accepts a list of words Arabic words list[i]. ii. Get Arabic POS list[i]. (Parsing is done by using Stanford parser). Figure 4 System Architecture flowchart

6 Mohammed M. Abu Shquier. et al. 2. Transfer Module (Arabic-English transformation is shown in figure 2) 2.1. Bilingual Dictionary (Arabic-English transformation): This is an Arabic-to-English dictionary that contains the words in Arabic and their corresponding translation in English. Collection of words captures variously from dictionaries, books, newspapers and media. i. Arabic POS words is added to the dictionary beside some other features such as humanity, gender, tense, and numbers. ii. Translate Arabic words to their equivilant English meaning as specified in the bilingual dictionary and the Arabic POS list[i]. iii. The module accepts Arabic words list[i] and Arabic POS list[i]. iv. The output is English words list[i]. 3. Generation Module (English text) 3.1. Synthesis rules of TL (English) i. The system accepts English words list[i]. ii. Reorder English words list[i] based on the English structure list[i]. iii. Generate English sentence. 3.2. English Morphology i. Match English Morphological rules with reordering English words list [i] to obtain a satisfactory translated English TL as shown in figure 3. The system architecture and design is illustrated with an example in figures 4 and 5 respectively. 9 Experiment and Results In order to judge the translation accuracy received by AE- TBMT; we have developed an evaluation methodology. This methodology is based on a comparison between the system outputs with the original translation of the input text. The following steps describe the conducted methodology: 1. Run the system on the selected test case. 2. Compare the original translation with the system output. 3. Classify the problems that arise from the mismatches between the two translations. 4. Assign a suitable score for each problem. A range of score between 0 and 10 determines the accuracy of the translation. While 0 indicates absolutely incorrect translation and 10 indicates absolutely correct (matched) translation. Figure 5 System Architecture with example 5. When a situation belongs to multiple problems compute its score average. 6. Determine the correctness of the test case by computing the percentage of the total scores. In order to improve the translation output, the evaluation methodology is applied on successive stages that include a cycle of translation, error identification, correction, and re-translation until no more changes can be made. In the following subsections we describe the conducted experiment that evaluate the system and incrementally improve its output. 9.1 Experiment The purpose of this experiment is to investigate whether the following machine translation systems, namely, ALKAFI, GOOGLE, TARJIM and our system, are sufficiently robust for conherent translating between Arabic and English. The evaluation methodology is applied on 130 independent test examples taken from different Arabic scientific text and different domain, we call this test group as (test suit). Basically, the methodology is based on applying comparison between the outputs of the MT systems and the original translation for the test examples. The experiment gives the following results as shown in table 1 and figure 6 below.

Fully Automated Arabic to English Machine Translation System: Transfer-based appraoch of AE-TBMT. 7 Table 1 Result of test suit experiment. Al-Kafi Google Tarjim AE-TBMT Matches Sentences 97 94 87 112 Mismatches Sentences 33 36 43 18 Total Score of Matches Sentences 970 940 870 1120 Total Score of Mismatches Sentences 247.6 359.2 313.9 136.8 Matches Sentences 1217.6 1199.2 1183.9 1256.8 Percentage 93.6% 92.2% 91.1% 96.6% The percentage of the total score for each system has been found by dividing the total score by 1300; as we have 130 test examples and each is evaluated out of 10. we have classified the problem caused ill-translaiton and assigned suitable scores for them based on their weight; we have classified the problems according to the following categories as follow: 1. Article-Noun: This problem appeared because the noun phrase that are preceded by a(n) is translated as if it were preceded by the. In other words, the translation nouns and adjectives of this noun phrase are defined. We give an output that belongs to this problem 9. 2. Adjective-Noun: We give an output that belongs to this problem 8. 3. Verb-Subject: We give an output that belongs to this problem 8. 4. Demonstrative-Noun: We give an output that belongs to this problem 8. 5. Relative Pronoun-Antecedent: We give an output that belongs to this problem 7. 6. Predicate-Subject: We give an output that belongs to this problem 7. Order of the adjective: This problem appeared because the translation of the adjective relative to its described noun is not translated in its right order. In other words, the adjective does not follow the described noun in order. We give an output that belongs to this problem 7. 8. Successive words form an expression:this problem appeared because the successive words that form an expression are translated separately. We give an output that belongs to this problem 8. 9. Rough addition and deletion: This problem appeared because the original translation contains extra words that have no corresponding words in the input of the source language. We give an output that belongs to this problem 7. Figure 6 Test Suit results 9.2 Type of Error Frequencies with English-Arabic MT Table 2 represents all type of errors returned by each of the examined system, namely, Alkafi, Google, Tarjim and AE- TBMT, and their frequencies. If we examined the first row for the Article-Noun agreement we will find that this type of error frequented 4 times with Alkafi, 28 times with Google, 4 times with Tarjim Sakhar and only 2 times with our system. Therefore, as a total this type of error frequented 38 times with all of the systems. Figure 7 and figure 8 are representing the type of errors received after getting the translation with their frequencies for the Arabic MT system i.e., Al-Kafi, Google and Targim against our system (AE-TBMT). The conducted experiment shown that our system has scored the highest percentage by 96.6 percent, this means that less than four percent of the entire test examples have not been handled correctly, and this result is considered fair if not good, as the other three systems score below that mark. 10 Conclusion and future work In this study we presented a transfer-based approach to handle the translation of MSA into English. This paper shows that many shortcomings in the output of MT are due to either faulty analysis of the SL text or faulty generation of the TL text. The improvement to the translation can be done only by formalizing our linguistic knowledge and enriching the computer with adequate rules to deal with the linguistic phenomenon (Abu Shquier and Sembok, 2008) (15). the contribution of this paper can be summurised as follows: first: the development of patterns for Arabic and English sentences for translation purposes, second: the development of a MT system prototype which is superior as compared to other three existing systems. third: Highlighting major problems with

8 Mohammed M. Abu Shquier. et al. Table 2 Type of Error Frequencies with Arabic MT system against AE-TBMT. Error Error Type Frequency Error Al-Kafi Google Tarjim AE- Percentage TBMT 1 Article-Noun Agreement 38 10.67% 4 28 4 2 2 Adjective-Noun Agreement 89 25% 19 33 19 18 3 Verb-Subject Agreement 50 14.04% 12 25 6 7 4 Demonstrative-Noun Agreement 5 1.40% 0 3 1 1 5 Pronoun-Antecedent Agreement 18 5.05% 6 6 3 3 6 Predicate-Subject Agreement 47 13.20% 16 10 16 5 7 Order of the adjective 25 7.02% 2 20 2 1 8 Successive words form an expression 3 0.84% 1 0 2 0 9 Rough addition and deletion 81 22.75% 19 41 16 5 Total Frequencies of Errors 356 79 166 69 42 guides that need to be adopted to help deciding the meaning and choosing the right equivalent translation between this particular language pair. References Figure 7 Summary errors results [1] Abdelhadi Soudi, Günter Neumann, and Antal Van. den Bosch. Arabic computational morphology: knowledgebased and empirical methods. Springer, 2007. [2] Alexandra Birch, Phil Blunsom, and Miles Osborne. A quantitative analysis of reordering phenomena. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 197 205. Association for Computational Linguistics, 2009. [3] Alsaket, A.J. and M.J.A. Aziz. Arabic to english machine translation of verb phrases using rule-based approach. J. Comput. Sci., 10: 1062-1068, 2014. Figure 8 Translation percentage results current Arabic to English MT systems and suggest solutions to resolve these problems, and fourth: the construction of a tests suite; that has been used in testing different features that cause inaccurate translation in three Arabic Machine Translation systems, they are, ALKAFI, GOOGLE, TARJIM SAKHR versus AE-TBMT. These examples have been used in exploring and evaluating the faulty translation, In the experiment, we have classified the problems into nine categories and we compare the outputs of those four particular systems with the original translation of the SL. The experiment proves that AE-TBMT has scored the highest percentage. Experimet sheds light on some major issues of available MT systems; i.e., Addition and deletion are serious problems that the developer of Arabic MT systems have to look at. Spelling is another issue that requires attention. The issues discussed herein need to have developed rules and grammars in the future to give full coherent meaning. The lexical environment and collocations are very important [4] Arwa Al-Amoudi, Hailah AlMazrua, Hebah Al- Moaiqel, Noura AlOmar, and Sarah Al-Koblan. An exploratory study of arabic language support in software project management tools. International Journal of Computer Science Issues (IJCSI), 10(4), 2013. [5] Ehsan. A Mohammed and Mohd. J Ab. Aziz. English to arabic machine translation based on reordring algorithm. Journal of Computer Science, 7(1):120, 2011. [6] H. Al-Barhamtoshy and W. Al-Jideebi. Designing and implementing arabic wordnet semantic-based. In the 9th Conference on Language Engineering, pages 23 24, 2009. [7] Hamdy. N Agiza, Ahmed. E Hassan, and Noura Salah. An english-to-arabic prototype machine translator for statistical sentences. Intelligent Information Management, 4:13, 2012. [8] Imad. A Al-Sughaiyer and Ibrahim. A Al-Kharashi. Arabic morphological analysis techniques: A comprehensive survey. Journal of the American

Fully Automated Arabic to English Machine Translation System: Transfer-based appraoch of AE-TBMT. 9 Society for Information Science and Technology, 55(3):189 213, 2004. [9] Ines. Turki Khemakhem, Salma Jamoussi, and Abdelmajid. Ben Hamadou. The miracl arabic-english statistical machine translation system for iwslt 2010. In IWSLT, pages 119 125, 2010. [10] John Hutchins. Machine translation: A concise history. Computer aided translation: Theory and practice, 2007. [11] Kenneth. R Beesley. Finite-state morphological analysis and generation of arabic at xerox research: Status and plans in 2001. In ACL Workshop on Arabic Language Processing: Status and Perspective, volume. 1, pages 1 8, 2001. [12] Khaled Shaalan. Rule-based approach in arabic natural language processing. The International Journal on Information and Communication Technologies (IJICT), 3(3):11 19, 2010. [13] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311 318. Association for Computational Linguistics, 2002. [14] M. M. Abu Shquier, Mohammed. M Al. Nabhan, and Tengku. Mohammed Sembok. Adopting new rules in rule-based machine translation. In Computer Modelling and Simulation (UKSim), 2010 12th International Conference on, pages 62 67. IEEE, 2010. [15] Mohammed M. Abu Shquier and Tengku Mohd. T Sembok. Word agreement and ordering in englisharabic machine translation. In Information Technology, 2008. ITSim 2008. International Symposium on, volume. 1, pages 1 10. IEEE, 2008. [16] Mohammed M. Abu Shquier, KSA Tabuk, and Omer M. Abu Shqeer. Hybrid-based approach to handle irregular verb-subject agreements in english-arabic machine translation. [17] Mohammed M. Abu Shquier. Computational approach to the derivation and inflection of arabic irregular verbs in english-arabic machine translation. Int. Journal of Advancement in Computing Technology IJACT, Vol. 5, No. 15, pp. 1 21, 2013 [20] Mohammed Albared, Nazlia Omar, and Mohd. Juzaiddin Ab. Aziz. Developing a competitive hmm arabic pos tagger using small training corpora. In Intelligent Information and Database Systems, pages 288 296. Springer, 2011. [21] Mohamed Attia Mohamed. Elaraby Ahmed. ALarge- SCALE COMPUTATIONAL PROCESSOR OF THE ARABIC MORPHOLOGY,AND APPLICATIONS. PhD thesis, Faculty of Engineering, Cairo University, 2000. [22] Nizar Habash. Four techniques for online handling of out-of-vocabulary words in arabic-english statistical machine translation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short 08, pages 57 60, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. [23] Nizar Habash and Jun Hu. Improving arabic-chinese statistical machine translation using english as pivot language. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 173 181. Association for Computational Linguistics, 2009. [24] Omar Shirko, Nazlia Omar, Haslina Arshad, and Mohammed Albared. Machine translation of noun phrases from arabic to english using transfer-based approach. Journal of Computer Science, 6(3):350, 2010. [25] William. John Hutchins and Harold. L Somers. An introduction to machine translation, volume 362. Academic Press London, 1992. [26] Yasser Salem, Arnold Hensman, and Brian Nolan. Implementing arabic-to-english machine translation using the role and reference grammar linguistic model. 2008. [27] Yasser Salem, Arnold Hensman, and Brian Nolan. Towards arabic to english machine translation. 2008. [28] Zainab. Abd Algani and Nazlia Omar. Arabic to english machine translation of verb phrases using rule-based approach. Journal of Computer Science, 8(3), 2012. [18] Mohammed. A Attia. Arabic tokenization system. In Proceedings of the 2007 workshop on computational approaches to semitic languages: Common issues and resources, pages 65 72. Association for Computational Linguistics, 2007. [19] Mohammed Albared, Nazlia Omar, Mohd. Juzaiddin Ab. Aziz, and Mohd Zakree. Ahmad Nazri. Automatic part of speech tagging for arabic: an experiment using bigram hidden markov model. In Rough Set and Knowledge Technology, pages 361 370. Springer, 2010.