Automated Online English-Arabic Translator

Faculty of Engineering & Architecture
Electrical & Computer Engineering Department
Final Year Project - Spring 2006
Automated Online English-Arabic Translator
Supervisor: Prof. Ali El Hajj
Report Grader: Ali Chehab
Group #26: Hicham Yamout & Karim Kanso

Table of Contents

Table of Contents
List of Figures
Introduction
Literature Review
   A. Brief History
   B. Typology of systems and translation demands
      1. Dissemination
      2. Assimilation
      3. Interchange
      4. Database Access
   C. Fully automatic systems for dissemination
      1. Post-editing MT output
      2. Preparing input for Machine Translator
      3. Controlled language and Machine Translator
      4. Lexical Resources
   D. Types of Machine Based Translation Systems
Design & Analysis
   A. Tools
   B. Coding
   C. Rules
   D. Interfaces
   E. Database Design
Implementation
   A. PHP, MySQL, Apache
   B. Database
   C. Main Page
   D. Text Selection / Creating Rules
   E. Add words / inserting to DB
   F. Coding translation
Evaluation
Project Constraints
Conclusion
Appendix
   A. Appendix A Time Table
   B. Appendix B Sample Coding
   C. Appendix C Installing necessary software
References
   A. Papers
      Websites - Books
      Dictionaries

List of Figures

Figure 1: Human-aided machine translator
Figure 2: Translation Process
Figure 3: User input Interface
Figure 4: AMA Table Design
Figure 5: Rule Table Design
Figure 6: Sample table bea
Figure 7: some rows of table zzzrule
Figure 8: some rows of table zzzrule
Figure 9: some rows of table zzzen2ar
Figure 10: sample translation
Figure 11: Main Page
Figure 12: sample translation
Figure 13: Add words Input Form
Figure 14: omit_space
Figure 15: sample translation
Figure 16: sample translation
Figure 17: sample db zzzrule
Figure 18: sample translation
Figure 19: sample translation
Figure 20: Add words form - error control
Figure 21: sample translation
Figure 22: sample translation

Introduction

People around the globe have access to many texts, papers, articles and websites over the internet. However, not all of these texts are in their native language or in a language they understand, so people need to translate them. Translation is an activity comprising the interpretation of the meaning of a text in one language and the production of a new, equivalent text in another. It should ensure that both the source and the target texts communicate the same message, while taking into account a number of constraints, including context, the rules of grammar of the source language, its writing conventions, its idioms and the like.

Traditionally, translation has been a human activity. There have, however, been many attempts to automate it (machine translation) or computerize it (computer-assisted translation). One major advance in this field is the digital translator, which provides word-by-word translation; this is time consuming when translating large texts. The same applies to the many online translators that can translate only single words. The real challenge is to translate texts and not only words, to generate meanings instantaneously, and to translate them accurately.

People need to understand Arabic, and Arabs need to understand other languages over the internet, through an automated system and at no cost. An automated system cannot replace human translation, but it can help those who need an immediate translation of some paragraphs or texts. Our aim is to design an automated, accurate English-Arabic online translator that can handle a large number of translation requests and respond efficiently and accurately. Our FYP aims at helping people understand and deal with foreign languages.

When someone browsing the internet finds a text in a language he does not understand, he simply copies that text and pastes it into a specific place on the internet to get its translation. This is, in simple terms, what we are trying to design. Our Automated Online Translator would provide people with an easy and quick way of translating languages via the internet, including the translation of words, paragraphs, texts and, if time permits, even websites.

This service can be implemented as an application or a machine translator to be used on the internet, which we call an Automated Online Arabic Translator since we will be working on translating from English to Arabic. The challenge in machine translation is to program a computer to understand a text as a human being does and to produce a new text in the target language that sounds as if it had been written by a human being. The automated procedure we are designing is much easier and quicker than translating the words of a paragraph one at a time using a dictionary or a digital pocket translator. It is also important to make the translator easy for everyone to use.

Nowadays most machine translation systems produce output that is in many cases a poor translation of the input. What machine translator designers mostly care about is conveying the general meaning of the text to the user, because implementing computer translation systems is a hard task and getting very accurate results is a real challenge. We will try to reach a system performance that stays as close as possible to the structure and meaning of the given input. The design, the implementation, and the problems that we may encounter in this kind of project are described in the rest of this final year project report.

We need to distinguish here two basic types of machine translators. The first is designed to assist human translators during the translation process, for example with linguistic rules and grammar. The second is the wholly automatic system that attempts to translate sentences and texts without any human intervention during the process of translation. It is this second type that we will be trying to implement in our project.

Machine translators operate according to different algorithms and translation policies. Some work from statistics and are called "Statistical Machine Translators" (SMT); others are based on rules and are called "Rule-Based Machine Translators". A third kind translates according to text examples and is called "Example-Based Machine Translators".

Our policy for the database design will mostly follow the rule-based and example-based approaches, because together they will help us build a well-structured database that provides good translation output for almost all kinds of input. More details about the types of machine translators, the types of databases available and their structure, as well as our work, design and implementation of the system, are given in the literature review and the design and analysis sections of this report.

I- Literature Review

When we think about computer-based machine translation, important questions come to mind: Why is it interesting to develop computer-based machine translators? Why are they considered important? How useful can they be, and what benefits can we gain from developing them?

Translation using computers helps people complete their translations faster. A large amount of data can be translated in a short time, which reduces human effort and time consumption. Companies and organizations may require large amounts of data to be translated instantaneously, and here computer-based translation is a very useful tool, especially when the same words occur many times and must be translated the same way each time for technical reasons. Human translators, on the other hand, need a lot of time to translate long texts, and because human beings tend to seek variety they may translate the same word differently within the same text, which makes consistency even harder to achieve. The fact that computer translation is not very accurate and that its quality is not very high does not prevent people from using it, because in practice speed is needed and an understandable level of translation can be enough. A very important advantage of computer translation, especially for companies and in business, is the reduced cost: computer-based translation is faster and more economical than human translation.

A. Brief History:

Machine translation is not as new an idea as many believe.
In fact, machine translation was considered before electronic digital computers existed, and the idea of developing a computer-based translation machine was proposed in 1947, when the first non-military computers were being built. Serious work in the domain started in 1949, and a little later machine translation became the first non-numerical application of computers. In 1959 a system was installed by IBM at the Foreign Technology Division of the US Air Force, and in 1963 and 1964 Georgetown University, home to one of the largest research projects at the time, installed systems at Euratom (the European Atomic Energy Community) and at the US Atomic Energy Commission. But in 1966 a critical report on machine translation appeared from a committee set up by most of the major sponsors of machine translation research in the United States (ALPAC 1966). From the time of the first studies until the 1970s, the output of such systems was considered poor, to the extent that many American projects came to an end and the Americans considered stopping work in the field. Nevertheless research continued, and in 1970 the Systran translation system was installed at the US Air Force (replacing the old IBM system); that system for Russian-to-English translation is still in use to this day.

In the 1980s there was a surge of research in machine translation. Japanese companies, for example, began producing commercial systems, and computerized translation aids became more familiar to professional translators. In 1981 came the first translation software for the newly introduced personal computers, and gradually machine translators came into more widespread use. Machine translation has developed a lot since then, and in the last five years translation applications have been made available on the internet for people worldwide. Research and experience in developing and improving machine translation systems have shown that the more restricted the subject domain of the application, the more accurate the output and the better the quality of the results.

B. Typology of systems and translation demands:

Machine translators, without getting into details of algorithms or implementation, are divided into two basic types. According to John Hutchins's article "Current commercial machine translation systems", the first type is the "automatic type" and the second type is designed to provide aids to human translators.
Automatic machine translation systems perform translation without any human intervention. Such a system's function is to translate sentences and texts as wholes. A disadvantage of automatic translation systems is that the output is poor, so companies, for example, have to provide human assistance to improve quality. The second type comprises systems designed to provide technical linguistic help to human translators. This kind of translator provides aids for human translators in the form of dictionaries and grammars. Databases of texts and their translations are built into the system as a supplementary aid.

Any type of machine translator can be used for four basic functions: Dissemination, Assimilation, Interchange, and Database Access.

1. Dissemination:

A translation machine has the function of dissemination when it is used to produce translations of publishable quality or translations of published texts. Such translations are needed by organizations. The raw output is usually not adequate and needs human assistance (translators, revisers, terminologists), such as revision of the output (post-editing), as the diagram below of a machine in a dissemination function illustrates.

Figure 1: Human-aided machine translator

The output can also be improved by updating the dictionary of the system, by augmenting lexical information from a database of approved terminology to ensure consistency of usage, and by using a substantial body of translated texts as a basic corpus on which to design the system. As mentioned before, restricting the translation machine to a specific domain increases the accuracy and improves the quality of the output. The machine translation engine in the diagram above ("MT engine") can be dedicated to a certain domain, for example engineering or medicine, by implementing definitions and constraints concerning only that domain.

In a dissemination function, the text input to the machine translator can take different forms. The input can be unedited ("raw") or it can be controlled in some way: either it is pre-edited by inserting various markers in the text to indicate how ambiguities or difficulties can be overcome, or it is composed in a controlled language, a language for which appropriate guidelines are set.
The input can also be designed to be regular and compatible in some way with a specific machine translation system, as the diagram shows. The input to be translated could be internal reports, operational manuals, or repetitive publicity and marketing documents. Operational manuals, for example, often run to many thousands of pages, which makes translating them tedious and time consuming for human translators; in addition, these manuals are repetitive and need to be translated into many different languages with frequent updates. To obtain output of the required quality with frequent updates, companies need to use automated translation systems. Guidelines are also set for use at the output level of the system, and the machine translation system should be integrated with technical language.

Machine translators are used a lot by companies and organizations, which should perform all the steps needed to improve the translation system. Enterprise systems are client-server systems: large servers house the machine translation system itself, and PC-type clients are linked over an intranet, often trans-national or transcontinental. The machine translation software should be designed to satisfy the particular needs of the company. The system must come with a large basic dictionary (the larger the dictionary, the better the output) plus substantial technical dictionaries, depending on the specific needs of the company. Company dictionaries will have to be created, so the system must provide facilities for easy dictionary creation and maintenance. The machine translation system must be able to run on operating systems compatible with those already used by companies (e.g. UNIX, Windows NT, Sun Solaris, and Linux).

2. Assimilation:

Here translation is made for occasional or personal users. The only need is to get the general meaning of the translated text.
In this case the output does not need to be edited, and poor quality can be acceptable.

3. Interchange:

A translation machine can be used to interchange information and ideas between people in different languages, using the telephone or through correspondence. In this case, again, the output of the system does not need to be very accurate or very close to the original text. The most important thing is that people are able to communicate, share ideas, understand the meaning of the messages they receive, and convey their intentions.

4. Database Access:

Here the machine translator is used to translate data and information from a database, such as the texts of web pages from the internet.

C. Fully automatic systems for dissemination

1. Post-editing MT output:

For texts intended for dissemination, post-editing is essential. As we have seen, post-editing consists of revising the output of the translation system and editing or correcting it where necessary. The output is edited when errors occur during translation. These errors arise because computers have difficulty understanding and dealing with many aspects of language (such as pronouns, coordination and prepositions), or because misspellings occur in the input texts despite the work of spell checkers. A misspelled word will not be recognized by the machine, which affects the translation of the rest of the sentence. Missing or misplaced punctuation also prevents the machine from recognizing the correct meaning of the sentence and consequently affects the translation negatively. Computer translation of complex sentences (long sentences of more than one clause) tends to contain errors and is less successful than the translation of simple sentences.

Post-editing depends largely on bilingual skills acquired over time. The advantage of translators as revisers is that they can maintain quality control. Consistency of terminology and of repetitive and familiar matter can be maintained by the machine translation system, but linguistic quality can only be maintained by a human translator. A simple example illustrates the role of human translators in post-editing the output to correct mistakes. Take the following Spanish sentence that requires translation:

el desarrollo de programas de educación nutricional...

The machine translation system might produce:

the development of programs of nutritional education

which a post-editor would correct as:

programs in nutritional education

English prefers "in" instead of "of" after this type of noun.
This post-editing is important if the translation output is to be published. If the text is for personal use, post-editing is not necessary and its cost can be avoided, because readers can still understand the meaning of the uncorrected output.

We have mentioned before the importance of pre-editing in machine translators that are intended for dissemination use. The amount of post-editing can vary according to the form and type of the input. The more standardized and the less creative the input texts are, the more accurate the output will be and the less correction through post-editing is required. Unfamiliar word combinations and sentence constructions result in poorly translated texts.

2. Preparing input for Machine Translator:

Mistakes resulting from translation can be corrected by post-editing, but they can also be avoided from the start by adequate preparation of the input texts. This preparation is based on writing documents in a language designed for a machine translation system. The process is called pre-editing, and it can be done at the following levels:

1- To indicate whether a particular word has a particular function.
2- To indicate whether a particular name is to be translated or not.
3- To indicate the boundaries of compound nouns. For example, we want "lightbulb" always to be translated into French as "ampoule" and not as "bulbe léger" or "oignon léger". To ensure this, the compound can be enclosed in brackets.
4- To insert punctuation. For example, in a sentence such as "There are he says two options" we may need to ensure that "he says" is treated as an embedded phrase. It could be enclosed in commas or bracketed, e.g. "There are, he says, two options" or "There are (he says) two options".
5- To split long sentences into shorter ones. This is done because experience shows that MT (machine translation) cannot deal well with long complex sentences. It is in effect a form of controlled language.

To perform pre-editing, knowledge of the target language is not necessary, but it helps to know that a particular phrase is likely to be misinterpreted in a specific target language.

3. Controlled language and Machine Translator:

A machine translation system can be adjusted to a specific domain in two possible ways:

1- Designing the system specifically for a particular sublanguage.
2- Restricting the vocabulary and the grammatical structures of the texts input to the system.

Each of these two ways of adjusting the system is called the controlled language design of a machine translator; it consists of adjusting the input text so that it is convenient for the machine to translate. Such a design enables input texts to be standardized and consistent, so that they are unambiguous for the machine translation system. Controlled language machine translators are intended to produce better quality output by minimizing the problems of lexical selection and structural change, so that the output may need little or no post-editing.

Examples of the kinds of rules to be followed by writers in controlled languages:

1- To use acceptable terminology. In the case of a car manufacturer, for example, there might be a rule specifying the use of "windscreen" rather than "windshield".
2- To use only approved senses of ambiguous words, e.g. "follow" should be used only in the sense of "come after" and not of "obey".
3- To avoid ambiguous words like "replace", since it can mean either "remove a component and put it back again" or "take the component away and put in a new one". Similarly, the word "appear" can mean "come into view", "be possible", or "show". The rule may state that "appear" should be avoided altogether.
4- To use one topic per sentence, e.g. one command or instruction at a time, which reduces the complexity of sentences.
5- Not to omit articles, to avoid ambiguity.
6- To avoid phrasal verbs such as "pour out", "put in", "look up", "look into", etc., since the machine's analysis of such verb forms often causes problems.
7- To use short sentences. Guidelines for controlled languages often specify a maximum of 20 words per sentence.
8- To avoid coordination, since it often causes problems in machine translation systems. For example, in the phrase "homeless women and children" the adjective probably modifies both nouns, but in "pregnant women and children" the adjective modifies only the first noun.
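Some of the rules above are mechanical enough to be checked automatically before a text is submitted for translation. The following is a minimal Python sketch of such a checker, not part of the project's code. The function names `check_sentence` and `check_text` are hypothetical, and the word lists contain only the examples quoted in the rules; a real controlled language would define much larger, company-specific lists.

```python
import re

# Hypothetical rule data, taken from the examples quoted in the rules above.
PREFERRED_TERMS = {"windshield": "windscreen"}   # non-approved -> approved term
AMBIGUOUS_WORDS = {"replace", "appear"}           # words the rules say to avoid
PHRASAL_VERBS = ("pour out", "put in", "look up", "look into")
MAX_WORDS = 20                                    # maximum sentence length


def check_sentence(sentence):
    """Return a list of controlled-language warnings for one sentence."""
    warnings = []
    lowered = sentence.lower()
    words = re.findall(r"[a-z']+", lowered)

    if len(words) > MAX_WORDS:
        warnings.append(f"too long: {len(words)} words (max {MAX_WORDS})")
    for word in words:
        if word in PREFERRED_TERMS:
            warnings.append(f"use '{PREFERRED_TERMS[word]}' instead of '{word}'")
        if word in AMBIGUOUS_WORDS:
            warnings.append(f"avoid ambiguous word '{word}'")
    for verb in PHRASAL_VERBS:
        # Match the phrasal verb as whole words only.
        if re.search(r"\b" + re.escape(verb) + r"\b", lowered):
            warnings.append(f"avoid phrasal verb '{verb}'")
    return warnings


def check_text(text):
    """Split text into sentences and map each flagged sentence to its warnings."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    report = {}
    for sentence in sentences:
        found = check_sentence(sentence)
        if found:
            report[sentence] = found
    return report
```

Calling `check_text` on a draft returns only the sentences that break a rule, so a writer can revise them before translation. Rules that need real linguistic analysis, such as detecting coordination or omitted articles, are deliberately left out of this sketch.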
As an illustration of the effects of controlling source language input, take the sentence:

After agitation, allow the solution to stand for one hour

The phrase "after agitation" does not indicate the object of the action. It is better to be more precise and to avoid the ambiguous "agitate" by rephrasing as:

If you shake the solution, do not use it for one hour.

Another example illustrates the rule about keeping to one topic (or command) per sentence. Instead of:

It is very important that you keep all of the engine parts clean and free of corrosion

the controlled language form could be:

Keep all of the engine parts clean. Do not let corrosion occur.

4. Lexical Resources:

All machine translators must have substantial dictionaries that are kept continuously up to date. In the past, building dictionaries for machine translators was a long and difficult task. Every entry had to be entered separately, and the information about each entry went well beyond what is found in printed dictionaries. The database of older machine translation systems had to include detailed information on grammatical categories, morphological variants (such as all possible noun and verb endings), syntactic contexts of word usage, and semantic constraints on word occurrences. All possible translation alternatives, and the circumstances in which each might be selected, also had to be added to the database. All this detail was required because computers were designed to operate according to precise lexical and grammatical rules that printed dictionaries never provided.

Nowadays, machine translation systems rely on much less detailed language rules than in the past. They operate on shallower syntactic information and rarely incorporate complex rules for avoiding ambiguity. Instead, there is much more use of information about word occurrences derived from text corpora, and it is now possible to extract information for compiling dictionary entries directly and automatically from bilingual corpora. Companies that require dictionaries covering many different subject areas find it difficult to obtain dictionaries with the coverage and size their domains need. Companies therefore usually prefer to create their own dictionaries for their machine translation systems, since they have their own peculiarities of terminological use.
The developers of a system can turn to electronic text corpora, such as those on the internet, from which they may be able to extract relevant lexical information. Once the dictionary has been acquired for a company system, the terms have to be validated and authorized. Lexical and terminological resources should be trawled regularly, with a continuous effort to create, maintain, and validate the system's lexical information. The dictionary should also be updated regularly, because vocabulary in all areas is constantly changing. Nowadays a large number of commercial software packages are available that perform terminology extraction and support terminology management.

There are a number of companies which develop customized machine translation systems with controlled language input, e.g. Smart Communications (New York) for many multinational organizations (Citicorp, Chase, Ford, General Electric, Canadian Ministry of Employment, etc.), Xplanation (Belgium) using the METAL system, VTT Information Technology (Finland) using the WebTranSmart system, and ESTeam (Greece), all covering a wide range of European languages. Systems covering the major Western European languages (English, French, German, Italian, Portuguese and Spanish) include Systran Enterprise, LogoMedia Enterprise, Reverso Intranet, SDL Enterprise Translator, and IBM WebSphere; all are client-server systems designed specifically for large company translation services. Another group of systems covers English and Far Eastern languages (Japanese, Korean, and Chinese), e.g. Fujitsu ATLAS, EWTranslate, IBM WebSphere, and Systran Enterprise. Other systems cover Arabic and English (AppTek's TranSphere) and Finnish and English (Kielikone's TranSmart).

We can now look at the different types of machine translation systems, as well as the different design methodologies and rules that can be followed, with the advantages and disadvantages of each.

D. Types of Machine Based Translation Systems

Much research on the design of machine translators has been carried out since the late 1980s. In all MT (Machine Translation) systems, the most important thing to look at is the process by which elements of the input SL (Source Language) text are converted into their equivalent elements for the output TL (Target Language) text, which should reflect the same meaning as the input. Translation is represented by three stages. The first stage is analysis, which precedes the second stage, conversion. The third stage, after conversion, is synthesis.
John Hutchins's article "Towards a definition of example-based machine translation" emphasizes three main approaches that have been employed for implementing translation: "Statistical Machine Translation (SMT)", "Rule-Based Machine Translation (RBMT)", and "Example-Based Machine Translation (EBMT)".

SMT is based on word frequency and word combination. It computes the probability that a given input string in the source language translates into a string of equivalent meaning in the target language. The SMT design consists of aligning sentences in a bilingual text. Individual words of the source language and the corresponding translated words of the target language texts are aligned and brought into correspondence. Translation then involves selecting the most probable target language words for each input word and determining the most probable sequence of those selected words in the target language. The SMT system is thus based on source-language/target-language frequencies derived from the aligned texts. In other words, the target language words are synthesized into meaningful strings which are intended to be as equivalent as possible to the input sentences. The basic units of SMT are therefore words. The analysis process in SMT consists of extracting each word individually from the input text and matching it against the individual words present in the entries of the system.

In RBMT, the process of conversion is based on the use of bilingual dictionaries and rules for converting SL (source language) structures into TL (target language) structures, or on using the dictionaries and rules to derive intermediary representations from which output can be generated. In the preceding analysis stage, the input SL strings are interpreted into appropriate translation units and relations. In the subsequent synthesis stage, TL texts are derived from the TL structures or representations produced by the conversion stage.

EBMT follows a translation approach based on the extraction and combination of phrases, or even short parts of texts (not individual words), of the source language from a large text, which serves as its database, to build texts in the target language with the same meaning. The basic units of EBMT are therefore phrases. EBMT appears to be the most developed and useful way of translation, and its design and implementation appear to be the most difficult, because it combines many translation techniques used in the SMT applications just discussed or in translation memories. The procedure on which EBMT is based is the following:
- The alignment of texts.
- The matching of input sentences against phrases (examples) from the stored database.
- The selection and extraction of equivalent target language (translated) phrases.
- The adaptation and combination of translated phrases as acceptable output sentences.

The operation of an EBMT system is mainly founded on finding or extracting examples of target language sentences that are analogous to the input source language sentences. The extraction of appropriate translated sentences is preceded by an analysis stage that decomposes the input sentences into appropriate fragments. The processes of analysis (decomposition) and synthesis (recombination) are designed, respectively, to prepare the input text for matching against the database and to produce the output text.
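The decomposition, matching, extraction and recombination steps described above can be sketched in a few lines of Python. This is a toy illustration only, not the project's implementation: the function name `translate_ebmt` is hypothetical, the example base is a small invented English-to-French table, matching is a simple greedy longest-match, and recombination is plain concatenation with no adaptation step.

```python
# Toy bilingual example base (English -> French); entries are illustrative only.
EXAMPLES = {
    ("good", "morning"): "bonjour",
    ("thank", "you"): "merci",
    ("my", "friend"): "mon ami",
}


def translate_ebmt(sentence, examples=EXAMPLES):
    """Greedy longest-match translation against stored example phrases."""
    tokens = sentence.lower().split()
    max_len = max(len(key) for key in examples)
    output = []
    i = 0
    while i < len(tokens):
        # Analysis (decomposition): try the longest fragment starting at i.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            fragment = tuple(tokens[i:i + size])
            if fragment in examples:
                # Matching and extraction of the stored target phrase.
                output.append(examples[fragment])
                i += size
                break
        else:
            # No example covers this token: keep it, marked as untranslated.
            output.append("<" + tokens[i] + ">")
            i += 1
    # Synthesis (recombination): here simply concatenation.
    return " ".join(output)
```

With this example base, `translate_ebmt("Good morning my friend")` is built from two stored phrases rather than four separate words, which is the sense in which the basic units of EBMT are phrases.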

In EBMT, rules such as those of RBMT (Rule-Based Machine Translation) can still be used, but only when no example of the source language text to be translated is found in the database. Because of this way of functioning, EBMT has been seen as a better way of translation than RBMT. The example-matching approach can work more successfully across more different types of languages. Consider translation from Japanese to English. A translation system that searches its database for words or sequences of words matching the Japanese input (sentences or phrases), and then looks up their equivalent English translation, would make fewer errors and treat collocations much better than a rule-based machine translator. This is because the structures of English and Japanese are very different from each other, and therefore their rules are very different as well.

Many studies and analyses have shown that, since EBMT is based on actual texts, its output translations are more readable and more sensitive to context than those of RBMT systems. EBMT leads to higher quality in idiomaticity and appropriateness. Improving an EBMT system is also much easier than improving an RBMT system: EBMT systems are improved by adding more examples to the database, while RBMT systems require the modification and addition of more complex rules and lexical entries, a procedure that may never be necessary in EBMT. Formulating and adding those rules is harder and more time consuming than simply adding example sentences or expressions. The rule-based approach has proven to be complicated, and rule-based machines have failed in many cases involving complicated structures.
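The idea of using rules only when no stored example matches can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the project's code: the names `rule_based` and `translate` are hypothetical, the data is a tiny invented English-to-French sample, and the rule-based part consists of a single reordering rule (adjective before noun becomes noun before adjective) followed by a dictionary lookup.

```python
# Toy data (English -> French); all entries are illustrative assumptions.
EXAMPLES = {"good morning": "bonjour"}   # stored example phrases
# Simplification: "the" is always mapped to "la", ignoring grammatical gender.
DICTIONARY = {"the": "la", "car": "voiture", "red": "rouge"}
ADJECTIVES = {"red"}
NOUNS = {"car"}


def rule_based(tokens):
    """Apply one structural rule (adjective noun -> noun adjective), then a dictionary lookup."""
    reordered = list(tokens)
    i = 0
    while i < len(reordered) - 1:
        if reordered[i] in ADJECTIVES and reordered[i + 1] in NOUNS:
            reordered[i], reordered[i + 1] = reordered[i + 1], reordered[i]
            i += 2
        else:
            i += 1
    # Unknown words are passed through unchanged.
    return " ".join(DICTIONARY.get(token, token) for token in reordered)


def translate(sentence):
    """Use a stored example when one matches; otherwise fall back to rules."""
    key = " ".join(sentence.lower().split())
    if key in EXAMPLES:
        return EXAMPLES[key], "example"
    return rule_based(key.split()), "rule"
```

The second element of the returned pair records which path produced the translation. Improving the example path only requires adding entries to `EXAMPLES`, whereas improving the rule path requires writing new rules and lexical categories, which mirrors the maintenance argument made in the text.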
At this point the EBMT is less prone to failure because studies have shown that it can deal with complex structural differences and subtle lexical choices in an easier way than the RBMT, and results are more accurate. In general, the argument in favor of EBMT is its potential to improve the generation of TL sentences. The essence in EBMT is therefore the matching of SL fragments (sentences, phrases, strings) from an input text against SL fragments in a database of bilingual example texts (in the form of strings, templates, tree representations) and the extraction of the equivalent TL fragments. It can be argued also that the operations of synthesis or recombination (perhaps the most difficult and complex in EBMT systems), are the consequence of the nature of the output from the matching/extraction process. Now let's concentrate on the main characteristics which make a Machine Translator (MT) called an EBMT. The fact that EBMT systems use a bilingual corpus and that the translation stems from examples is still a wide definition

because almost all MT systems use a bilingual corpus for translation, and the databases of SMT (statistical machine translation) systems are also derived from corpora of example texts. EBMT systems differ from SMT systems in one important respect: the example texts in EBMT systems are used at run-time, whereas the data used in SMT is derived in advance of the translation process. On this criterion SMT is excluded from EBMT.

A database of examples can be encountered in RBMT systems as well, as a source of data from which generalized rules and patterns can be derived. Therefore the use of a database of examples in an MT system is in itself no justification for labeling the system as EBMT. Any MT system has to deal with constructions that cannot be translated compositionally according to rules. Therefore even RBMT systems, which translate on the basis of rules, will sometimes avoid structural analysis; this type of system is called a lexicalist RBMT system. Given this similarity between linguistically-principled EBMT systems and transfer-based lexicalist RBMT systems, an important question arises: how are sentences decomposed during the matching process in EBMT, and how are they decomposed during analysis in lexicalist RBMT?

To deal with constructions that cannot be translated compositionally, an RBMT system needs access to a repository of non-monotonic contexts (examples). In such an RBMT system the repository is extracted (created) from dictionary or text sources. The RBMT system in this case can be compared to an EBMT system in which the repository used in operation is also extracted from the resource (the example database), a use of knowledge called "explicit knowledge". This means that the linguistic information used by EBMT is indistinguishable from the information used by lexicalist RBMT. In other EBMT architectures, where the example database is referred to directly while sentences are processed, we say that the repository is used as an "implicit knowledge" database.

We can deduce that an MT system is classified as EBMT and not RBMT only when it uses the original, full database of examples during translation. This procedure, which uses the entire examples, corresponds to the concept of "translation by analogy" and represents the most characteristic technique of EBMT. In other words, the only true EBMT systems are those where the information is not preprocessed but remains unanalyzed and available intact throughout the matching and extraction processes.

Advances in the MT field have produced SMT systems that are similar, to some extent, to EBMT systems: the phrase-based SMT systems. In phrase-based SMT systems, parsing or statistical parsing is performed to improve alignments, to facilitate the matching of input strings (rather than just individual words), and to allow input sentences to be analyzed as phrase structures and matched against parsed sentences in the database. Here we can therefore see a similarity between SMT and EBMT, where parsing is part of the pre-processing stage and also part of the analysis (decomposition) and matching stages, keeping in mind that most SMT processes remain word and string based. After this analysis we can see that phrase-based SMT systems and EBMT systems represent variants of a common framework, with the important difference that SMT works mainly on the basis of statistical methods while EBMT works mainly on the basis of linguistic or symbolic fragments, especially text examples.

The hypothesis behind the EBMT approach is that translation involves finding analogues (similar in meaning and form) of SL sentences in existing TL texts. Neither SMT nor RBMT, on the other hand, works on the basis of analogues: SMT uses statistically established word and phrase correspondences, and RBMT works with representations of sentences, clauses, and words of equivalent meaning. We can therefore conclude that EBMT makes use of both statistical (SMT-like) and symbolic or linguistic (RBMT-like) methods.

It is also useful to have an idea of the databases behind the translation systems mentioned above. Any MT system needs a bilingual dictionary of some kind, accompanied by a set of rules for handling differences in word order between SL and TL. The information needed by a translation system is stored in a database and may be derived from different resources such as bilingual and monolingual texts, bilingual and monolingual dictionaries, grammars, etc. In SMT, the database consists of a bilingual text corpus (aligned in order to correlate SL sentences and words with TL sentences and words), and the rules are replaced by information about the frequencies of correlations between SL words and TL words and of collocations of TL words in texts. In EBMT, the database consists of an aligned bilingual text corpus, which is the set of examples, and the rules are replaced by examples of TL strings in the text corpus.

In addition to the database content mentioned above, traditional bilingual dictionaries and monolingual thesauri can be used in SMT and EBMT systems. MT systems should also have access to the information necessary for decomposing (analyzing) and combining (generating) sentences. Research on languages has led to observations of pattern frequencies between and within languages and has made it possible to derive (explicitly, or implicitly and indirectly) the rules necessary for translation in RBMT systems. In EBMT and SMT, information about the well-formedness of sentences and strings is implicitly incorporated in the bilingual databases. This information is used implicitly in matching and conversion insofar as input strings have to conform to the practices of the SL; otherwise matches will not be found. Similarly, the information is used implicitly in the synthesis stages by reference to a monolingual language model (in SMT) and by extraction of well-formed TL fragments (in EBMT). We can deduce that knowledge about sentence formation is explicit in RBMT and implicit in EBMT and SMT.

It was mentioned earlier that EBMT systems translate according to a run-time procedure. In run-time systems, the full database of examples is accessible, and any required manipulation is performed during the matching and extraction processes. Such use of the database is essential to the basic operation of converting SL input into TL output.
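As a rough illustration of the run-time matching and extraction described in this section, the following PHP sketch decomposes an input sentence into the longest fragments found in a small example database and emits their stored translations. The function name `ebmtTranslate`, the array `$exampleDb` and its two fragment pairs are hypothetical, and recombination is reduced to simple concatenation, which is a strong simplification of what a real EBMT system has to do.

```php
<?php
// Toy example database: SL fragment => TL fragment (hypothetical pairs).
$exampleDb = [
    'this car'    => 'هذه السيارة',
    'is the best' => 'هي الأفضل',
];

// Greedy longest-match decomposition of the input against the example database.
function ebmtTranslate(string $sentence, array $examples): array
{
    $words = preg_split('/\s+/', strtolower(trim($sentence)), -1, PREG_SPLIT_NO_EMPTY);
    $out = [];
    $i = 0;
    $n = count($words);
    while ($i < $n) {
        $matched = false;
        // Try the longest remaining fragment first, then shorter ones.
        for ($len = $n - $i; $len >= 1; $len--) {
            $fragment = implode(' ', array_slice($words, $i, $len));
            if (isset($examples[$fragment])) {
                $out[] = $examples[$fragment];
                $i += $len;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // No example covers this word: mark it as untranslated.
            $out[] = '[' . $words[$i] . ']';
            $i++;
        }
    }
    return $out;
}

echo implode(' ', ebmtTranslate('This car is the best', $exampleDb)), "\n";
```

Because the example database is consulted directly for every input, adding a new pair to `$exampleDb` changes the output without touching the code, which is the maintenance advantage claimed for EBMT above.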

II- Design and Analysis

We are trying to do something that usually only humans do. We believe that our automated online translator needs to think like a human translator, and that with our programming skills and our minds we may be able to achieve that. To that end, we considered the following points to be of main importance:

1- A child starts by learning the alphabet, then words, sentences and grammar.
2- A translator is always in need of a dictionary, so the way a human searches for words is taken into consideration.
3- There are basic rules of translation that are taught in schools and universities; we are therefore reading books that teach translation to see how far we can teach a machine similar rules.
4- Translation is about meaning; in addition, automation will help speed up translation.
5- Do not expect to translate more than small paragraphs (4-5 lines).
6- Learn from the mistakes of others (compare with other online translators on the internet).
7- Google and AltaVista have not yet implemented an online Arabic translator, although it is a very important topic, which suggests that this is a very hard project.

A. Tools

The final automated translator is expected to be published on the internet, free to use; for testing purposes, however, we will develop it locally on our machine and install all the needed software to do so. We estimate that the translator will receive a large volume of traffic; therefore, in addition to a domain name for the website, web hosting should be provided on our own dedicated server. We have installed and configured the required software on the server:

- PHP (a free programming language)
- MySQL (a free open source database)
- Apache (a free web server)
- PHPMyAdmin (a free open source web-based database editor)

B. Coding

We have selected PHP as our main programming language. PHP is a widely used general-purpose scripting language that is especially suited for Web development and can be embedded into HTML. We have selected MySQL as our SQL database and will integrate PHP with SQL to process all queries. Using PHP and SQL we will get the input from the user, tokenize the words written, apply some rules by accessing the rules database, translate the words by accessing the dictionary translation database, apply Arabic rules to the translated text and display the output.

C. Interfaces

The translation system is divided into several stages; these stages might be extended or modified during implementation. The general overview of the translation process is described in the following figure:

Figure 2: Translation Process
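The stages listed under B. Coding (get the input, tokenize, apply English rules, look up the dictionary, apply Arabic rules, display) can be sketched as a small pipeline. This is only an outline under assumptions: the function names and the in-memory `$dictionary` array are hypothetical stand-ins for the MySQL tables, and the two rule stages are left as comments.

```php
<?php
// Split the user's text into lowercase word tokens.
function tokenize(string $text): array
{
    return preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
}

// Replace each token by its dictionary entry; unknown words pass through unchanged.
function translateWords(array $tokens, array $dictionary): array
{
    $out = [];
    foreach ($tokens as $token) {
        $out[] = $dictionary[$token] ?? $token;
    }
    return $out;
}

function translate(string $text, array $dictionary): string
{
    $tokens = tokenize($text);
    // English rules stage (reordering, annotation) would run here.
    $arabic = translateWords($tokens, $dictionary);
    // Arabic rules stage (agreement, spacing) would run here.
    return implode(' ', $arabic);
}

// Hypothetical two-word dictionary standing in for the database lookup.
$dictionary = ['book' => 'كتاب', 'friend' => 'صديق'];
echo translate('Book   friend', $dictionary), "\n";
```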

Figure 3: User Input Interface

Refer to Appendix A for index.php code

D. Database Design

Database design is one of the most challenging problems we are facing: the database is huge, since it contains an English dictionary, an Arabic dictionary and a set of rules related to English and Arabic. As we said before, we kept in mind throughout this project that, to make the translation system more accurate, we need to make it approach everything as humans do. Therefore, to design the database while taking into account the huge number of English words, we referred to one big dictionary [3] and counted the total number of pages reserved for each letter (in alphabetical order), which gave the following results:

Letter  Pages    Letter  Pages    Letter  Pages
A       60       J       10       S       137
B       60       K       9        T       60
C       104      L       38       U       16
D       54       M       55       V       17
E       31       N       19       W       39
F       42       O       25       X       2
G       30       P       96       Y       4
H       37       Q       6        Z       4
I       41       R       51

We considered the above statistics a helpful resource in designing the English words database, and divided the tables as follows:

1. AMM: Table constituting all words starting from A and the first 3 letters ending at AMM such as: abatic, abatis, abbacy, abbatial, abet.
2. AMZ: Table constituting all words starting from AMN and the first 3 letters ending at AMZ such as: anemographic, anemography, anemometer, anemometric, anemometrical, anemometry.
3. AZN: Table constituting all words starting from ANA and the first 3 letters ending at AZN such as: androgenetic, androgenic, androgenous, androgeny, androglossia, androgyne, androgynous, androgyny, android, andromeda.
4. AZZ: Table constituting all words starting from AZO and the first 3 letters ending at AZZ such as: Azimuth, azoic, azoth, azotic, Aztec, azymous, azure.

For letters with a small number of pages, such as the letter N, the corresponding tables would be as follows:

NA: Table constituting all words starting from NA and ending at NM
NN: Table constituting all words starting from NN and ending at NZ

For letters with a very small number of pages, such as X, the corresponding table would be as follows:

X: One table is enough to handle all the words, which occupy 2 pages.

We will proceed in the same way for all the letters, depending on the total number of pages counted in the dictionary. This database design should increase search speed: the automated translator identifies the first letter(s) of the word entered by the user and knows directly which table to search.

The following is a screenshot of the AMA table as viewed in PHPMyAdmin, containing the fields ID, EN, Genre and Rule.

Figure 4: AMA table design
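To show how the translator could "know directly where to search", here is a minimal PHP sketch that picks a table from the first three letters of a word. It assumes each table is named after the upper bound of its range, as in the AMM, AMZ, AZN, AZZ list above; the function name `tableForWord` and the idea of keeping the bounds in a sorted PHP array are assumptions, not the project's actual code.

```php
<?php
// Upper bounds of the tables for the letter "a", in alphabetical order
// (taken from the AMM / AMZ / AZN / AZZ split described above).
$aTables = ['amm', 'amz', 'azn', 'azz'];

// Return the first table whose upper bound is >= the word's 3-letter prefix.
function tableForWord(string $word, array $upperBounds): ?string
{
    $prefix = substr(strtolower($word), 0, 3);
    foreach ($upperBounds as $bound) {
        if (strcmp($prefix, $bound) <= 0) {
            return $bound;
        }
    }
    return null; // prefix lies beyond the last table for this letter
}

echo tableForWord('abet', $aTables), "\n";    // amm
echo tableForWord('android', $aTables), "\n"; // azn
echo tableForWord('azure', $aTables), "\n";   // azz
```

Only one table then has to be queried per word, which is the speed-up the design aims for.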

EN holds one of the English words, Genre holds the genre of the word (verb, adverb, etc.), and Rule holds a number that points to certain rows in the rule table. The rule table contains all the equivalent EN-AR translations, with additional hints/rules for each translation to help the translator know which Arabic translation to choose. For example:

Table: AMA
ID: 256
EN: ambition
Genre: Noun
Rule: 3200

Figure 5: Rule table design

Table: rule

ID: 3200   AR: طموح   Genre: N
ID: 3200   AR: مطمح   Genre: N
ID: 3200   AR: الطموح   Genre: N

Each of these rows also carries the following fields:

Rule_1: information that would help the translator select this row's Arabic translation.
Rule_2: further information that would help the translator select this row's Arabic translation.
Nmb: the number of rules available for this row's Arabic word, here 2 (Rule_1 & Rule_2).

We can add more rules to each row: the next rule gets the field Rule_3 and Nmb is increased by one. Thus, by referring to the rule table, the machine translator is given a set of options (Arabic word translations) with their respective rules, making it more probable that it selects the correct word.
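The link between a word table and the rule table can be sketched with two in-memory arrays. The arrays below are hypothetical stand-ins for the MySQL tables, holding only the example rows given above, and `candidateTranslations` is an illustrative name rather than a function from the project.

```php
<?php
// Stand-in for the word table "AMA" (one example row).
$ama = [
    ['ID' => 256, 'EN' => 'ambition', 'Genre' => 'Noun', 'Rule' => 3200],
];

// Stand-in for the rule table: several Arabic options share the same ID.
$ruleTable = [
    ['ID' => 3200, 'AR' => 'طموح', 'Genre' => 'N'],
    ['ID' => 3200, 'AR' => 'مطمح', 'Genre' => 'N'],
    ['ID' => 3200, 'AR' => 'الطموح', 'Genre' => 'N'],
];

// Collect every Arabic option whose ID equals the Rule value of the English word.
function candidateTranslations(string $word, array $wordTable, array $ruleTable): array
{
    $candidates = [];
    foreach ($wordTable as $entry) {
        if ($entry['EN'] !== strtolower($word)) {
            continue;
        }
        foreach ($ruleTable as $row) {
            if ($row['ID'] === $entry['Rule']) {
                $candidates[] = $row['AR'];
            }
        }
    }
    return $candidates;
}

print_r(candidateTranslations('ambition', $ama, $ruleTable));
```

Choosing among the returned candidates is then the job of the Rule_1, Rule_2, ... fields, which this sketch does not model.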

E. Rules

The rules are included within the English Intelligent-Rules and Arabic Intelligent-Rules steps, and they are divided as follows:

- Grammar Rules
- English-Arabic Rules
- Linguistic Rules
- Translation Rules

Rules are what add the overall meaning of the text to automated translation, so that our automated translator does not perform inaccurate word-by-word translation. We believe that knowing, processing and adding thousands of rules specific to the English and Arabic languages gives the automated online translator a higher probability of providing an accurate translation. What follows are some of the rules that we started generating through the analysis of many texts. In many cases we prefer general rules implemented in code, which increase speed by reducing search time, reduce the number of words in the database and provide the same translation results.

1) Example: He is smart. هو ذكي
Translate "he" into Arabic when it is followed by an auxiliary.
Example: He speaks English. يتكلم الإنجليزية
Omit the translation of "he" when it is not followed by an auxiliary.

2) Example: The book is the best friend. إن الكتاب أفضل صديق
book: كتاب
friend: صديق
best: الأفضل، أفضل
=> ("best" is translated as الأفضل when it comes at the end of a sentence, as in: This car is the best. هذه السيارة هي الأفضل)

3) Example: I heard the two boys speak in a low voice. سمعت الولدين يتكلمان بصوت منخفض
hear: يصغى، يسمع
heard: سمعت، إستمعت
two: إثنان
boy: صبي، غلام، ولد
boys: الولدين، الولدان، الأولاد
speak: يخاطب، يتكلم، يتحدث
He speaks: يخاطب، يتكلم، يتحدث
She speaks: تتكلم
=> (for "she" in the present, we take the same verb conjugated for "he", but instead of ي at the beginning we put ت.)
=> (in the past for "he", we take the whole verb as in the present except the first letter.)
Example: He spoke تكلم
=> (in the past tense for "she", we take the verb in the past for "he" and add ت at the end of the verb.)
Example: She spoke تكلمت
=> (in the past tense, for the pronoun "they" (plural masculine), we take the past of "he" and add وا at the end of the verb. In the feminine plural case the past tense is again formed by taking the past of "he", but ن should be added at the end of the verb.)
Example: they spoke (masculine) تكلموا
Example: they spoke (feminine) تكلمن

4) Example: The teacher praised both students who answered correctly. مدح المعلم كلا الطلاب الذين أجابوا بشكل صحيح
teacher: المعلم، المدرس
praise: ثناء، تمجيد، إطراء

=> (Above is the translation of "praise" as a noun, which applies when the word comes after an article, for example: a praise. When the word follows a noun, it should be analyzed as a verb and is translated in the form shown below.)
praise (to praise): مدح
both: كلا، كلتا، على حد سواء، معا
=> ("both" is translated as معا when it refers to the subject. Example: They ate both. They went both. "Both" here refers to "they". This is usually also the case when "both" ends the sentence.)
student: الطالب
who: التي، الذى، الذين، اللواتى، من
answer: جواب، يجيب
correct: صحيح
correctly: بشكل صحيح
=> ("correctly" is an adverb. Adverbs are recognized by "ly" at the end. This is translated in Arabic by adding بشكل or ب to the word in Arabic. Example: quick سريع, quickly بسرعة or بشكل سريع.)

Inversion rule:

Example: The student's book. كتاب الطالب
book: كتاب
student: طالب
In this case, we have to translate first the word book and then the word student because of the " 's " that follows the word student (possessive s).

*The book of the student. كتاب الطالب
In this case the words are translated into Arabic in the same sequence as they occur in the sentence. We apply this method when the noun, in this case "book", is followed by "of the".

One of the English-Arabic general rules is:

If a word does not appear in the database, then its English letters are translated to their equivalent Arabic letters, and the word is represented in terms of its equivalent Arabic letters only. For example: Mercedes car سيارة مرسيدس

Example: English letters and their equivalent Arabic letters:

English letter(s)            Arabic
a                            ا
b                            ب
c                            س
c in front of consonants     ك
d                            د، ض
e                            ء، ا، ي
f                            ف
g                            ج
h                            ه
ch, sh                       ش
th                           ث
i                            اي
j                            ج
k                            ك
l                            ل
m                            م
n                            ن
o                            او
p                            ب
q                            ك
r                            ر
s                            س
s between two vowels         ز
th                           ث
t                            ت
v                            ف
w                            وا، و
x                            اكس
y                            واي
z                            ز
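The verb-conjugation rules stated in example 3 of the Rules section (replace the leading ي with ت for "she" in the present, drop the first letter for the past of "he", then append ت, وا or ن) lend themselves to the kind of "general rule implemented by coding" the report prefers. The sketch below follows those stated rules literally; the function names are hypothetical, and the rules as given do not hold for every Arabic verb, so this only reproduces the report's example verb.

```php
<?php
// "She" in the present: same verb as for "he", with leading ي replaced by ت.
function presentShe(string $presentHe): string
{
    return (string) preg_replace('/^ي/u', 'ت', $presentHe);
}

// "He" in the past: the present form without its first letter.
function pastHe(string $presentHe): string
{
    return (string) preg_replace('/^./u', '', $presentHe);
}

// "She" in the past: past of "he" plus ت at the end.
function pastShe(string $presentHe): string
{
    return pastHe($presentHe) . 'ت';
}

// "They" in the past: past of "he" plus وا (masculine) or ن (feminine).
function pastThey(string $presentHe, bool $feminine = false): string
{
    return pastHe($presentHe) . ($feminine ? 'ن' : 'وا');
}

$verb = 'يتكلم'; // "he speaks"
echo presentShe($verb), "\n";     // she speaks
echo pastHe($verb), "\n";         // he spoke
echo pastShe($verb), "\n";        // she spoke
echo pastThey($verb), "\n";       // they spoke (masculine)
echo pastThey($verb, true), "\n"; // they spoke (feminine)
```

The `u` modifier makes the regular expressions operate on whole UTF-8 characters, so a single Arabic letter is removed or replaced rather than a single byte.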

III- Implementation

A. PHP, MySQL & Apache

Our automated translator is web-based software; as previously mentioned, all the requirements for this language and database have to be installed. We therefore went through a long procedure to install PHP, MySQL, PHPMyAdmin and Apache. The whole process for installing the needed software is given in Appendix C.

B. Database

The project is essentially based on the design of the database. After we completed the design of the database, implementing it was not easy at all. As we said earlier, we used an English-Arabic dictionary to estimate the number of words for each letter of the English alphabet. However, we noticed that it is not enough to divide the database tables according to the first letter of the words; we therefore divided the tables by the first 2 or 3 characters, depending on the number of words starting with these characters. Our design for dictionary words was thus as follows.

Take for example the letter b. We counted how many words in English start with "ba", then with "bb", then "bc", and continued in the same manner until we reached the end of the "b" section in the dictionary. Upon finishing the count, we knew approximately how much space to reserve in the database for words starting with "ba", "bb", "bc" up to "bz". The "Excel" sheet on the following page shows the way we organized and divided all the letters. We placed all the letters of the alphabet once in a column and once in a row, so that each letter of the alphabet is followed by every other possible letter, thus covering all possible combinations. For example, the letter "a" in a word can be followed by "a", "b", "c" or any other letter of the alphabet, so in the "Excel" sheet we placed "A", "B", "C", etc. in the column and "a", "b", "c", etc. in the row.

From the English-Arabic dictionary we counted how many pages there are for words starting with "aa", "ab", "ac", etc., and placed each count in its corresponding cell. For example, in cell "Aa" we placed the number of pages in the dictionary that contain words starting with "aa", and we did the same for all other cells. From the number of pages in each cell we knew the number of tables to reserve in our database for each group of words. For words starting with "aa", for example, the number in the "aa" cell showed that one table in the database would be enough. The table is called "aa", and all words starting with "aa" are added to it. This procedure is applicable to all letters of the alphabet and to all words we may encounter.

Cells with a large number of pages, 6 pages and above, had to be divided into 2 or 3 cells. Take for example words starting with "bi": the dictionary has 6.5 pages of them. To make the search faster, we decided to create two tables in the database for these words: one named "bia", for words starting with "bia" through "bil", and another named "bim", for words starting with "bim" through "biz". With this design, the search for a word starting with "bi" is directed to either table "bia" or table "bim", each of which contains fewer words than a single "bi" table would, which gives quicker results.

In the Excel table, some cells contain a number less than 1, such as 0.1 or 0.5. An example is the cell named "bl", which contains 0.1. Because the number of such words is small, we found it better to put the words of all cells with numbers less than 1 into a single table. For example, for the letter "b", cells "bc", "bd", "bh" and "by" all contain numbers less than 1, meaning that few words in the dictionary start with "bc", "bd", "bh" or "by"; we therefore put such words in one table. For the letter "b" this table is named "bzz" in the database and is represented by cell "bzz" in the Excel table.
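The partitioning decision described above (one table per two-letter cell, a split for cells of 6 pages or more, and one shared table for cells under 1 page) can be written down as a small planning function. This is a sketch under assumptions: `planTables` is a hypothetical name, large cells are always split in two at the third letter "m" (mirroring the "bia"/"bim" example, although the report also allows three-way splits), and apart from "bi" = 6.5 and "bl" = 0.1 the page counts are made up for illustration.

```php
<?php
// Decide which tables to create for one letter from its per-prefix page counts.
function planTables(string $letter, array $pagesByPrefix): array
{
    $tables = [];
    $hasSmallCells = false;
    foreach ($pagesByPrefix as $prefix => $pages) {
        if ($pages < 1) {
            // Too few words: these prefixes share one catch-all table.
            $hasSmallCells = true;
        } elseif ($pages >= 6) {
            // Large cell: split by third letter, a-l and m-z.
            $tables[] = $prefix . 'a';
            $tables[] = $prefix . 'm';
        } else {
            $tables[] = $prefix;
        }
    }
    if ($hasSmallCells) {
        $tables[] = $letter . 'zz';
    }
    return $tables;
}

// "bi" and "bl" use the counts quoted in the text; "ba" and "bc" are illustrative.
$pagesForB = ['ba' => 5.0, 'bc' => 0.2, 'bi' => 6.5, 'bl' => 0.1];
print_r(planTables('b', $pagesForB));
```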

For every cell in the Microsoft Excel table we then created a corresponding table in the MySQL database with the same name as that cell and added the appropriate fields (Genre, Rule, etc.). A sample of the code for creating the first 3 tables in the database can be seen in Appendix B. This process is illustrated in the figure below: the table named bea corresponds to cell bea in the Excel table, with all the necessary fields added to it.

Figure 6: Sample table bea

So far, we have completed the design and implementation of the tables that hold all the English words of the dictionary. To translate those English words, another 3 important tables, which we named "zzzrule", "zzzen2ar" and "zzzconfig", were also added to our database. For translation to take place, each English word in its table had to be linked to its appropriate translation in the zzzrule table. This link is made through the Rule field of all the English word tables. As an example, for the word bed in table bea above, the Rule field contains the number 79, which means that bed was the 79th English word inserted into the database. This Rule field is also a link to the zzzrule table, where the Arabic translation of bed is stored in a row whose ID field also contains the number 79. The figure of the zzzrule table below illustrates the link between the Rule and ID fields and shows the Arabic translation of the word bed. Some rows of the zzzrule table are shown below:

Figure 7: some rows of table zzzrule

Description of the different fields that exist in the zzzrule table:

ID: field linked to a specific English word whose Rule field has the same value as ID.
AR: field that contains the Arabic translation/word for the specified ID.
Genre: the genre of the Arabic word.
followed_by_word: this field, if used, helps in choosing the better Arabic translation. The translation of some English words into Arabic may differ depending on the words that follow them. For example, He is translated differently according to the word that follows it.

Knowing the different cases we may face with the word He, our code should be designed to check the word that comes after He using the followed_by_word field. The program checks the different contents of the followed_by_word field stored in the database for the word He, and the translation is performed according to the one that matches our input. Example:

He plays: He is followed by a verb, and the translation will be: يلعب
He is smart: He is followed by is, and He is is translated altogether as هو, to get: هو ذكي

Figure 8: some rows of table zzzrule

followed_by_type: this field, if used, helps in choosing the better Arabic translation. The Arabic translation of some English words changes according to the type of the word that follows. For example, To is translated differently according to the word that follows it. Knowing the different cases we may face with the word To, our code should be designed to check the type of the word that comes after To using the followed_by_type field. The program checks the different contents of the followed_by_type field stored in table zzzrule for the word To, and the translation is performed according to the one that matches our input.

Example: He went to the bank to collect money.

When checking the zzzrule table to translate the word to, the code checks the followed_by_type field. The first to is followed by the, whose type is stored as prep (which stands for preposition); therefore the Arabic translation of to in the zzzrule row whose followed_by_type field contains prep is selected. In this case the translation will be الى. For the second to in the sentence, the Arabic translation in the row whose followed_by_type field contains the letter v (for verb) is selected. In this case the translation will be ل. The final translation of the sentence will be:

He went to the bank to collect money. ذهب الى المصرف لجمع المال

Omit_space: some words require no space between the current word and the following word. For example ال, which is the Arabic translation of The, has to stick to the word that follows.
Add_at_index: this field is used when the order of the English words has to change during translation, for example when a word has to be removed from the first position and placed 3 words further, at index 4, for the translation to convey the meaning.
Nmb: this is an indicator that directs the translator to the field (followed_by_word, followed_by_type, add_at_index or None) that is active and thus needs to be processed.
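The context-dependent choice described for He and To can be sketched as a lookup over the candidate rows of zzzrule. The rows below are an in-memory stand-in containing only the two translations of "to" given in the text; the function name `selectTranslation` and the choice to test followed_by_word before followed_by_type are assumptions, since the report does not say how the Nmb indicator encodes the active field.

```php
<?php
// Pick the Arabic option whose context field matches the word that follows.
function selectTranslation(array $rows, string $nextWord, string $nextType): ?string
{
    $nextWord = strtolower($nextWord);
    // First pass: an exact match on the following word.
    foreach ($rows as $row) {
        if ($row['followed_by_word'] !== '' && $row['followed_by_word'] === $nextWord) {
            return $row['AR'];
        }
    }
    // Second pass: a match on the type of the following word.
    foreach ($rows as $row) {
        if ($row['followed_by_type'] !== '' && $row['followed_by_type'] === $nextType) {
            return $row['AR'];
        }
    }
    return null; // no row applies to this context
}

// Candidate rows for "to", as described in the text.
$toRows = [
    ['AR' => 'الى', 'followed_by_word' => '', 'followed_by_type' => 'prep'],
    ['AR' => 'ل',   'followed_by_word' => '', 'followed_by_type' => 'v'],
];

echo selectTranslation($toRows, 'the', 'prep'), "\n";  // "to the bank"
echo selectTranslation($toRows, 'collect', 'v'), "\n"; // "to collect"
```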

Some rows of the zzzen2ar table are shown below:

Figure 9: some rows of table zzzen2ar

General Rule: Whenever the translator reaches a word that is not found in the database, the word needs to be translated as it is pronounced. To achieve this, a character-by-character conversion from English to Arabic has to be done. The table responsible for this conversion is the zzzen2ar table shown in Figure 9. It contains all the letters of the English alphabet and their equivalent Arabic letters. Moreover, we have added two fields that help the translator better reach the Arabic word equivalent in pronunciation: followed_1_char & followed_2_chars.

This character-by-character translation works as follows. Assume we come across the word newjersey, which is not in the database. The first character of the word, n, is selected and the program looks for that character in the zzzen2ar table. Once the rows for the character n are found, the followed_1_char and followed_2_chars fields are checked for a match. Here n is followed by ew, and in Figure 9 we can see a match with row 35, where the character is n and followed_2_chars is ew; the translator thus returns نيو and automatically jumps to the letter j after new to proceed. In this translation we therefore always look ahead at the following 1 or 2 characters in the English word.
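The look-ahead procedure just described can be sketched as follows. The three rows are a hypothetical excerpt of zzzen2ar: the n + ew row comes from the text, the plain n and j rows follow the letter table in the Rules section, and the field name `char` is assumed. The function tries a two-character context first, then a one-character context, then the plain letter, and skips over any characters consumed by the match.

```php
<?php
// Hypothetical excerpt of the zzzen2ar table.
$zzzen2ar = [
    ['char' => 'n', 'followed_1_char' => '', 'followed_2_chars' => 'ew', 'ar' => 'نيو'],
    ['char' => 'n', 'followed_1_char' => '', 'followed_2_chars' => '',   'ar' => 'ن'],
    ['char' => 'j', 'followed_1_char' => '', 'followed_2_chars' => '',   'ar' => 'ج'],
];

// Does this row apply to character $ch followed by the context $ctx (0, 1 or 2 chars)?
function rowMatches(array $row, string $ch, string $ctx): bool
{
    if ($row['char'] !== $ch) {
        return false;
    }
    switch (strlen($ctx)) {
        case 2:
            return $row['followed_2_chars'] === $ctx;
        case 1:
            return $row['followed_1_char'] === $ctx;
        default:
            return $row['followed_1_char'] === '' && $row['followed_2_chars'] === '';
    }
}

function transliterate(string $word, array $rows): string
{
    $word = strtolower($word);
    $out = '';
    $i = 0;
    while ($i < strlen($word)) {
        $ch = $word[$i];
        foreach ([2, 1, 0] as $n) {          // longest context first
            $ctx = (string) substr($word, $i + 1, $n);
            if (strlen($ctx) !== $n) {
                continue;                     // not enough characters left
            }
            foreach ($rows as $row) {
                if (rowMatches($row, $ch, $ctx)) {
                    $out .= $row['ar'];
                    $i += 1 + $n;             // jump past the consumed context
                    continue 3;
                }
            }
        }
        $out .= $ch;                          // no row found: keep the letter as is
        $i++;
    }
    return $out;
}

echo transliterate('newj', $zzzen2ar), "\n";
```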

Figure 10: sample translation 1

What follows is the Excel sheet that we produced by going through all the pages of the English dictionary and deciding on all the table names that should exist in the database, based on the number of pages per character combination.


C. Main Page

The second step following the design of the database was creating the main page form; this form is responsible for getting the input of the user, processing the translation and then providing the output to the user. In the Coding section below we provide more information on how the process is done and the steps we worked through to reach the translation. The main web interface of the automated online translator looks like this:

Figure 11: Main Page

The main page includes a text area for writing the English text (left-to-right direction). After the translate button is clicked, a new text area (right-to-left) appears showing the Arabic translation of the English text, and under this text area we can see the translation process time. The main page with full features is shown hereafter.

Figure 12: sample translation 2

D. Text Selection / Creating Rules

We selected a certain number of sentences written in English together with their Arabic translations. The selection was made from an English to Arabic translation book. These sentences were chosen according to the rules found inside them. As a start, our aim was to find simple sentences, not very hard to translate, that contain Arabic and English rules that are easy to implement. Short and easy sentences facilitate the first steps in implementing the code that translates English sentences into Arabic, and they also allow us to test our code and database relations.

This work of retrieving translation sentences was followed by analyzing and interpreting the rules in the sentences. Starting from a simple code structure that translates simple rules, we could then develop much larger and more complicated code capable of translating more complex rules and longer sentences. We escalated our selection of phrases according to the rules each sentence contained. Climbing the stairs one step at a time was very rewarding: we started with sentences that could be translated with no rules, then moved to sentences that need one rule, and then proceeded to sentences with multiple rules.

The Apostrophe 's Rule: The translator fills the words into an array and scans them for an 's. If the apostrophe is found, the translator reorders the words of the English sentence before translating them to Arabic. For example, the possessive 's of "Hicham's book" activates the inversion rule, and the translation will be "كتاب هشام":

1- We detect 's and replace it by the word appossss in the array.
2- The array now contains [Hicham][appossss][book].
3- In the code shown below, we store Hicham in a temporary variable, put book at the index of [Hicham], move Hicham to the last index, and delete the [appossss] element.
4- The array now contains [book][Hicham] only; after translation this becomes "كتاب هشام".

*******
$del = 0;
for ($x = 0; $x < sizeof($arrsrc); $x++) {
    //echo $arrsrc[$x];
    //echo "<br>";
    if ($arrsrc[$x] == "her" or $arrsrc[$x] == "his") {
        // Possessive pronoun: swap it with the word that follows it
        $temp = $arrsrc[$x];
        $arrsrc[$x] = $arrsrc[$x+1];
        $arrsrc[$x+1] = $temp;
        $x++;
    }
    else if ($arrsrc[$x] == "appossss") {
        $temp = $arrsrc[$x-1];
        if ($x >= 2 && $arrsrc[$x-2] == "the") {
            // [the][owner][appossss][owned] becomes [owned][the][owner]
            $arrsrc[$x-2] = $arrsrc[$x+1];
            $arrsrc[$x] = $arrsrc[$x-1];
            $arrsrc[$x-1] = "the";
            $del = $x+1;
        }
        else {
            // [owner][appossss][owned] becomes [owned][owner]
            $arrsrc[$x-1] = $arrsrc[$x+1];
            $arrsrc[$x+1] = $temp;
            $del = $x;
        }
    }
}
if ($del != 0) {
    // Remove the slot that is left over after reordering
    $arrsrc = del_element ($arrsrc, $del);
}
*******

E. Add words / inserting to Database

After selecting the sentences to translate, we had to extract the words from those sentences and insert each of them, with its own Arabic translation, into the database. We used an English to Arabic dictionary to find all the possible Arabic translations of the English words. To add words and their translations to the database, we developed a special form. The English word is entered in the space provided for the "English Word" option. The genre of the word is selected from the "Select English Genre" option. The Arabic translation of the word, with its genre, is also added to the database. Figure 13 below shows the procedure of entering an English word into the database with all the information that goes with it.

A table called "zzzconfig" was designed, which contains the total number of words we have in the database. When a word is inserted into its appropriate table in the database, the "Rule_Index" field in "zzzconfig" is incremented by 1 and its value is given to the "Rule" field of the newly inserted row.
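The counter mechanism just described can be illustrated with a small sketch. It uses plain PHP arrays in place of the MySQL tables, and the function name add_word, the column names English_Word, Genre and Arabic, and the idea that the Arabic text is stored in zzzrule under the same Rule value are all assumptions made for the example, not the project's actual schema.

```php
<?php
// Insert a word and link it to its Arabic translation through a shared Rule number.
// $config stands in for the zzzconfig table, $tables for the word tables and zzzrule.
function add_word(array &$config, array &$tables, string $table,
                  string $english, string $genre, string $arabic): int
{
    // Increment the global counter first, then reuse its new value as the Rule id.
    $config['Rule_Index']++;
    $rule = $config['Rule_Index'];

    // Row in the English word table (hypothetical column names).
    $tables[$table][] = ['English_Word' => $english, 'Genre' => $genre, 'Rule' => $rule];

    // Assumed: the Arabic text is kept in zzzrule under the same Rule value.
    $tables['zzzrule'][] = ['Rule' => $rule, 'Arabic' => $arabic];

    return $rule;
}

$zzzconfig = ['Rule_Index' => 0];
$db = [];
add_word($zzzconfig, $db, 'bea', 'beautiful', 'adjective', 'جميل');
echo $zzzconfig['Rule_Index'], "\n"; // 1
```

In a real MySQL setup the increment and the inserts would normally be wrapped in a transaction so that two simultaneous submissions cannot receive the same Rule value.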

Figure 13: Add words Input Form

Overall, more than 180 English words were added to the database.

F. Coding translation

In order to be able to translate, we needed to program different sections; the first task was to program the main page. The main page is the central page, where all the processing takes place. To let the main page interact with the database we created 2 files: db_config.php, which contains the database configuration information, and sql_connect.php, which is used to connect to the "sql" database after processing the configuration file (db_config.php).

How does it work?

In the main page we start by including the sql_connect.php file, since the main page is all about communicating with the database. First we take the English input that the user enters and fill the English words, delimited by spaces (tokenized), into an array; then the translation process starts. At first we looked up the text to be translated word by word in the appropriate English table, got the Arabic meaning from the zzzrule table and output it. This process ensured that we are going word by word and that the words are found in the database. Most of the time this step does not give the real meaning of the sentences; it is just translating word for word.

We have created a major function, get_table_name ($string1, $string2, $string3). We send to this function the first 3 characters, the first 2 characters and the first character of the word to translate, and based on these arguments it returns the name of the table that holds the word. We programmed this function to follow the database structure we showed earlier in the Excel sheet.

****
$strlen = strlen ($var); // get word length
$firstchars3 = substr ($var, 0, 3); // get first 3 chars
$firstchars2 = substr ($var, 0, 2); // get first 2 chars
$firstchars1 = substr ($var, 0, 1); // get first char
// the code of get_table_name is about 200 lines of code.
****

And this was only the beginning.
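Since the full get_table_name code is not reproduced here, the following is only a compact sketch of the idea under stated assumptions: the KNOWN_TABLES list is a hypothetical subset (only bea appears in this report as a sample table), the helper table_for_word is invented for the example, and choosing the longest matching prefix is an assumed strategy rather than the project's roughly 200 lines of logic.

```php
<?php
// Hypothetical subset of table names; the real list follows the Excel sheet.
const KNOWN_TABLES = ['bea', 'be', 'b', 'th', 't', 'h'];

// Return the table for a word given its first 3, 2 and 1 characters,
// preferring the longest prefix that names an existing table.
function get_table_name(string $string1, string $string2, string $string3): ?string
{
    foreach ([$string1, $string2, $string3] as $prefix) {
        if (in_array($prefix, KNOWN_TABLES, true)) {
            return $prefix;
        }
    }
    return null; // no table covers this word
}

// Convenience wrapper (hypothetical): derive the three prefixes from the word.
function table_for_word(string $var): ?string
{
    $var = strtolower($var);
    return get_table_name(substr($var, 0, 3), substr($var, 0, 2), substr($var, 0, 1));
}

// Tokenize the input on spaces, then resolve a table for each word.
foreach (explode(' ', 'The beautiful book') as $word) {
    echo $word, ' -> ', table_for_word($word) ?? '(none)', "\n";
}
```

A lookup list like this keeps the mapping in data rather than in branching code, which may be easier to keep in step with the spreadsheet.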

In order to reach the real meaning and sentence structure of the input, we started to implement rules. The first rule we worked out covers The in a sentence: in Arabic it should be ال and it should be attached to the following word (no space in between). To achieve this we made use of the omit_space field and set it to 1.

Figure 14: omit_space

An example of the result can be seen as follows:

Figure 15: sample translation 2
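One way the omit_space flag could be applied when assembling the output is sketched below. The token layout (an arabic string plus an omit_space value per translated word) and the function name join_arabic are assumptions for illustration; only the field name omit_space and the value 1 come from the description above.

```php
<?php
// Join translated tokens, skipping the space after any token whose omit_space is 1.
function join_arabic(array $tokens): string
{
    $tokens = array_values($tokens); // make sure keys run 0..n-1
    $last = count($tokens) - 1;
    $out = '';
    foreach ($tokens as $i => $token) {
        $out .= $token['arabic'];
        // Add a separating space unless this is the last token or the flag is set.
        if ($i !== $last && (int) $token['omit_space'] !== 1) {
            $out .= ' ';
        }
    }
    return $out;
}

// "The book": ال carries omit_space = 1, so it attaches to the next word.
echo join_arabic([
    ['arabic' => 'ال',   'omit_space' => 1],
    ['arabic' => 'كتاب', 'omit_space' => 0],
]), "\n"; // الكتاب
```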

Figure 16: sample translation 3

Other rules make use of the followed_by_word and followed_by_type fields. For example, if He is followed by is, then we should return the following (as we have discussed before):

Figure 17: sample database zzzrule

Figure 18: sample translation 4

If, for example, He is not followed by "is" in this rule, then we do not return هو. In the case of "He was" the output should not be هو كان; instead He was should return only:

Figure 19: sample translation 5
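A rough sketch of how a followed_by_word lookup might behave is given below. The rule rows are hypothetical stand-ins for zzzrule (the key names english and arabic, the empty translation for is, and the fallback row with an empty followed_by_word are all assumptions), and the followed_by_type field is left out for brevity.

```php
<?php
// Hypothetical rule rows; an empty followed_by_word marks the default row for a word.
$rules = [
    ['english' => 'he',  'followed_by_word' => 'is', 'arabic' => 'هو'],
    ['english' => 'he',  'followed_by_word' => '',   'arabic' => ''],    // pronoun dropped
    ['english' => 'is',  'followed_by_word' => '',   'arabic' => ''],    // assumed: no copula
    ['english' => 'was', 'followed_by_word' => '',   'arabic' => 'كان'],
];

// Prefer a row whose followed_by_word equals the next word; else use the default row.
function pick_rule(array $rules, string $word, string $nextWord): ?array
{
    $fallback = null;
    foreach ($rules as $rule) {
        if ($rule['english'] !== $word) {
            continue;
        }
        if ($nextWord !== '' && $rule['followed_by_word'] === $nextWord) {
            return $rule;
        }
        if ($rule['followed_by_word'] === '') {
            $fallback = $rule;
        }
    }
    return $fallback;
}

// Translate a tokenized sentence, skipping words whose chosen rule yields nothing.
function translate_words(array $words, array $rules): string
{
    $words = array_values($words);
    $parts = [];
    foreach ($words as $i => $word) {
        $next = strtolower($words[$i + 1] ?? '');
        $rule = pick_rule($rules, strtolower($word), $next);
        if ($rule !== null && $rule['arabic'] !== '') {
            $parts[] = $rule['arabic'];
        }
    }
    return implode(' ', $parts);
}

echo translate_words(explode(' ', 'He was'), $rules), "\n"; // كان
```

With these sample rows, He followed by is yields هو, while He followed by was contributes nothing and only كان is returned.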

Thus, we found ourselves able to transform each rule and sentence we worked on to its Arabic form. However, the coding was not easy at all: programming an online translator has to be a joint effort between the different table fields and PHP programming techniques. Still, we believe that we can reach a fully working automated online translator based on what we have already done. The Main page code contains a lot more programming, including the solution to the possessive 's rule and others.

Add words Coding

We also created a form for inserting English words with their appropriate Arabic translations into the database; it is the form shown in Figure 13. The form is based on 2 files, addwords.php and Addwords_add_check.php. Addwords.php contains the code responsible for displaying the form (the input of the user). Once the form is submitted, Addwords_add_check.php takes the input from the first form and processes it; accordingly, it either enters the word into the database while displaying "Some word Was added successfully to the database." or rejects the user action while displaying the error:

Figure 20: Add words form - error control