Using Web Search for Machine Translation Nicolas Wehmeier BSc Computing and German 2003/2004


The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism. (Signature of student)

Summary

Machine translation is the process of using a computer to translate text from one language into another. This project explores and evaluates the use of the World Wide Web as a resource for example-based machine translation tasks. The aim of this project is to create a translation tool that uses the World Wide Web as a source of linguistic examples to translate adjective-noun phrases from English into German. An evaluation process carried out by native speakers of German judges the effectiveness of the tool.

Acknowledgements

Firstly, I would like to thank Ann Lawson at Oxford University Press for kindly allowing me the use of the Oxford-Duden German dictionary, without which the project could not have been completed. Thanks also go to Haiko Müller, Thomas Schneider and Jürgen Wehmeier for being my native speakers of German and carrying out the translation evaluations. Finally, and most importantly, I would like to thank Katja Markert for all of her help and guidance throughout the project.

Contents

Summary
Acknowledgements
Contents
1. Introduction
    1.1 Aims and introduction
    1.2 Minimum requirements
2. Background Research
    2.1 Machine Translation
        2.1.1 Evaluation of machine translation
        2.1.2 The Web for machine translation
    2.2 Collocations
        2.2.1 Collocation definitions
        2.2.2 Collocational criteria
        2.2.3 Selecting collocations from corpora
    2.3 Tools and resources
        2.3.1 British National Corpus
        2.3.2 Gsearch
        2.3.3 Unix core tools
        2.3.4 Bilingual dictionary
        2.3.5 CELEX lexical database
        2.3.6 Regular Expressions
3. Methodology and Project Management
    3.1 Objectives
    3.2 Project Management
4. Implementation of adjective-noun translation tool
    4.1 Generation of candidate translations
        4.1.1 Searching a dictionary
        4.1.2 Handling agreement
    4.2 Translation selection using Web search
5. Collocation Finder
    5.1 Extraction of adjective-noun phrases from a corpus
    5.2 Selecting collocations by frequency
6. Evaluation
    6.1 Translation of the adjective-noun phrases
    Evaluation procedure
    Designing the evaluation
    Evaluation results and interpretation
    Fluency
    Fidelity
    Agreement and disagreement between evaluators
7. Conclusions
    Summary
    Comparison to similar research
    Possibilities for improvement
Bibliography
Appendices
    A Personal Reflection
    B Example of the translation procedure
    C Example of dictionary entries and output from CANTRAN
    D Example output from INFLECT
    E Adjective-noun phrases
    F Translations and number of hits from Web search
    G Evaluation form and results

1 Introduction

1.1 Aims and Introduction

This project aims to investigate the use of the World Wide Web as a resource for machine translation. An example-based machine translation tool is to be developed that exploits the huge amount of information contained on the World Wide Web. For this project the tool will be developed specifically to translate adjective-noun combinations.

As an example to highlight the size of the Web, compare the counts for the following randomly selected adjective-noun phrases as found on the Web with their frequencies in the British National Corpus (BNC), a large corpus available in the School of Computing. The Google column in Figure 1.1 gives the number of results returned for each phrase by the WWW search engine Google. The BNC column gives the frequency of the same phrase in the BNC.

Phrase                 Google     BNC
old gelding            19,100     2
addictive behaviour    9,920      3
accurate reflection    81,
easy access            3,780,
positive feedback      717,

Figure 1.1 Table showing counts of random adjective-noun phrases found in the British National Corpus and on the World Wide Web using the Google search engine in April

As can be seen, in each case the number of hits returned from the Web search is far greater than the number of occurrences in the BNC, showing the potential of the Web as a resource for obtaining linguistic examples. This project aims to see whether these vast numbers of examples can be exploited and applied as a basis for machine translation tasks.

The translation process has two stages. Firstly, the tool will generate candidate translations of an adjective-noun phrase. This is achieved by translating each individual word in the phrase with the use of an existing dictionary. Combining all of the adjective translations with all of the noun translations forms candidate translations for the adjective-noun phrase.

Secondly, the candidate translations are submitted to a World Wide Web search engine and the translation is selected based on the number of results from the search. It is assumed that the phrase which receives the highest number of hits is the most common one and, therefore, most likely to be the best translation. An example of the overall procedure, simplified and carried out by hand prior to the development of the translation tool, can be seen in Appendix B.

Although the design of the system will allow it to be used to provide potential translations for all adjective-noun phrases, a further aim of the project is to evaluate how the tool behaves when applied specifically to collocational phrases. Therefore, a method of selecting a number of collocational phrases to be translated will be required. Additionally, some phrases that are potentially less collocational will also be translated, as a means of comparison.

1.2 Minimum requirements

The following are, therefore, the minimum requirements of the project:

- A simple frequency-based finder for adjective-noun collocations in English corpora
- Use of machine-readable dictionaries to obtain translations for English adjective-noun combinations into German
- A program that uses the Google API to choose between several translations based on simple frequency
- Statistical evaluation of the whole procedure based on a guided evaluation session to be completed independently by two native speakers of German
- A comparison to the approaches used in literature
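The two-stage procedure described in section 1.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the tool itself: the function names, the example German phrases and the hit counts are all invented for the example, and a plain dictionary stands in for the search engine.

```python
from itertools import product


def candidate_translations(adjectives, nouns):
    """Stage 1: pair every adjective translation with every noun translation."""
    return [f"{adj} {noun}" for adj, noun in product(adjectives, nouns)]


def select_translation(candidates, hit_count):
    """Stage 2: keep the candidate with the highest number of hits."""
    return max(candidates, key=hit_count)


# Invented hit counts; a real run would query a Web search engine instead.
toy_hits = {
    "starker Tee": 950,
    "starker Aufguss": 40,
    "kräftiger Tee": 140,
    "kräftiger Aufguss": 25,
}

candidates = candidate_translations(["starker", "kräftiger"], ["Tee", "Aufguss"])
best = select_translation(candidates, lambda phrase: toy_hits.get(phrase, 0))
print(best)
```

Passing the hit-count lookup in as a function keeps the selection step independent of any particular search engine, so the same code could be exercised with stored counts during testing.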

2 Background Research

2.1 Machine Translation

Jurafsky and Martin (2000) define machine translation as the use of computers to automate some or all of the process of translating from one language to another. Translation, of any sort, is often viewed as an art form in itself and human translators require a high degree of skill to produce accurate translations. Nevertheless, according to Kay (1997), there is much about [translation] that is mechanical and routine. Since the earliest days of computer development, attempts have been made to harness computing power and apply it to translation tasks.

Machine translation has developed greatly since its inception. Papineni et al (2002) believe that machine translation progress stems from evaluation, and evaluating the quality of translations provided by any system certainly remains an important task. Only by recognising where and why a tool succeeds (and fails) can we gain an idea of how to improve the system.

2.1.1 Evaluation of machine translation

Translation remains an intensely human endeavour (Jurafsky and Martin, 2000), yet some of the tasks may be carried out by computers, and studies have attempted to analyse and evaluate the quality of translations provided by machine translation approaches. Carroll (1966) states that a translation, whether mechanical or human, is characterised by (a) its intelligibility, and (b) its fidelity to the sense of the original text, and these appear to be the major criteria to evaluate when looking at the quality of translations.

The intelligibility (also referred to as the fluency) of a translation relates to how intelligible and credible the translated text is in the target language. This evaluation criterion deliberately ignores the original source-language phrase and is tested independently: in his experiment Carroll states that when the evaluators measured intelligibility they did so without reference to the original.

For his method of evaluation, Carroll devised scales with which evaluators could rate the quality of translations, for both fluency and fidelity. The fluency scale is one that I can apply when creating the evaluation for my own study. However, the scale used by Carroll to rate fidelity is not appropriate to my work. His scale does not rate fidelity directly, but rather scores the translation by assigning an informativeness value to the original text when compared to the translation. This is designed to see how much extra information is gained when the original text is seen, and thereby show how much information from the original has been left out in the translation. This type of test is better suited to large amounts of text and would not work well when applied to the simple adjective-noun combinations translated without any contextual information in my study.

Other studies for the evaluation of translated texts are also available and a scale devised by Nagao et al (1985) is reproduced in Benoit et al (1993). Nagao et al designed a seven-point scale to be used for judging the accuracy or fidelity of translations. In presenting Nagao's scale Benoit et al again stress the importance of separating the different parts of the evaluation procedure to ensure that the evaluators' judgements are not influenced by the correct translation. A further important issue presented by Benoit et al concerns considering the intended users of the translated text and analysing their reasons for conducting a machine translation. Some users may be interested merely in ascertaining the general subject matter of a text or gaining an approximation of the content rather than being concerned with obtaining a fully grammatically correct translation.

Papineni et al (2002) propose a further, automatic method of evaluating the quality of translations, known as BLEU (bilingual evaluation understudy). This proposed method would be automatic, quick and language-independent, correlating highly with human evaluation techniques. Although this method of evaluation is of interest, it is less applicable to the study that is being carried out here. It is primarily designed to be used when quick and frequent evaluations are required, such as when developing a full machine translation system. A more thorough method of evaluation, carried out on a single occasion only, is more appropriate for an evaluation of the method I am investigating.

2.1.2 The Web for machine translation

The piece of research that is most similar to that which I am carrying out is by Grefenstette (1999). In this paper he shows how it is possible to use the Web as a corpus for translation purposes. A major difference between my work and that of Grefenstette is that he carried out the work for noun-noun phrases whereas mine is concerned with adjective-noun phrases. Grefenstette used compounds and multi-word phrases found in German and Spanish and translated their subparts individually to create a large number of candidate translations for the phrase. However, Grefenstette only selected compositional noun phrases that fitted his particular criteria so that the search would be more likely to yield good results. He only selected compounds that could be decomposed into two words in the source languages and whose English translations consisted of two words. My study uses automatically selected items, which should test more effectively how well the Web translation method actually works. If, as in the case of Grefenstette, the phrases selected are those that will work well anyway, then the study becomes biased.

Following the generation of candidate translations of the noun-noun phrases, Grefenstette then selected the best translation by entering the candidates into a Web search engine (in this case AltaVista) and choosing the most commonly occurring translation, thereby testing the theory that the correct translation is the most frequently occurring one. My method differs slightly in that I have created an automatic translation tool, the intention being that any adjective-noun phrase can be entered and the correct translation generated.

Grefenstette's approach yields favourable results in that for German the most common search engine result gave the correct translation 87% of the time, as tested by comparing with the dictionary translation. For Spanish the figure is 86%. My method of evaluation does not involve comparison with a dictionary but an independent evaluation process with native speakers of German. Although his method is perhaps in some respects a little biased, Grefenstette does show that the World Wide Web is a useful source of linguistic examples, even though he acknowledges that some linguists cringe at the idea of using this uncharacterised and dirty corpus to derive linguistic information.

Another paper that discusses the use of the Web as a resource for linguistic tasks is by Volk (2002), who says that "the World Wide Web can be seen as the largest corpus ever" and that through searching the Web one can exploit the frequency information. Volk's paper is useful in that it provides numerous examples of the many and varied ways in which the Web has been used as a resource for computational linguists.

The rise of the use of the Web as a resource for linguists is emphasised by the fact that the September 2003 issue of the journal of the Association for Computational Linguistics was a special issue dedicated to the Web as a corpus. Kilgarriff and Grefenstette (2003) gave an introduction to it, in which it is stated that language scientists and technologists are increasingly turning to the Web as a source of language data. As an answer to the question of whether or not the Web is a corpus, Kilgarriff and Grefenstette say yes.

2.2 Collocations

2.2.1 Collocation definitions

To help form an idea of what is meant by the term collocation I consulted some linguistics textbooks (Palmer, 1981; Leech, 1981).

Palmer (1981) describes collocative meaning as consisting of the association a word acquires on account of the meanings of words which tend to occur in its environments. From studying the literature it certainly appears that this is an important aspect of what a collocation is. Words, when placed together, can take on a meaning which cannot necessarily be deduced from the individual words. This concept is known as non-compositionality.

Manning and Schuetze (1999) also give a definition of collocation and state that a collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. This shows that collocations often involve idiom, which explains why it is often so difficult to translate them.

2.2.2 Collocational criteria

As well as simply defining the meaning of the word collocation, Manning and Schuetze identify three criteria that can be used to test whether a phrase is likely to be a collocation:

- Non-compositionality: The meaning of a collocation is not a straightforward composition of the meaning of its parts.
- Non-substitutability: We cannot substitute other words for the components of a collocation even if, in context, they have the same meaning.
- Non-modifiability: Many collocations cannot be freely modified with additional lexical material or through grammatical transformations.

When it came to identifying the collocations from the list of candidate phrases, these are the basic criteria upon which I based my decisions. However, these criteria cannot be seen as definitive and the selection of collocations remained, in some instances, quite a demanding task and one that, by its nature, involved a reasonable amount of subjective decision making. The fact that the literature debates the subject of selecting collocations so frequently shows that there is no general consensus on the true meaning of the word collocation.

2.2.3 Selecting collocations from corpora

The literature also contains information very relevant to my project on identifying collocations in a text, especially the frequency method, which is the method that I will be using. On this subject Manning and Schuetze write that if two words occur together, then that is evidence that they have a special function that is not simply explained as the function that results from their combination.
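As a rough illustration of the frequency method, the following Python sketch counts how often each adjective-noun pair occurs and ranks the pairs by that count. It is only a sketch under stated assumptions: the function name and the toy list of pairs are invented, and in practice the pairs would be extracted from a part-of-speech tagged corpus.

```python
from collections import Counter


def rank_by_frequency(pairs):
    """Return (pair, count) tuples ordered from most to least frequent."""
    return Counter(pairs).most_common()


# Invented adjective-noun pairs standing in for corpus output.
toy_pairs = [
    ("strong", "tea"),
    ("heavy", "rain"),
    ("strong", "tea"),
    ("old", "gelding"),
    ("heavy", "rain"),
    ("strong", "tea"),
]

ranked = rank_by_frequency(toy_pairs)
for (adjective, noun), count in ranked:
    print(count, adjective, noun)
```

The pairs at the top of the ranking are the candidates for collocations. Frequency alone cannot confirm them, so a manual check against the criteria in section 2.2.2 would still be needed.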

Although the literature shows that the frequency method is an effective way of finding collocations, it also describes other methods. For example, one method is to see if the number of times that two words occur together is significantly higher than the number that would occur by pure chance. This method, known as hypothesis testing, involves looking at the probability of two words occurring together and then carrying out significance testing using predefined testing values to see if words that occur together frequently are doing so because they are collocations or just because the words individually occur frequently and, thus, there is a high chance that they will occur together frequently.

2.3 Tools and resources

2.3.1 British National Corpus

An essential resource required to carry out this project is a suitable corpus from which the phrases to be translated can be extracted. Without this the project could not be carried out at all. A corpus available in the School of Computing is the British National Corpus (BNC) and this is an extremely valuable resource, not least due to its size, as it contains 100 million words (90 million from written text, 10 million from spoken text). The size of this resource makes it very useful for giving a fair representation of language used in the English-speaking world. Because the BNC is encoded in SGML it is very simple to search through with the correct tools. The BNC is tagged by part of speech, which makes it easy to search for and extract particular patterns of words; in my case, all instances of an adjective followed by a noun. I had to spend some time studying the documentation provided for the BNC in order to understand some of the background to it and also to see which tags were used for which parts of speech, so that I would be able to search successfully for the phrases I required.

2.3.2 Gsearch

A tool that is required for the collocation finding part of the project is the gsearch tool. This is a program that searches through encoded texts and can extract from them only the parts that are required, i.e. for this project all cases where a noun is preceded by an adjective. The advantage of the program is that it allows automatic searching of very large texts in a reasonably fast time, in a way that would be impossible to carry out by hand. In order to search the BNC correctly using gsearch it was necessary for me to study the documentation provided with the program (Corley et al, 2001). Although not overly complicated, the syntax for formulating queries did need to be studied to ensure that my search was carried out as I wished. This is particularly important when considering the size of the BNC and the resulting fact that searches take significant amounts of time to complete.

2.3.3 Unix core tools

For ordering the phrases output by the gsearch program I required the standard Unix text processing tools uniq and sort. Church provides some relevant documentation, giving useful examples of ways to use them effectively for natural language processing tasks.

2.3.4 Bilingual Dictionary

In order to generate candidate translations an existing, machine-readable German-English dictionary is required. Ideally the dictionary to be used will be large in order to provide a range of translations for each word. Initial research into obtaining a suitable dictionary proved somewhat unsuccessful. Although a number of dictionaries available in electronic form are free of charge, the underlying data has not been made available to the public. These are, therefore, of no use when trying to carry out automated tasks with them. One alternative tool available was a word list, available online. Although it would have been possible to use this to form translations, it would not have been highly satisfactory as the number of translations for each word was not very high. An advantage of such a word list, however, is that it only contains single-word translations for each word and this would simplify the translation process.

The dictionary that was eventually obtained is the SGML encoded version of The Oxford-Duden German-English Dictionary (Oxford University Press, 1999). This is a high-quality resource to have available and is described by the publishers as being the product of nearly ten years of joint work by two of the world's foremost dictionary publishers.
We know that, this being a commercial product, it is likely to have been subject to a large degree of research and quality control and is likely to provide accurate and complete translations.

2.3.5 CELEX lexical database

CELEX (CEntre for LEXical Information) (Baayen et al, 1995) is a lexical database developed at Nijmegen in the Netherlands and holds information about words in the English, German and Dutch languages. The information contained within CELEX can be used to inflect adjectives when creating candidate translations because it stores the various wordforms which are connected to the uninflected lemmas. Therefore, by knowing the lemma, the requisite wordform can be generated, depending on the situation.

CELEX consists of individual database files that are plain ASCII text files with each entry occurring on a separate line and divided into fields with a backslash separating each field. The following is an example of the format, here an extract from the file gml.cd:

    1\A\563\M\1\Y\Y\Y\A\N\N\N\N\(A)[N]\N\N\N\N\S3/P2\N
    2\Ae\4\M\1\Y\Y\Y\Ae\N\N\N\N\(Ae)[N]\N\N\N\N\S3/P2\N
    3\aalen\1\Z\1\Y\Y\Y\Aal\N\N\N\N\((Aal)[N])[V]\N\N\N\N\r1\N
    4\Aal\80\M\1\Y\Y\Y\Aal\N\N\N\N\(Aal)[N]\N\N\N\N\S1/P1\N

On their own the entries do not appear to provide a great deal of information, so the accompanying documentation was consulted in order to ascertain what is represented by each field in each file. The CELEX user manual for German (Baayen et al, 1995) gives a detailed description of all that is covered by the database and each file is also accompanied by a README which gives the meaning of each field in the file.

2.3.6 Regular Expressions

In designing the translation tool use must be made of regular expressions as a means of searching through texts. Ramsay presents a guide outlining the concepts involved and providing numerous examples to illustrate their application.
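The backslash-separated CELEX format shown in section 2.3.5 is straightforward to read programmatically. The Python sketch below splits the four sample lines from gml.cd into fields. Treating the first field as the entry number and the second as the headword is an assumption read off the sample itself; the README accompanying each file remains the authority on what each field means. The names used are illustrative only.

```python
# The four sample records from gml.cd, kept in a raw string so that the
# backslashes are taken literally.
SAMPLE = r"""1\A\563\M\1\Y\Y\Y\A\N\N\N\N\(A)[N]\N\N\N\N\S3/P2\N
2\Ae\4\M\1\Y\Y\Y\Ae\N\N\N\N\(Ae)[N]\N\N\N\N\S3/P2\N
3\aalen\1\Z\1\Y\Y\Y\Aal\N\N\N\N\((Aal)[N])[V]\N\N\N\N\r1\N
4\Aal\80\M\1\Y\Y\Y\Aal\N\N\N\N\(Aal)[N]\N\N\N\N\S1/P1\N"""


def parse_celex_line(line):
    """Split one CELEX record into its backslash-separated fields."""
    return line.rstrip("\n").split("\\")


records = [parse_celex_line(line) for line in SAMPLE.splitlines()]

# Assumption: field 0 is the entry number and field 1 the headword.
headword_by_id = {fields[0]: fields[1] for fields in records}
print(headword_by_id)
```

A lookup keyed on the entry number is the natural structure here, since the CELEX files refer to one another through that number.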

3 Methodology and Project Management

3.1 Objectives

The methodology for this project relates closely to the objectives of the project, as detailed in the mid-project report. These are reproduced below:

1. To carry out background research relevant to the project area.
2. To search a corpus and obtain from it a list of adjective-noun phrases.
3. To select from these phrases the ones most likely to be collocational, based on frequency (the automatic collocation finder) and collocational criteria.
4. To use a dictionary to obtain candidate translations of these phrases, based on translations of the individual words.
5. To create a program, using the Google API, that takes the candidate translations and selects the most appropriate translation, based on the number of hits from the web search.
6. To give instructions for two native speakers of German to independently evaluate the effectiveness of translations provided by the tool.
7. To compare the approach with other approaches in literature.

This project is not primarily concerned with the development of an application but rather with the analysis of a method of performing machine translation. Therefore, no specific existing methodology is easily applied to this type of project. Rather, it was most important to select the correct tools to carry out the tasks required.

3.2 Project Management

Each of the objectives above relates to a specific milestone that needed to be completed in order to facilitate progress with the project and they, therefore, lend themselves naturally to the formulation of a schedule.

Objective 1 was carried out prior to the commencement of any practical work to give the necessary background knowledge and an appreciation of the tools involved. This was scheduled to be carried out by 24 October 2003 (also the deadline for submission of the project's minimum requirements) and was completed on schedule. As well as the initial stage of background research, research was ongoing throughout the course of the project.

Objectives 2 and 3 are closely related. They are concerned with the extraction of phrases from a corpus and required the use of existing tools available within the School of Computing, as selected during the background research stage of the project. This stage identified and applied the tools most suitable for the extraction of adjective-noun phrases and for identifying them as collocations. Although this stage related to the collection of data for the evaluation process it was, in fact, carried out prior to the development of the translation tool. This was to give a general grounding in the methods required and also to prepare the evaluation data so that this was ready as soon as the translation tool had been created. These two milestones were scheduled to be completed by the end of Week 11 of Semester 1, prior to the submission of the mid-project report, and were completed on time.

Objective 4, the implementation of the candidate generation tool, involved the greatest amount of practical work. Although a large amount of time was allocated for its completion, the schedule did not fully allow for the fact that the Christmas break would largely be taken up with exam preparation, and no significant amount of work was completed on the project over this period. Consequently the development of the translation tool became delayed. In addition to this, problems with the implementation were encountered, as discussed in Chapter 4. This meant that although a date of 15 February 2004 was set for its completion it was not fully implemented until 15 March 2004.

Objective 5, creating the program to rate translations, was scheduled to be completed by 29 February 2004, yet this also became delayed due to the delay in completing Objective 4. However, this objective did not require as substantial an amount of work and could also be partly implemented whilst Objective 4 was in progress. This meant that it was completed by 30 March 2004, behind schedule, yet the amount of time by which the project was behind the initial schedule had not increased.

It would also be impossible to complete Objective 6 without first having completed all of the previous objectives. It had originally been intended to complete the entire evaluation procedure by 15 March. However, at this date there was no data to evaluate due to delays in completing prior objectives. The form of the evaluation could, however, be devised, meaning that the evaluation could be carried out as quickly as possible once the data became available. All the results of the evaluation were eventually collected by 19 April.

Objective 7 involved firstly analysing other approaches (as completed in the background research) and then comparing them with my work (in the evaluation). This would form part of this final report and, thus, its scheduled completion date was the final deadline, 28 April.

Although I was not able to keep to the schedule towards the end of the project, I had allowed for a certain degree of slippage and there was up to a month of time allocated to this at the end of the original project schedule. This ensured that, despite not fully adhering to the schedule, the project could still be completed within the deadline. Objective 4 was subject to the largest degree of delay but once this had been completed the project regained the correct working pace and the remaining objectives were accomplished within a time-scale similar to the original project plan.

18 4 Implementation of adjective-noun translation tool 4.1 Generation of candidate translations Searching a dictionary The initial task in the translation process was the generation of candidate translations using a program that exploits an existing machine readable English-German dictionary. This program will be called CANTRAN. Obtaining a suitable dictionary had proven to be a difficult task, however, a copy of the Oxford Duden German-English Dictionary (Oxford University Press, 1999) was eventually obtained and this would form the basis of my candidate translation tool. Due to copyright restrictions the dictionary could not be installed on School of Computing equipment and, therefore, the dictionary is read directly from a CD-ROM. The first step was to examine the structure of the dictionary. This would give me an insight into the most appropriate way to create a program that searched it. An example of a dictionary entry is : <e id=20968><hg><hw>music</hw></hg> <pr><ph>"mju:zik</ph></pr> <ps>n.</ps><s2 let=a><tr>musik,</tr> die; <ex>make <sd>music</sd>:</ex> <tr>musik machen;</tr> <tr>musizieren;</tr> <ex>student of <sd>music</sd>:</ex> <tr>musikstudent,</tr> der/<tr>studentin,</tr> die; <ex>piece of <sd>music</sd>:</ex> <tr>musikst&uu.ck,</tr> das; <tr>musik,</tr> die; <ex>set or put sth. to <sd>music</sd>:</ex> <tr>etw. vertonen od. in Musik setzen;</tr> <ex>have a gift for <sd>music</sd>:</ex> <tr>musikalisch begabt od. musikbegabt sein;</tr> <ex ty=idiom>be <sd>music</sd> to sb.'s ears</ex> <la>fig.</la> <tr>musik in jmds. 
Ohren sein</tr> <la>ugs.</la>; <ix>see also</ix> <xr><x><xh>face</xh> <xs>2 c</xs></x>; <x><xh>set</xh> <xs>1 s</xs></x>; <x><xh>sphere</xh> <xs>c</xs></x>;</xr></s2><s2 let=b><la>of waves, wind, brook</la> <tr>rauschen,</tr> das; <la>of birds</la> <tr>gesang,</tr> der;</s2><s2 let=c><la>score</la> <tr>noten</tr> pl.; <la>as merchandise also</la> <tr>musikalien</tr> pl.; <ex>sheet of <sd>music</sd>:</ex> <tr>notenblatt,</tr> das; <ex>play from <sd>music</sd>:</ex> <tr>nach Noten spielen</tr></s2></e> It can be seen how the headword (the word that requires translation) is enclosed between the <hw> tags and translations of that word are given between the <tr> tags. This is the only information required for the purposes of the translation tool, yet aside from this is given a great deal of other information such as examples or extra grammatical information. The task requires for a tool to be created that takes a headword as input and outputs the various words given as translations. Due to the nature of the translation process, only dictionary translations consisting of a single word will be used. It was decided that CANTRAN should be written using the Perl programming/scripting language. Although I was previously not highly experienced with programming in Perl I knew that it is a language well-suited to carrying out text processing tasks and also one which is relatively 13

straightforward to learn and understand. Saltzman (2002) describes Perl as extremely easy to learn and stresses this ease of use frequently. Perl is simple to use yet can be applied to complex tasks. To search for the correct headwords in the dictionary it was necessary to use regular expressions. Perl has built-in support for regular expressions, and this contributed greatly to my decision to use it. According to Saltzman (2002), no other programming language can match the power of regular expressions in Perl.

The program was created to search for all cases of <hw>$term</hw>, where $term is a variable holding the string to be searched for, which has been read in from a file. There are two such variables, holding the adjective and the noun to be translated respectively. For each line that contained a match (this should be only one line, because the dictionary is structured so that each entry occupies a single line), the translations of the headword were obtained by searching for all matches of the regular expression /<tr>(.*)<\/tr>/, i.e. all instances where a translation is enclosed between <tr> tags. Other characters such as commas or brackets then needed to be removed from the translated word, and this was also done with regular expressions. The process was carried out for both terms in the adjective-noun phrase and the results for each term were stored in a separate array. The elements of the two arrays were then combined on output so that every combination of an adjective translation with a noun translation was produced.

Testing this on a number of examples and analysing the output made it clear that a few issues arising from ambiguous words needed to be resolved. Although it was necessary to translate the different meanings of a word, I wanted to preserve the correct part of speech, i.e. to translate adjectives only as adjectives and nouns only as nouns. The output from CANTRAN showed that this had not always been achieved, which was especially apparent because, in German, all nouns begin with a capital letter. An example of the problem is the English word good, which may be used as either an adjective or a noun. The issue was overcome by accepting as translations only adjectives starting with a lower-case letter and nouns starting with an upper-case letter.

Further testing of CANTRAN revealed something that had previously been overlooked and that could be used to improve the performance of the tool. German compound words are words made up of two or more words, for example Kernkraft, composed of the nouns Kern and Kraft. When testing with the phrase nuclear power, one of the translations of nuclear was Kern-, ending with a dash. The dash indicates that the word can be used as the first part of a compound and followed by another noun. Because all such terms in the dictionary end with a dash, the tool could be extended to check for them. The program was therefore altered so that, when the adjective translation ended with a dash, the dash was removed and the adjective and noun translations were output together as a single word. For this particular case the program was also altered to allow adjective translations starting with a capital letter, as an exception to the rule above. Some examples of dictionary entries and the corresponding translations output by CANTRAN can be seen in Appendix C.

4.1.2 Handling Agreement

The output from the translation program contains all the possible combinations of translations, yet it does not provide all the information that is needed for the Google search. Further complications arise because German adjectives must agree with the noun in gender (and number). The required endings therefore change from noun to noun, and a way of automatically inflecting adjectives is needed. For this project, the searches have been restricted to nouns in the nominative case. To generate the required adjectival endings automatically, a resource available in the School of Computing, CELEX (see section 2.3.5), was used. The three CELEX database files required for forming the correct adjectival declensions are:

German morphology, lemmas (gml.cd). This file gives morphological information on all the lemmas contained within the database.

German syntax, lemmas (gsl.cd). This file gives syntactic information on the lemmas in the database.

German morphology, wordforms (gmw.cd). This file lists the different declensions of the lemmas from the other files, which are cross-referenced using a unique number.
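Before turning to how these files are accessed, the candidate generation performed by CANTRAN (headword search, clean-up of the translations, the capitalisation rule and the compound rule) can be illustrated with a minimal sketch. It is written in Python rather than the Perl used in the project, and the sample dictionary lines, the function names, the non-greedy pattern and the lower-casing of the noun inside a compound are all assumptions made for illustration; the real dictionary entries are richer than this.

```python
import re
from itertools import product

# Hypothetical one-entry-per-line dictionary using the <hw>/<tr> markup
# described in the text. These lines are invented for illustration.
SAMPLE_DICTIONARY = [
    "<hw>nuclear</hw> <tr>Kern-</tr> <tr>nuklear,</tr>",
    "<hw>power</hw> <tr>Kraft</tr> <tr>(Macht)</tr>",
    "<hw>good</hw> <tr>gut</tr> <tr>Gute</tr>",
]


def translations(term, lines):
    """Return the cleaned <tr> contents of every entry with headword `term`."""
    headword = "<hw>" + re.escape(term) + "</hw>"
    found = []
    for line in lines:
        if re.search(headword, line):
            # Non-greedy match (an assumption) so that each <tr>...</tr>
            # pair on the line is captured separately.
            for raw in re.findall(r"<tr>(.*?)</tr>", line):
                # Strip commas and brackets from the translation.
                found.append(re.sub(r"[,()\[\]]", "", raw).strip())
    return found


def candidates(adjective, noun, lines):
    """Combine adjective and noun translations into candidate phrases."""
    adjectives = translations(adjective, lines)
    # Nouns must start with an upper-case letter.
    nouns = [n for n in translations(noun, lines) if n[:1].isupper()]
    result = []
    for adj, n in product(adjectives, nouns):
        if adj.endswith("-"):
            # Compound prefix: drop the dash and join into a single word.
            result.append(adj[:-1] + n.lower())
        elif adj[:1].islower():
            # Ordinary adjectives must start with a lower-case letter.
            result.append(adj + " " + n)
    return result
```

With the sample lines above, `candidates("nuclear", "power", SAMPLE_DICTIONARY)` yields Kernkraft and Kernmacht as compounds alongside the uninflected phrases nuklear Kraft and nuklear Macht, which would still need the agreement handling described in this section.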
To access the information contained within the fields of these files, it was decided at first to use standard command-line tools for text processing. Using the command tr to replace every backslash with a space leaves the text in columns separated by white space. The necessary entry can then be found using the grep command, and the columns within that entry are easy to manipulate using awk, because within awk the variables $1, $2 and so on represent the different columns, which makes locating particular fields within a record a simple task.

The file gmw.cd lists all of the different wordforms of German words and therefore contains all of the possible declensions of any adjective. The adjective itself, before any declension, is given in the gml.cd lemma file. What is required, therefore, is to identify the adjective lemma in gml.cd and match it with the corresponding wordforms given in gmw.cd. This is possible because in CELEX each lemma is given a unique code, and in the wordforms file all forms of the lemma are referenced by this code.

This can be illustrated with the example of the German adjective jung. The corresponding entry in the lemma file (gml.cd) is:

19108\jung\2236\M\1\Y\Y\Y\jung\A\N\N\N\(jung)[A]\N\N\N\N\I\N

The first column (19108) is the unique lemma identifier, and matching this value with the entries in the wordforms file gives:

\juengst\18\19108\u
\juengste\108\19108\u
\juengsten\216\19108\u
\juengster\65\19108\u
\juengstem\0\19108\u
\juengstes\5\19108\u
\jung\194\19108\o
\junge\485\19108\o
\jungen\513\19108\o
\junger\282\19108\o
\jungem\4\19108\o
\junges\74\19108\o
\juenger\46\19108\c
\juengere\70\19108\c
\juengeren\126\19108\c
\juengerer\28\19108\c
\juengerem\0\19108\c
\juengeres\2\19108\c8

Note how all entries in the fourth column are 19108. These are, therefore, all the possible forms of the lemma jung. To create the candidate translations we do not, however, need all of these wordforms. The final column (called FlectType) shows how the word has been inflected. As the Web search is only carried out for phrases in the nominative case, adjectives need to be declined with the endings -e, -er and -es, depending on the gender of the particular noun. These endings correspond to the entries that have o4, o6 and o8 in the final field, as explained in the CELEX reference manual for the German part of the database.
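The cross-referencing of the lemma file with the wordforms file can be sketched as follows. This is an illustration in Python, not the tr, grep and awk pipeline used in the project. The lemma record is the one quoted above; the wordform identifiers in the first field and the digits on the FlectType codes other than o4, o6 and o8 are assumptions, as is the pairing of o4, o6 and o8 with the forms ending in -e, -er and -es, which is inferred from the order in which they are listed above.

```python
# Lemma record as quoted in the text (gml.cd layout).
LEMMAS = [
    "19108\\jung\\2236\\M\\1\\Y\\Y\\Y\\jung\\A\\N\\N\\N\\(jung)[A]\\N\\N\\N\\N\\I\\N",
]

# Illustrative wordform records (gmw.cd layout). The first field (wordform
# identifier) and the codes o5 and c4 are placeholders assumed for this sketch.
WORDFORMS = [
    "1\\jung\\194\\19108\\o",
    "2\\junge\\485\\19108\\o4",
    "3\\jungen\\513\\19108\\o5",
    "4\\junger\\282\\19108\\o6",
    "5\\junges\\74\\19108\\o8",
    "6\\juengere\\70\\19108\\c4",
]

# FlectType codes named in the text for the nominative endings.
NOMINATIVE_CODES = {"o4", "o6", "o8"}


def lemma_id(adjective, lemma_lines):
    """Return the unique identifier (first field) of the given lemma."""
    for line in lemma_lines:
        fields = line.split("\\")
        if fields[1] == adjective:
            return fields[0]
    return None


def nominative_forms(adjective, lemma_lines, wordform_lines):
    """Return the wordforms of `adjective` carrying a nominative code."""
    ident = lemma_id(adjective, lemma_lines)
    forms = []
    for line in wordform_lines:
        fields = line.split("\\")
        # The fourth field links the wordform to its lemma; the fifth is
        # the FlectType code.
        if fields[3] == ident and fields[4] in NOMINATIVE_CODES:
            forms.append(fields[1])
    return forms
```

An adjective that is missing from the lemma file simply produces an empty list, which is the situation that later calls for a fallback.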
Ascertaining the gender of the noun is also a task that can be carried out using CELEX. The German syntax, lemmas file gives, in the fifth column, GendNum, a code corresponding to the gender of the noun. Using a method similar to that used for the adjective, this code is easily obtained for every noun that is searched. A code of 1, 2 or 3 is returned, corresponding to the genders masculine, feminine and neuter respectively. With this code, the requisite adjectival endings can be generated.

I initially intended to carry out all of the work involving CELEX using Unix shell scripts that simply called the appropriate commands. However, I had little experience in writing such scripts and ran into particular problems in, for example, linking together the output returned from searching the different files. It appeared that using CELEX to generate the adjective endings might not be possible, and the progress of the project was becoming slightly delayed. Without this part of the project it would not be possible to move on to the next stage: unless the adjectives were inflected and the candidate translations thereby generated, the final search process could not be carried out.

There were alternative methods that could be considered instead of using the information contained within CELEX to generate the adjective endings. One possibility was simply to add -e to the end of all adjectives. This would generate the correctly declined adjectives for all genders in the nominative case with the definite article. Alternatively, one could add the endings -e, -er and -es to all adjectives, creating the correct endings for all genders in the nominative case, whether the article is definite, indefinite or absent. One of the main drawbacks of this second method is that it leads to a large amount of redundancy, as combinations are generated that are not valid. This would not make an overall difference to the results, as invalid combinations would be expected to return no results when the search is carried out.
However, it would introduce a considerable amount of unnecessary computation and create practical problems if such a large number of queries had to be handled. Furthermore, the number of queries that a single user may submit through the Google Web API in a single day is limited, so available queries would effectively be wasted on phrases that are not valid. A further problem with both of the above alternatives is that they only work for adjectives that are declined in the regular way and thus do not deal with the many exceptions to the standard rules. For example, for some adjectives ending with a vowel, the vowel is left out when the adjective is declined. Although it would not be excessively difficult to program for these exceptions, it would require a certain amount of coding and would still not produce definitive results, because there are in turn a number of exceptions to the exceptions.

I therefore decided that it would be best to persevere with a method that exploited the CELEX database and the fact that it contains only valid wordforms. Rather than using shell scripts and Unix commands, I would instead attempt to create a Perl program that was able to inflect the adjectives correctly. In the end the program that I created, INFLECT, uses a combination of the two methods: it is a Perl program which calls system commands. I decided to do this because the necessary awk and tr commands had already been written, so this was the best way to move the project along and avoid repeating work. I thus wrote a Perl program that reads a German adjective and noun from a file and, using system calls as above, finds the gender of the noun and the correctly declined forms of the adjective. The output from a successful run of CANTRAN can therefore be input to INFLECT, which creates the candidate translations ready to be submitted to the Web search.

Testing the program showed that a number of adjectives were not found in CELEX. These would cause a problem, as no candidate translation would be put forward to the Web search. For such terms I therefore decided to use the alternative method described above and simply added the applicable endings to each adjective, depending on the gender of the noun. This works for the majority of cases.

A final issue relates to compound words, as described above. These compound terms do not require any inflection and should therefore be treated differently when they are passed to INFLECT. To deal with them, the program was set up so that any translation generated by CANTRAN which consists of only a single word is output unaltered by INFLECT.
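The overall behaviour of INFLECT described in this section can be summarised in a short sketch. It is written in Python rather than Perl and is not the project's code. The `lookup` argument is a hypothetical stand-in for the CELEX search, and the choice of fallback endings per gender is an assumption based on regular nominative singular declension (-e with the definite article, -er or -es otherwise for masculine and neuter nouns).

```python
# Regular nominative singular endings, keyed by the gender codes described
# in the text (1 = masculine, 2 = feminine, 3 = neuter). Assumed mapping:
# -e covers the definite article; -er and -es cover the indefinite article
# or no article.
FALLBACK_ENDINGS = {1: ["e", "er"], 2: ["e"], 3: ["e", "es"]}


def inflect(candidate, gender_code, lookup=None):
    """Return the inflected candidate phrases for one CANTRAN output.

    `lookup` is a hypothetical callable standing in for the CELEX search;
    it takes (adjective, gender_code) and returns a list of declined forms,
    or an empty list if the adjective is not in the database.
    """
    words = candidate.split()
    if len(words) == 1:
        # A single word is a compound and is passed through unaltered.
        return [candidate]
    adjective, noun = words[0], " ".join(words[1:])
    forms = lookup(adjective, gender_code) if lookup else []
    if not forms:
        # Adjective not found: fall back to the regular endings.
        forms = [adjective + ending for ending in FALLBACK_ENDINGS[gender_code]]
    return [form + " " + noun for form in forms]
```

For example, `inflect("jung Mann", 1)` falls back to junge Mann and junger Mann, while `inflect("Kernkraft", 2)` returns the compound unchanged.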
4.2 Translation selection using Web search

The final stage of the implementation involved submitting the candidate translations generated by CANTRAN and INFLECT to a search engine in order to identify the best translation, taken to be the one which receives the greatest number of hits. Carrying out such a search by hand would not be practical, so I used the Google Web API provided by Google to carry out the search automatically. This tool allows Google Web search to be queried directly from within a program written in the language of your choice. It was again decided that the Google search program would be written in Perl. Having programmed the translation tool in Perl I felt more confident in using the language, and it was clearly advantageous for both programs to be written in the same programming language.

The construction of the Google search program was not complicated. Writing it simply involved referencing the correct libraries for the Google Web API and setting up a search that returns the total number of hits for the query submitted. Testing the program raised an issue: the number of results returned by a search through the Web API did not correspond to the number of results obtained when the same search was carried out manually on the Google website. Initially I believed that this might be due to an error in my program, but checking the program did not reveal any obvious errors and the query appeared to be set up correctly. The Frequently Asked Questions on the Google website did not give any further insight into the problem. However, the Google Web API discussion groups, also on the Google website, made it apparent that the problem was neither unique to me nor a fault in my program, but a general issue with the tool that had been experienced by a number of users. As this was the case, there was nothing that could be done to overcome the problem. Although the actual numbers varied between searches on the Google website and searches through the API, the relative frequencies of different queries remained the same, meaning that the selection of the best translation should be unaffected. Some users of the discussion groups also suggested that the figures quoted by the API are actually more accurate than those provided by the Web search, although this claim cannot be fully substantiated.

The program is able to read any number of candidate translations from a file and submit them as queries to the Google Web search. Once all candidate translations have been searched, the program ranks the search results according to the number of hits.
The top-ranking query is then taken as the correct translation. Where the candidates include differently inflected forms of the same translation, the results for each form were summed to give an overall figure for each phrase.
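The summing and ranking step can be sketched as below. This is a Python illustration, not the Perl program used in the project, and it deliberately does not reproduce any Google Web API call: `hit_count` is a hypothetical callable that returns the number of hits for a phrase, and the counts in the usage example are invented placeholders, not search results.

```python
from collections import defaultdict


def rank_translations(candidates, hit_count):
    """Rank translations by the summed hits of their inflected forms.

    `candidates` is a list of (translation, inflected_phrase) pairs, where
    `translation` identifies the uninflected phrase that the inflected
    phrase belongs to. `hit_count` is a hypothetical callable mapping a
    phrase to its number of hits.
    """
    totals = defaultdict(int)
    for translation, phrase in candidates:
        totals[translation] += hit_count(phrase)
    # Highest total first; the first element is the selected translation.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Usage with invented placeholder counts (not real search results).
PLACEHOLDER_HITS = {
    "junge Mann": 120, "junger Mann": 80,
    "junge Herr": 15, "junger Herr": 5,
}
CANDIDATES = [
    ("jung Mann", "junge Mann"), ("jung Mann", "junger Mann"),
    ("jung Herr", "junge Herr"), ("jung Herr", "junger Herr"),
]
ranking = rank_translations(CANDIDATES, lambda p: PLACEHOLDER_HITS.get(p, 0))
best_translation = ranking[0][0]
```

Because only the ordering of the totals matters, the selection is unaffected by a consistent difference between the counts reported by the API and those shown on the website.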

5 Collocation Finder

5.1 Extraction of adjective-noun phrases from a corpus

For the evaluation of the final translation tool, a number of adjective-noun phrases to be translated must be selected. These were extracted from the British National Corpus. The query used to extract all the adjective-noun phrases from the BNC with the Gsearch tool is given below:

    gsearch -M -nc bnc_10 ../demo/grammarbnc <tag = AJ0> <tag = NN1>

Options:

-M : search for more than one adjective-noun sequence per sentence in the BNC
-nc : give no context data about individual words
bnc_10 : the corpus to search, here 10% of the BNC
../demo/GrammarBNC : the grammar to use in the search
tag = AJ0 : search first for adjectives, excluding superlatives and comparatives
tag = NN1 : search for singular nouns (not proper nouns)

Originally it was intended to search the entire BNC (100 million words), but this was not possible in practice because the output file produced from the entire corpus was far too large for my disk quota. For the purposes of this project the corpus size, at 10 million words, is still sufficiently large to give a good representative sample. This is shown by the fact that the search returned 290,700 adjective-noun combinations. This is the actual number of combinations returned (the tokens) and, at this stage, it includes a number of duplicated phrases.

The output from the gsearch query was a file containing the collocations but also a lot of formatting data. To leave only the adjective-noun combinations that I required, a C++ program was created. This short program simply reads the entire gsearch output from a file and outputs only the actual adjective-noun phrases. Because the text was formatted in such a way that the required line always appeared on a line whose number was a multiple of a constant, creating the necessary program was fairly trivial.
Checking that the output was as expected was also straightforward: it was enough to check that the number of lines in the output file corresponded to the number of adjective-noun combinations returned by the gsearch query.
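The filtering step and the line-count check can be sketched in a few lines. This is a Python illustration rather than the C++ program used in the project, and the step of 3 and the sample lines are hypothetical, since the constant in the real gsearch output is not stated here.

```python
def extract_phrases(lines, step):
    """Keep only the lines whose 1-based line number is a multiple of `step`."""
    return [
        line.strip()
        for number, line in enumerate(lines, start=1)
        if number % step == 0
    ]


# Hypothetical output layout: two lines of formatting data, then the phrase.
SAMPLE_OUTPUT = [
    "<formatting>", "<formatting>", "young man",
    "<formatting>", "<formatting>", "nuclear power",
]
phrases = extract_phrases(SAMPLE_OUTPUT, 3)

# Sanity check in the spirit of the text: one phrase per reported match.
expected_matches = 2
lines_match = len(phrases) == expected_matches
```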

5.2 Selecting collocations by frequency

Once the list of adjective-noun combinations had been created, it was necessary to order it by frequency to see which adjective-noun phrases were the most common and thus most likely to be collocations. To do this, the Unix core tools uniq and sort were used. First, the file was sorted alphabetically using sort, and then uniq -c was run to remove all duplicate phrases, leaving one instance of each phrase preceded by a number showing the frequency with which it occurred. Following this, the phrases were arranged in frequency order using the command sort -nr, the -n option telling it to sort in numerical order and the -r option to reverse the order, i.e. to put the most frequent phrases at the top. Following the removal of duplicates, the number of unique adjective-noun combinations (the types) was 161,810.

For the purposes of comparison, as well as using the most frequent phrases, the whole candidate translation generation and selection procedure was also carried out with some phrases that occurred less often. It was therefore decided to look also at phrases that occurred in the BNC with a frequency of 2. I wanted to use phrases that occurred with a low frequency but not those that occurred only once, as many of the latter were liable to be obscure phrases for which the dictionary might not provide any suitable translations. In the 10% subsection of the BNC that I used there are 18,265 adjective-noun phrases that occur with a frequency of 2. Thus the final list of phrases to be translated (attached as Appendix E) consists of the 50 most frequent phrases and 50 phrases chosen at random from those which occurred twice. Of the 50 most common phrases it was decided that 40 were actually collocations, and of those that occurred twice this figure stood at 22.
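The counting, ordering and sampling described in this section can be sketched as follows. This is a Python illustration rather than the sort and uniq commands actually used; the function names, the fixed random seed and the sample phrases are assumptions, and the order of phrases with equal frequency may differ from that produced by sort -nr.

```python
import random
from collections import Counter


def frequency_list(phrases):
    """Count each phrase and order by descending frequency.

    Plays the role of `sort | uniq -c | sort -nr` in the text.
    """
    return Counter(phrases).most_common()


def select_phrases(phrases, top_n=50, low_freq=2, sample_n=50, seed=0):
    """Return the `top_n` most frequent phrases and a random sample of
    phrases that occur exactly `low_freq` times."""
    ranked = frequency_list(phrases)
    top = [phrase for phrase, _ in ranked[:top_n]]
    low = sorted(phrase for phrase, count in ranked if count == low_freq)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(low, min(sample_n, len(low)))
    return top, sample


# Tiny invented token list standing in for the extracted phrases.
TOKENS = ["young man"] * 3 + ["nuclear power"] * 2 + ["good idea"]
top, sample = select_phrases(TOKENS, top_n=1, sample_n=5)
```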
Deciding whether or not these were actually collocations involved analysing them using the definitions and criteria for collocations detailed in the background research. Despite the presence of the definitions and criteria, determining whether the phrases were in fact collocations remained a non-trivial and somewhat subjective task.

In analysing the effectiveness of the collocation finder we are testing the hypothesis that frequency is an effective way of selecting collocations. Manning and Schuetze (1999) state that the frequency method, when filtered by part of speech (as was done here by selecting only adjective-noun phrases), gives surprisingly good results. This is backed up by my findings: the proportion of collocational phrases in the set of high-frequency phrases (40/50) is significantly higher, statistically, than the proportion in the lower-frequency phrases (22/50) when a t-test is carried out at the 5% level of significance.
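The comparison of the two proportions can be reproduced with a short calculation. The exact variant of the t-test is not stated above, so the following Python sketch assumes a pooled two-sample t-test in which each phrase is scored 1 if it was judged a collocation and 0 otherwise; the function name is hypothetical.

```python
import math


def two_sample_t(successes_a, n_a, successes_b, n_b):
    """Pooled two-sample t statistic for two sets of 0/1 scores."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Sample variance of a 0/1 variable, with n - 1 in the denominator.
    var_a = p_a * (1 - p_a) * n_a / (n_a - 1)
    var_b = p_b * (1 - p_b) * n_b / (n_b - 1)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    return (p_a - p_b) / math.sqrt(pooled * (1 / n_a + 1 / n_b))


# 40 of the 50 high-frequency phrases and 22 of the 50 low-frequency
# phrases were judged to be collocations.
t = two_sample_t(40, 50, 22, 50)
degrees_of_freedom = 50 + 50 - 2
```

Under these assumptions t is roughly 3.95 with 98 degrees of freedom, which is well above the two-tailed critical value of about 1.98 at the 5% level and so is consistent with the conclusion stated above.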