Using Web Search for Machine Translation Nicolas Wehmeier BSc Computing and German 2003/2004


The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism. (Signature of student)

Summary

Machine translation is the process of using a computer to translate text from one language into another. This project explores and evaluates the use of the World Wide Web as a resource for example-based machine translation tasks. The aim of this project is to create a translation tool that uses the World Wide Web as a source of linguistic examples to translate adjective-noun phrases from English into German. An evaluation process carried out by native speakers of German judges the effectiveness of the tool.

Acknowledgements

Firstly, I would like to thank Ann Lawson at Oxford University Press for kindly allowing me the use of the Oxford-Duden German dictionary, without which the project could not have been completed. Thanks also go to Haiko Müller, Thomas Schneider and Jürgen Wehmeier for being my native speakers of German and carrying out the translation evaluations. Finally, and most importantly, I would like to thank Katja Markert for all of her help and guidance throughout the project.

Contents

Summary
Acknowledgements
Contents
1. Introduction
1.1 Aims and introduction
1.2 Minimum requirements
2. Background Research
2.1 Machine Translation
2.1.1 Evaluation of machine translation
2.1.2 The Web for machine translation
2.2 Collocations
2.2.1 Collocation definitions
2.2.2 Collocational criteria
2.2.3 Selecting collocations from corpora
2.3 Tools and resources
2.3.1 British National Corpus
2.3.2 Gsearch
2.3.3 Unix core tools
2.3.4 Bilingual dictionary
2.3.5 CELEX lexical database
2.3.6 Regular Expressions
3. Methodology and Project Management
3.1 Objectives
3.2 Project Management
4. Implementation of adjective-noun translation tool
4.1 Generation of candidate translations
4.1.1 Searching a dictionary
4.1.2 Handling agreement
Translation selection using Web search
5. Collocation Finder
Extraction of adjective-noun phrases from a corpus
Selecting collocations by frequency
6. Evaluation
6.1 Translation of the adjective-noun phrases
Evaluation procedure
Designing the evaluation
Evaluation results and interpretation
Fluency
Fidelity
Agreement and disagreement between evaluators
Conclusions
Summary
Comparison to similar research
Possibilities for improvement
Bibliography
Appendices
A Personal Reflection
B Example of the translation procedure
C Example of dictionary entries and output from CANTRAN
D Example output from INFLECT
E Adjective-noun phrases
F Translations and number of hits from Web search
G Evaluation form and results

1 Introduction

1.1 Aims and Introduction

This project aims to investigate the use of the World Wide Web as a resource for machine translation. An example-based machine translation tool is to be developed that exploits the huge amount of information contained on the World Wide Web. For this project the tool will be developed specifically to translate adjective-noun combinations.

To highlight the size of the Web, Figure 1.1 compares the counts for some randomly selected adjective-noun phrases as found on the Web with their frequencies in a corpus available in the School of Computing, the British National Corpus (BNC), which is itself a large corpus. The second column gives the number of results for each phrase as returned by the WWW search engine Google. The third column gives the frequency of the same phrase in the BNC.

Phrase                 Google     BNC
old gelding            19,100     2
addictive behaviour    9,920      3
accurate reflection    81,
easy access            3,780,
positive feedback      717,

Figure 1.1 Table showing counts of random adjective-noun phrases found in the British National Corpus and on the World Wide Web using the Google search engine in April

As can be seen, in each case the number of hits returned from the Web search is far greater than the number of occurrences in the BNC, showing the potential of the Web as a resource for obtaining linguistic examples. This project aims to see whether these vast numbers of examples can be exploited and applied as a basis for machine translation tasks.

The translation process has two stages. Firstly, the tool generates candidate translations of an adjective-noun phrase. This is achieved by translating each individual word in the phrase with the use of an existing dictionary. Combining all of the adjective translations with all of the noun translations forms the candidate translations for the adjective-noun phrase.

Secondly, the candidate translations are submitted to a World Wide Web search engine and the translation is selected based on the number of results from the search. It is assumed that the phrase which receives the highest number of hits is the most common one and, therefore, most likely to be the best translation. An example of the overall procedure, simplified and carried out by hand prior to the development of the translation tool, can be seen in Appendix B.

Although the design of the system will allow it to be used to provide potential translations for all adjective-noun phrases, a further aim of the project is to evaluate how the tool behaves when applied specifically to collocational phrases. Therefore, a method of selecting a number of collocational phrases to be translated will be required. Additionally, some phrases that are potentially less collocational will also be translated, as a means of comparison.

1.2 Minimum requirements

The following are, therefore, the minimum requirements of the project:

- A simple frequency-based finder for adjective-noun collocations in English corpora
- Use of machine-readable dictionaries to obtain translations for English adjective-noun combinations into German
- A program that uses the Google API to choose between several translations based on simple frequency
- Statistical evaluation of the whole procedure based on a guided evaluation session to be completed independently by two native speakers of German
- A comparison to the approaches used in literature
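The two-stage procedure described in Section 1.1 can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the tool developed in this project. The function names, the word lists and the hit counts are all invented, and the search engine is replaced by a plain lookup table so that the example runs without network access.

```python
from itertools import product


def candidate_phrases(adjectives, nouns):
    """Stage 1: pair every adjective translation with every noun translation."""
    return [f"{adj} {noun}" for adj, noun in product(adjectives, nouns)]


def select_translation(candidates, hit_count):
    """Stage 2: pick the candidate with the highest number of hits.

    `hit_count` is any callable mapping a phrase to a number of search
    results. A real tool would send the phrase to a Web search engine here.
    """
    counts = {phrase: hit_count(phrase) for phrase in candidates}
    best = max(candidates, key=counts.get)
    return best, counts


# Hypothetical data: these translations and counts are made up and do not
# come from the dictionary or from any real search.
FAKE_HITS = {
    "starker Tee": 1200,
    "starker Aufguss": 40,
    "schwerer Tee": 300,
    "schwerer Aufguss": 5,
}

candidates = candidate_phrases(["starker", "schwerer"], ["Tee", "Aufguss"])
best, counts = select_translation(candidates, lambda p: FAKE_HITS.get(p, 0))
print(best)  # starker Tee
```

Passing the hit-count function in as a parameter keeps the selection logic separate from whichever search service supplies the counts.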

2 Background Research

2.1 Machine Translation

Jurafsky and Martin (2000) define machine translation as "the use of computers to automate some or all of the process of translating from one language to another". Translation, of any sort, is often viewed as an art form in itself, and human translators require a high degree of skill to produce accurate translations. Nevertheless, according to Kay (1997), "there is much about [translation] that is mechanical and routine". Since the earliest days of computer development, attempts have been made to harness computing power and apply it to translation tasks. Machine translation has developed greatly since its inception. Papineni et al (2002) believe that machine translation progress stems from evaluation, and evaluating the quality of translations provided by any system certainly remains an important task. Only by recognising where and why a tool succeeds (and fails) can we gain an idea of how to improve the system.

2.1.1 Evaluation of machine translation

Translation remains an intensely human endeavour (Jurafsky and Martin, 2000), yet some of the tasks may be carried out by computers, and studies have attempted to analyse and evaluate the quality of translations as provided by machine translation approaches. Carroll (1966) states that a translation, whether mechanical or human, is characterised by (a) its intelligibility, and (b) its fidelity to the sense of the original text, and these appear to be the major criteria to evaluate when looking at the quality of translations. The intelligibility (also referred to as the fluency) of a translation relates to how intelligible and credible the translated text is in the target language. This evaluation criterion deliberately ignores the original source-language phrase. Instead it is tested independently, and in his experiment Carroll states that when the evaluators measured intelligibility they did so without reference to the original.

For his method of evaluation, Carroll devised scales with which evaluators could rate the quality of translations, for both fluency and fidelity. The fluency scale is one that I would be able to apply when creating the evaluation for my own study. However, the scale used by Carroll to rate fidelity is not appropriate to my work. His scale does not rate fidelity directly, but rather scores the translation by assigning an informativeness value to the original text when compared to the translation. This is designed to see how much extra information is gained when the original text is seen, and thereby show how much information from the original has been left out in the translation. This type of test is better suited to large amounts of text and would not work well when applied to the simple adjective-noun combinations translated without any contextual information in my study.

Other studies for the evaluation of translated texts are also available, and a scale devised by Nagao et al (1985) is reproduced in Benoit et al (1993). Nagao et al designed a seven-point scale to be used for judging the accuracy or fidelity of translations. In presenting Nagao's scale, Benoit et al again stress the importance of separating the different parts of the evaluation procedure to ensure that the evaluators' judgements are not influenced by the correct translation. A further important issue presented by Benoit et al concerns considering the intended users of the translated text and analysing their reasons for conducting a machine translation. Some users may be interested merely in ascertaining the general subject matter of a text or gaining an approximation of the content, rather than being concerned with obtaining a fully grammatically correct translation.

Papineni et al (2002) propose a further, automatic method of evaluation, known as BLEU (bilingual evaluation understudy). This method is intended to be automatic, quick and language-independent, while correlating highly with human evaluation techniques. Although this method of evaluation is of interest, it is less applicable to the study that is being carried out here. It is primarily designed to be used when quick and frequent evaluations are required, such as when developing a full machine translation system. A more thorough method of evaluation, carried out on a single occasion, is more appropriate for the method I am investigating.

2.1.2 The Web for machine translation

The piece of research that is most similar to that which I am carrying out is by Grefenstette (1999).
In this paper he shows how it is possible to use the Web as a corpus for translation purposes. A major difference between my work and that of Grefenstette is that he carried out the work for noun-noun phrases whereas mine is concerned with adjective-noun phrases. Grefenstette used compounds and multi-word phrases found in German and Spanish and translated their subparts individually to create a large number of candidate translations for the phrase. However, Grefenstette only selected compositional noun phrases that fitted his particular criteria so that the search would be more likely to yield good results. He only selected compounds that could be decomposed into two words in the source languages and whose English translations consisted of two words. My study uses automatically selected items, which should test more effectively how well the Web translation method actually works. If, as in the case of Grefenstette, only phrases that are likely to work well are selected, then the study becomes biased.

Following the generation of candidate translations of the noun-noun phrases, Grefenstette then selected the best translation by entering the candidates into a Web search engine (in this case AltaVista) and choosing the most commonly occurring translation, thereby testing the theory that the correct translation is the most frequently occurring one. My method differs slightly in that I have created an automatic translation tool, the intention being that any adjective-noun phrase can be entered and a translation generated. Grefenstette's approach yields favourable results: for German, the most common search engine result gave the correct translation 87% of the time, as tested by comparing with the dictionary translation. For Spanish the figure is 86%. My method of evaluation does not involve comparison with a dictionary but an independent evaluation process with native speakers of German. Although his method is perhaps in some respects a little biased, Grefenstette does show that the World Wide Web is a useful source of linguistic examples, even though he acknowledges that some linguists cringe at the idea of using this uncharacterised and dirty corpus to derive linguistic information.

Another paper that discusses the use of the Web as a resource for linguistic tasks is by Volk (2002), who says that "the World Wide Web can be seen as the largest corpus ever" and that through searching the Web one can exploit the frequency information. Volk's paper is useful in that it provides numerous examples of the many and varied ways in which the Web has been used as a resource by computational linguists. The rise of the use of the Web as a resource for linguists is emphasised by the fact that the September 2003 issue of the journal of the Association for Computational Linguistics was a special issue dedicated to the Web as a corpus.
Kilgarriff and Grefenstette (2003) gave an introduction to it, in which they state that "language scientists and technologists are increasingly turning to the Web as a source of language data". As an answer to the question of whether or not the Web is a corpus, Kilgarriff and Grefenstette say yes.

2.2 Collocations

2.2.1 Collocation definitions

To help form an idea of what is meant by the term collocation I consulted some linguistics textbooks (Palmer, 1981) and (Leech, 1981).

Palmer (1981) describes collocative meaning as consisting of the association a word acquires on account of the meanings of words which tend to occur in its environment. From studying the literature it certainly appears that this is an important aspect of what a collocation is. Words, when placed together, can take on a meaning which cannot necessarily be deduced from the individual words. This concept is known as non-compositionality. Manning and Schuetze (1999) also give a definition of collocation and state that "a collocation is an expression consisting of two or more words that correspond to some conventional way of saying things". This shows that collocations often involve idiom, which explains why it is often so difficult to translate them.

2.2.2 Collocational criteria

As well as simply defining the meaning of the word collocation, Manning and Schuetze identify three criteria that can be used to test whether a phrase is likely to be a collocation:

- Non-compositionality: The meaning of a collocation is not a straightforward composition of the meanings of its parts.
- Non-substitutability: We cannot substitute other words for the components of a collocation even if, in context, they have the same meaning.
- Non-modifiability: Many collocations cannot be freely modified with additional lexical material or through grammatical transformations.

When it came to identifying the collocations from the list of candidate phrases, these are the basic criteria on which I based my decisions. However, these criteria cannot be seen as definitive, and the selection of collocations remained, in some instances, quite a demanding task and one that, by its nature, involved a reasonable amount of subjective decision making.
The fact that the literature debates the subject of selecting collocations so frequently shows that there is no general consensus on the true meaning of the word collocation.

2.2.3 Selecting collocations from corpora

The literature also contains information very relevant to my project on identifying collocations in a text, especially the frequency method, which is the method that I will be using. On this subject Manning and Schuetze write that if two words occur together a lot, then that is evidence that they have a special function that is not simply explained as the function that results from their combination.
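As an illustration of the frequency method, the following Python sketch counts how often each adjective is immediately followed by a noun in a part-of-speech tagged text and keeps the pairs that occur at least a given number of times. It is only a sketch under stated assumptions: the tag names "ADJ" and "NOUN", the threshold and the sample tokens are placeholders and are not taken from the BNC or from the tools used in this project.

```python
from collections import Counter


def frequent_pairs(tagged_tokens, min_count=2):
    """Count adjacent adjective-noun pairs and keep the frequent ones.

    `tagged_tokens` is a list of (word, tag) tuples. The tags "ADJ" and
    "NOUN" are placeholders, not the tag names of any particular corpus.
    """
    counts = Counter()
    # Walk over each token together with its right-hand neighbour.
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            counts[(w1.lower(), w2.lower())] += 1
    # most_common() orders the pairs from most to least frequent.
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Invented sample input for illustration only.
tokens = [
    ("strong", "ADJ"), ("tea", "NOUN"), ("is", "VERB"),
    ("strong", "ADJ"), ("tea", "NOUN"),
    ("hot", "ADJ"), ("day", "NOUN"),
]
print(frequent_pairs(tokens))  # [(('strong', 'tea'), 2)]
```

Raw frequency alone also rewards pairs whose individual words are simply very common, which is the weakness that significance-based methods try to address.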

Although the literature shows that the frequency method is an effective way of finding collocations, other methods are also described. For example, one method is to see whether the number of times that two words occur together is significantly higher than the number that would be expected by pure chance. This method, known as hypothesis testing, involves looking at the probability of two words occurring together and then carrying out significance testing, using predefined critical values, to see whether words that occur together frequently do so because they are collocations or just because the words individually occur frequently and, thus, there is a high chance that they will occur together frequently.

2.3 Tools and resources

2.3.1 British National Corpus

An essential resource required to carry out this project is a suitable corpus from which the phrases to be translated can be extracted. Without this the project could not be carried out at all. A corpus available in the School of Computing is the British National Corpus (BNC), and this is an extremely valuable resource, not least due to its size, as it contains 100 million words (90 million from written text, 10 million from spoken text). The size of this resource makes it very useful for giving a fair representation of language used in the English-speaking world. Because the BNC is encoded in SGML, it is very simple to search through with the correct tools. The BNC is tagged by part of speech, which makes it easy to search for and extract particular patterns of words; in my case, all instances of an adjective followed by a noun. I had to spend some time studying the documentation provided for the BNC in order to understand some of the background to it and also to see which tags were used for which parts of speech, so that I would be able to search successfully for the phrases I required.

2.3.2 Gsearch

A tool that is required for the collocation finding part of the project is the gsearch tool.
This is a program that searches through encoded texts and extracts from them only the parts that are required, i.e. for this project all cases where a noun is preceded by an adjective. The advantage of the program is that it allows automatic searching of very large texts reasonably quickly, in a way that would be impossible to carry out by hand. In order to search the BNC correctly using gsearch it was necessary for me to study the documentation provided with the program (Corley et al, 2001). Although not overly complicated, the syntax for formulating queries did need to be studied so that my search was carried out as I wished. This is particularly important when considering the size of the BNC and the resulting fact that searches take significant amounts of time to complete.

2.3.3 Unix core tools

For ordering the phrases output by the gsearch program I required the standard Unix text processing tools uniq and sort. Church provides some relevant documentation, giving useful examples of ways to use them effectively for natural language processing tasks.

2.3.4 Bilingual Dictionary

In order to generate candidate translations, an existing, machine-readable German-English dictionary is required. Ideally the dictionary to be used will be large, in order to provide a range of translations for each word. Initial research into obtaining a suitable dictionary proved somewhat unsuccessful. Although a number of dictionaries available in electronic form are free of charge, the underlying data has not been made available to the public. These are, therefore, of no use when trying to carry out automated tasks with them. One alternative available was a word list, available online. Although it would have been possible to use this to form translations, it would not have been very satisfactory as the number of translations for each word was not very high. An advantage of such a word list, however, is that it only contains single-word translations for each word, and this would simplify the translation process. The dictionary that was eventually obtained is the SGML-encoded version of The Oxford-Duden German-English Dictionary (Oxford University Press, 1999). This is a high-quality resource to have available and is described by the publishers as being "the product of nearly ten years of joint work by two of the world's foremost dictionary publishers".
Since this is a commercial product, it is likely to have been subject to a large degree of research and quality control and is likely to provide accurate and complete translations.

2.3.5 CELEX lexical database

CELEX (CEntre for LEXical Information) (Baayen et al, 1995) is a lexical database developed at Nijmegen in the Netherlands and holds information about words in the English, German and Dutch languages.

The information contained within CELEX can be applied for inflecting adjectives when creating candidate translations, because it stores the various wordforms which are connected to the uninflected lemmas. Therefore, by knowing the lemma, the requisite wordform can be generated, depending on the situation. CELEX consists of individual database files that are plain ASCII text files, with each entry occurring on a separate line and divided into fields, with a backslash separating each field. The following is an example of the format, here an extract from the file gml.cd:

1\A\563\M\1\Y\Y\Y\A\N\N\N\N\(A)[N]\N\N\N\N\S3/P2\N
2\Ae\4\M\1\Y\Y\Y\Ae\N\N\N\N\(Ae)[N]\N\N\N\N\S3/P2\N
3\aalen\1\Z\1\Y\Y\Y\Aal\N\N\N\N\((Aal)[N])[V]\N\N\N\N\r1\N
4\Aal\80\M\1\Y\Y\Y\Aal\N\N\N\N\(Aal)[N]\N\N\N\N\S1/P1\N

On their own the entries do not appear to provide a great deal of information, so the accompanying documentation was consulted in order to ascertain what is represented by each field in each file. The CELEX user manual for German (Baayen et al, 1995) gives a detailed description of all that is covered by the database, and each file is also accompanied by a README which gives the meaning of each field in the file.

2.3.6 Regular Expressions

In designing the translation tool, use must be made of regular expressions as a means of searching through texts. Ramsay presents a guide outlining the concepts involved and providing numerous examples to illustrate their application.
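The backslash-separated CELEX format shown in Section 2.3.5 can be read with very little code. The Python sketch below splits each line into its fields and builds a lookup from the first field to the second. Treating the first field as the entry number and the second as the lemma is an assumption read off the extract above. The authoritative field meanings are those given in the CELEX README files, and the function names here are invented for illustration.

```python
def parse_celex_line(line):
    """Split one backslash-separated CELEX record into a list of fields."""
    return line.rstrip("\n").split("\\")


def lemma_index(lines):
    """Map field 1 (assumed: entry number) to field 2 (assumed: lemma)."""
    index = {}
    for line in lines:
        fields = parse_celex_line(line)
        index[int(fields[0])] = fields[1]
    return index


# The sample lines are copied from the gml.cd extract in the text.
SAMPLE = [
    r"1\A\563\M\1\Y\Y\Y\A\N\N\N\N\(A)[N]\N\N\N\N\S3/P2\N",
    r"2\Ae\4\M\1\Y\Y\Y\Ae\N\N\N\N\(Ae)[N]\N\N\N\N\S3/P2\N",
    r"3\aalen\1\Z\1\Y\Y\Y\Aal\N\N\N\N\((Aal)[N])[V]\N\N\N\N\r1\N",
    r"4\Aal\80\M\1\Y\Y\Y\Aal\N\N\N\N\(Aal)[N]\N\N\N\N\S1/P1\N",
]

index = lemma_index(SAMPLE)
print(index[3])  # aalen
```

A numeric key of this kind is what allows entries in one CELEX file to be matched against entries in another.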

3 Methodology and Project Management

3.1 Objectives

The methodology for this project relates closely to the objectives of the project, as detailed in the mid-project report. These are reproduced below:

1. To carry out background research relevant to the project area.
2. To search a corpus and obtain from it a list of adjective-noun phrases.
3. To select from these phrases the ones most likely to be collocational, based on frequency (the automatic collocation finder) and collocational criteria.
4. To use a dictionary to obtain candidate translations of these phrases, based on translations of the individual words.
5. To create a program, using the Google API, that takes the candidate translations and selects the most appropriate one, based on the number of hits from the Web search.
6. To give instructions for two native speakers of German to independently evaluate the effectiveness of translations provided by the tool.
7. To compare the approach with other approaches in literature.

This project is not primarily concerned with the development of an application but rather with the analysis of a method of performing machine translation. Therefore, no specific existing methodology is easily applied to this type of project. Rather, it was most important to select the correct tools to carry out the tasks required.

3.2 Project Management

Each of the objectives above relates to a specific milestone that needed to be completed in order to facilitate progress with the project, and they therefore lend themselves naturally to the formulation of a schedule.

Objective 1 was carried out prior to the commencement of any practical work to give the necessary background knowledge and an appreciation of the tools involved. This was scheduled to be carried out by 24 October 2003 (also the deadline for submission of the project's minimum requirements) and was completed on schedule.
As well as the initial stage of background research, research was ongoing throughout the course of the project.

Objectives 2 and 3 are closely related and are concerned with the extraction of phrases from a corpus. They required the use of existing tools available within the School of Computing, as selected during the background research stage of the project. This stage identified and applied the tools most suitable for the extraction of adjective-noun phrases and for identifying them as collocations. Although this stage related to the collection of data for the evaluation process, it was, in fact, carried out prior to the development of the translation tool. This was to give a general grounding in the methods required and also to prepare the evaluation data so that it was ready as soon as the translation tool had been created. These two milestones were scheduled to be completed by the end of Week 11 of Semester 1, prior to the submission of the mid-project report, and were completed on time.

Objective 4, the implementation of the candidate generation tool, involved the greatest amount of practical work. Although a large amount of time was allocated for its completion, the schedule did not fully allow for the fact that the Christmas break would largely be taken up with exam preparation, and no significant amount of work was completed on the project over this period. Consequently the development of the translation tool became delayed. In addition to this, problems with the implementation were encountered, as discussed in Chapter 4. This meant that although a date of 15 February 2004 was set for its completion, it was not fully implemented until 15 March 2004.

Objective 5, creating the program to rate translations, was scheduled to be completed by 29 February 2004, yet this also became delayed due to the delay in completing Objective 4. However, this objective did not require as substantial an amount of work and could also be partly implemented whilst Objective 4 was in progress.
This meant that it was completed by 30 March 2004, behind schedule, yet the amount of time by which the project was behind the initial schedule had not increased.

It would have been impossible to complete Objective 6 without first having completed all of the previous objectives. It had originally been intended to complete the entire evaluation procedure by 15 March. However, at this date there was no data to evaluate, due to delays in completing prior objectives. The form of the evaluation could, however, be devised, meaning that the evaluation could be carried out as quickly as possible once the data became available. All the results of the evaluation were eventually collected by 19 April.

Objective 7 involved firstly analysing other approaches (as completed in the background research) and then comparing them with my work (in the evaluation). This would form part of this final report and, thus, its scheduled completion date was the final deadline, 28 April.

Although I was not able to keep to the schedule towards the end of the project, I had allowed for a certain degree of slippage, and there was up to a month of time allocated to this at the end of the original project schedule. This ensured that, despite not fully adhering to the schedule, the project could still be completed within the deadline. Objective 4 was subject to the largest degree of delay, but once this had been completed the project regained the correct working pace and the remaining objectives were accomplished within a time-scale similar to the original project plan.

18 4 Implementation of adjective-noun translation tool 4.1 Generation of candidate translations Searching a dictionary The initial task in the translation process was the generation of candidate translations using a program that exploits an existing machine readable English-German dictionary. This program will be called CANTRAN. Obtaining a suitable dictionary had proven to be a difficult task, however, a copy of the Oxford Duden German-English Dictionary (Oxford University Press, 1999) was eventually obtained and this would form the basis of my candidate translation tool. Due to copyright restrictions the dictionary could not be installed on School of Computing equipment and, therefore, the dictionary is read directly from a CD-ROM. The first step was to examine the structure of the dictionary. This would give me an insight into the most appropriate way to create a program that searched it. An example of a dictionary entry is : <e id=20968><hg><hw>music</hw></hg> <pr><ph>"mju:zik</ph></pr> <ps>n.</ps><s2 let=a><tr>musik,</tr> <i>die;</i> <ex>make <sd>music</sd>:</ex> <tr>musik machen;</tr> <tr>musizieren;</tr> <ex>student of <sd>music</sd>:</ex> <tr>musikstudent,</tr> <i>der</i>/<tr>studentin,</tr> <i>die;</i> <ex>piece of <sd>music</sd>:</ex> <tr>musikst&uu.ck,</tr> <i>das;</i> <tr>musik,</tr> <i>die;</i> <ex>set <i>or</i> put sth. to <sd>music</sd>:</ex> <tr>etw. vertonen <i>od.</i> in Musik setzen;</tr> <ex>have a gift for <sd>music</sd>:</ex> <tr>musikalisch begabt <i>od.</i> musikbegabt sein;</tr> <ex ty=idiom>be <sd>music</sd> to sb.'s ears</ex> <la>fig.</la> <tr>musik in jmds. 
Ohren sein</tr> <la>ugs.</la>; <ix>see also</ix> <xr><x><xh>face</xh> <xs>2 c</xs></x>; <x><xh>set</xh> <xs>1 s</xs></x>; <x><xh>sphere</xh> <xs>c</xs></x>;</xr></s2><s2 let=b><la>of waves, wind, brook</la> <tr>rauschen,</tr> <i>das;</i> <la>of birds</la> <tr>gesang,</tr> <i>der;</i></s2><s2 let=c><la>score</la> <tr>noten</tr> <i>pl.;</i> <la>as merchandise also</la> <tr>musikalien</tr> <i>pl.;</i> <ex>sheet of <sd>music</sd>:</ex> <tr>notenblatt,</tr> <i>das;</i> <ex>play from <sd>music</sd>:</ex> <tr>nach Noten spielen</tr></s2></e> It can be seen how the headword (the word that requires translation) is enclosed between the <hw> tags and translations of that word are given between the <tr> tags. This is the only information required for the purposes of the translation tool, yet aside from this is given a great deal of other information such as examples or extra grammatical information. The task requires for a tool to be created that takes a headword as input and outputs the various words given as translations. Due to the nature of the translation process, only dictionary translations consisting of a single word will be used. It was decided that CANTRAN should be written using the Perl programming/scripting language. Although I was previously not highly experienced with programming in Perl I knew that it is a language well-suited to carrying out text processing tasks and also one which is relatively 13

19 straightforward to learn and understand. Saltzman (2002) describes Perl as extremely easy to learn and stresses this ease-of-use frequently. Perl is a language which is simple to use yet can be applied to complex tasks. In order to search for the correct headwords in the dictionary, it was necessary to make use of regular expressions. Perl has built in functions for working with regular expressions which could be exploited and this also contributed greatly to my decision to use Perl. According to Saltzman (2002) among all programming languages, none can match the power of regular expressions in Perl. The program was created to search for all cases of <hw>$term</hw> where $term is a variable that holds the string that is to be searched for and has been read in from a file. There are two such variables, to hold the adjective and noun that need to be translated respectively. For each line that contained a match (this should be only one line due to the fact that the dictionary is structured in such a way that each entry corresponds to a single line) the translations of the headword would be obtained by searching for all regular expressions matching /<tr>(.*)<\/tr>/ i.e. all instances where a single word is enclosed between the <tr> tags. Other characters such as commas or brackets would then need to be removed from the translated word and this could all be carried out through the use of regular expressions. This process was carried out for both terms in the adjective-noun phrase and each was passed into a separate array structure. After this all of the elements in each of the two arrays could be combined when outputting in order that all combinations of the two words be output. Upon testing this on a number of examples and analysing the output it was clear that a few issues needed to be resolved due to the ambiguous meaning of certain words. Although it was necessary to translate the different meanings of words I wanted to preserve the correct part of speech, i.e. 
only translate adjectives as adjectives and nouns as nouns. Yet the output from CANTRAN showed that this had not always been achieved, which was especially apparent because, in German, all nouns begin with a capital letter. An example of this problem is the English word good, which may be used as either an adjective or a noun. The issue was overcome by accepting as translations only adjectives starting with a lower-case letter and nouns beginning with an upper-case letter.

After testing CANTRAN on a number of examples, something emerged that had previously been overlooked and that could be used to improve the performance of the tool. German compound words are words made up of two or more words, for example Kernkraft, composed of the nouns Kern and Kraft. When testing with the phrase nuclear power, one of
the translations of nuclear was Kern-, ending with a dash. This indicates that the word can be used as the first part of a compound word and followed by another noun. Because all such terms in the dictionary end with a dash, the tool could be extended to check for them. The program was therefore altered so that, when the adjective translation ended with a dash, the dash was removed and the adjective and noun translations were output together as a single word. For this particular case, the program was also altered to allow adjectives starting with a capital letter, as an exception to the rule above. Some examples of dictionary entries and the corresponding translations output by CANTRAN can be seen in Appendix C.

Handling Agreement

The output from the translation program contains all the possible combinations of translations, yet it does not give all the information required for the Google search. Further complications arise because German adjectives must agree with the noun in gender (and number). Consequently, the required endings change for each noun, and a way of automatically inflecting adjectives is needed. For this project, the searches were restricted to noun phrases in the nominative case. To generate the required adjectival endings automatically, a resource available in the School of Computing, CELEX (see section 2.3.5), was used. The three CELEX database files required for forming the correct adjectival declensions are:

German morphology, lemmas (gml.cd). This file gives morphological information on all the lemmas contained in the database.
German syntax, lemmas (gsl.cd). This file gives syntactic information on the lemmas in the database.
German morphology, wordforms (gmw.cd). This file lists the different declensions of the lemmas from the other files, which are cross-referenced using a unique number.
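The candidate-generation steps described above for CANTRAN (headword lookup, clean-up of the translations, the capitalisation filter and the compound rule) can be sketched as follows. This is only an illustration in Python, not the Perl program itself. The dictionary lines are invented samples in the <hw>/<tr> markup the text mentions, the function names are hypothetical, the pattern is made non-greedy so that several <tr> pairs on one line are matched separately, and lower-casing the noun when forming a compound is an assumption made so that Kern- plus Kraft gives Kernkraft.

```python
import re

# Invented sample entries, one per line, in the <hw>/<tr> markup described in the text.
SAMPLE_DICTIONARY = [
    "<hw>nuclear</hw> <tr>nuklear</tr> <tr>Kern-</tr>",
    "<hw>power</hw> <tr>Kraft,</tr> <tr>(Macht)</tr>",
    "<hw>good</hw> <tr>gut</tr> <tr>Gut</tr>",
]

# Non-greedy match (an assumption of this sketch) so each <tr>...</tr> pair is captured on its own.
TR_PATTERN = re.compile(r"<tr>(.*?)</tr>")


def translations(term, dictionary_lines):
    """Return the cleaned single-word translations listed for one headword."""
    headword = "<hw>" + term + "</hw>"
    found = []
    for line in dictionary_lines:
        if headword in line:
            for raw in TR_PATTERN.findall(line):
                # Strip commas and brackets, as the text describes.
                word = re.sub(r"[,()\[\]]", "", raw).strip()
                if word and " " not in word:
                    found.append(word)
    return found


def candidates(adjective, noun, dictionary_lines):
    """Combine adjective and noun translations, keeping the part of speech."""
    # Nouns must start with an upper-case letter.
    nouns = [n for n in translations(noun, dictionary_lines) if n[0].isupper()]
    result = []
    for adj in translations(adjective, dictionary_lines):
        if adj.endswith("-"):
            # Compound prefix: drop the dash and join; a capital letter is allowed here.
            for n in nouns:
                result.append(adj[:-1] + n[0].lower() + n[1:])
        elif adj[0].islower():
            # Ordinary adjectives must start with a lower-case letter.
            for n in nouns:
                result.append(adj + " " + n)
    return result


print(candidates("nuclear", "power", SAMPLE_DICTIONARY))
```

With the sample entries, the noun reading Gut of good is filtered out, while Kern- yields one-word compounds that need no further inflection.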
To access the information contained in the fields of these files, it was decided at first to use standard command-line tools for text processing. By using the tr command to replace every backslash with a space, the text was left in columns separated by white space. The necessary entry can be found using the grep command, and the columns within the entry are easy to manipulate using awk, since within awk the variables $1, $2
and so on represent the different columns, which makes locating values within particular fields a simple task.

The file gmw.cd lists all of the different wordforms of German words and therefore contains all of the adjectival declensions possible for any adjective. The adjective itself, before any declension, is given in the lemma file gml.cd. What is required, therefore, is to identify the adjective lemma in gml.cd and match it with the corresponding wordforms in gmw.cd. This is possible because in CELEX each lemma is given a unique code, and in the wordforms file all forms of the lemma are referenced by this code. This can be illustrated with the German adjective jung. The corresponding entry in the lemma file (gml.cd) is:

    19108\jung\2236\M\1\Y\Y\Y\jung\A\N\N\N\(jung)[A]\N\N\N\N\I\N

The first column (19108) is the unique lemma identifier, and matching this value against the entries in the wordforms file gives:

    \juengst\18\19108\u
    \juengste\108\19108\u
    \juengsten\216\19108\u
    \juengster\65\19108\u
    \juengstem\0\19108\u
    \juengstes\5\19108\u
    \jung\194\19108\o
    \junge\485\19108\o
    \jungen\513\19108\o
    \junger\282\19108\o
    \jungem\4\19108\o
    \junges\74\19108\o
    \juenger\46\19108\c
    \juengere\70\19108\c
    \juengeren\126\19108\c
    \juengerer\28\19108\c
    \juengerem\0\19108\c
    \juengeres\2\19108\c8

Note how all entries in the fourth column are 19108. These are, therefore, all the possible forms of the lemma jung. To create the candidate translations we do not, however, need all of these wordforms. The final column (called FlectType) shows how the word has been inflected. As the Web search is only carried out for phrases in the nominative case, adjectives need to be declined with the endings -e, -er and -es, depending on the gender of the particular noun. These endings correspond to the entries that have o4, o6 and o8 in the final field, as explained in the CELEX reference manual for the German part of the database.
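The join between the lemma file and the wordforms file can be sketched as below. This is a minimal Python illustration rather than the tr, grep and awk pipeline described in the text, and the function names are hypothetical. The rows are copied from the listing above as they appear there, so the first field of each wordform row is empty and the FlectType codes are incomplete. For that reason the sketch keeps rows whose code starts with "o" and then picks the -e, -er and -es forms by their ending, which is an assumption of this sketch; with complete codes one would filter on FlectType directly.

```python
# Rows copied from the listing in the text; raw strings keep the backslash separators.
LEMMA_ROWS = [r"19108\jung\2236\M\1\Y\Y\Y\jung\A\N\N\N\(jung)[A]\N\N\N\N\I\N"]
WORDFORM_ROWS = [
    r"\juengst\18\19108\u",
    r"\juengste\108\19108\u",
    r"\jung\194\19108\o",
    r"\junge\485\19108\o",
    r"\jungen\513\19108\o",
    r"\junger\282\19108\o",
    r"\jungem\4\19108\o",
    r"\junges\74\19108\o",
    r"\juenger\46\19108\c",
    r"\juengere\70\19108\c",
]


def lemma_id(lemma_rows, headword):
    """Return the identifier (first column) of the lemma whose second column is headword."""
    for row in lemma_rows:
        fields = row.split("\\")
        if fields[1] == headword:
            return fields[0]
    return None


def wordforms(wordform_rows, identifier):
    """Return (wordform, FlectType) pairs whose fourth column equals the lemma identifier."""
    forms = []
    for row in wordform_rows:
        fields = row.split("\\")
        if fields[3] == identifier:
            forms.append((fields[1], fields[4]))
    return forms


def nominative_forms(forms, lemma):
    """Keep the forms ending in -e, -er or -es among rows whose code starts with 'o'."""
    wanted = {lemma + ending for ending in ("e", "er", "es")}
    return sorted(word for word, flect in forms if flect.startswith("o") and word in wanted)


identifier = lemma_id(LEMMA_ROWS, "jung")
print(identifier, nominative_forms(wordforms(WORDFORM_ROWS, identifier), "jung"))
```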
Ascertaining the gender of the noun is also a task that can be carried out using CELEX. The German syntax, lemmas file (gsl.cd) gives, in the fifth column, GendNum, a code corresponding to the gender of the
noun. Using a method similar to that used for the adjective, this code is easily obtained for every noun that is searched. A code of 1, 2 or 3 is returned, corresponding to the genders masculine, feminine and neuter respectively. With this code, the requisite adjectival endings can be generated.

I initially intended to carry out all of the work involving CELEX with Unix shell scripts that simply called the appropriate commands. However, I had little experience in creating such scripts and encountered particular problems in, for example, linking up the output returned from searching the different files. It appeared that using CELEX to generate the adjective endings might not be possible, and the progress of the project was becoming slightly delayed. Without completing this part of the project it would not be possible to move on to the next stage: without inflected adjectives there would be no candidate translations, and the final search process could not be carried out.

There were alternative methods that could be considered instead of using the information in CELEX to generate the adjective endings. One possibility was simply to add -e to the end of every adjective. This would generate the correctly declined adjectives for all genders in the nominative case with the definite article. Alternatively, one could add the endings -e, -er and -es to all adjectives, creating the correct endings for all genders in the nominative case, whether the article is definite, indefinite or absent. One of the main drawbacks of this method is that it leads to a large amount of redundancy, as combinations are generated that are not valid. This would not make an overall difference to the results, as invalid combinations would be expected to return no results when the search is carried out.
However, it would introduce unnecessary computation and create practical problems if a large number of queries had to be dealt with. Furthermore, the number of queries that a single user may carry out using the Google Web API in a single day is limited, so submitting phrases that are not valid would effectively waste available queries.

A further problem with both of the above alternatives is that they only work for adjectives that are declined in the regular way and thus do not deal with the many exceptions to the standard rules. For example, for some adjectives ending with a vowel, the vowel is left out when the adjective is declined. Although it would not be excessively difficult to program these exceptions, it would require a certain amount of coding and would still not produce definitive results, because there are exceptions to the exceptions.
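The two suffix-based alternatives, and the way they fail on irregular adjectives, can be illustrated with a short Python sketch. The function names are hypothetical. The gender codes 1, 2 and 3 follow the CELEX convention given above; the mapping of genders to the endings -er, -e and -es is the standard nominative pattern without a definite article and is stated here as background rather than taken from the text. The adjective hoch (declined hohe, hoher, hohes) is used as an example of an irregular adjective and is not one discussed in the text.

```python
# Nominative endings used when there is no definite article, keyed by the gender code.
STRONG_ENDING = {1: "er", 2: "e", 3: "es"}  # 1 masculine, 2 feminine, 3 neuter


def all_endings(adjective):
    """Second alternative: generate every nominative ending, valid or not."""
    return [adjective + ending for ending in ("e", "er", "es")]


def endings_for_gender(adjective, gender_code):
    """Gender-aware variant: -e (definite article) plus the ending for this gender."""
    return sorted({adjective + "e", adjective + STRONG_ENDING[gender_code]})


print(all_endings("jung"))            # three queries per adjective-noun pair
print(endings_for_gender("jung", 3))  # fewer queries once the gender is known
print(all_endings("hoch"))            # regular suffixing misses the real forms of hoch
```

Generating all three endings triples the number of queries for every candidate pair, which is the redundancy the text refers to, and neither variant produces hohe for hoch.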

I decided that it would therefore be best to persevere with a method that exploited the CELEX database and the fact that it contains only valid wordforms. Rather than using shell scripts and Unix commands alone, I would instead attempt to create a Perl program that was able to inflect the adjectives correctly. The program that I eventually created, INFLECT, uses a combination of the two methods: it is a Perl program that calls system commands. I decided to do this because the necessary awk and tr commands had already been written, so this was the best way to move the project along and avoid repeating work. INFLECT reads a German adjective and noun from a file and, using system calls as above, finds the gender of the noun and the correctly declined adjectives. The output from a successful run of CANTRAN can therefore be input to INFLECT to create the candidate translations, ready to be submitted to the Web search.

On testing the program, a number of adjectives were not found in CELEX. These would cause a problem, as no candidate translation would be put forward to the Web search. For such terms I therefore decided to use the alternative method described above and simply added the applicable endings to each adjective, depending on the gender of the noun. This works for the majority of cases.

A final issue relates to the compound words described earlier in connection with the dictionary search. These compound terms do not require any inflection and should therefore be treated differently when they are passed to INFLECT. To deal with these, the program was set up so that any translation generated by CANTRAN that consists of only a single word is not altered when output by INFLECT.
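The decision logic described for INFLECT (pass single-word compounds through, use the CELEX forms when the adjective is found, otherwise fall back to adding regular endings) might look roughly like this. It is a Python sketch under stated assumptions, not the Perl program: the CELEX lookups are replaced by two hypothetical in-memory dictionaries, celex_forms and noun_gender, and the sample entries are illustrative.

```python
STRONG_ENDING = {1: "er", 2: "e", 3: "es"}  # gender codes as in CELEX: 1 m, 2 f, 3 n


def inflect(candidate, celex_forms, noun_gender):
    """Return the inflected candidate phrases for one CANTRAN output line.

    celex_forms: hypothetical mapping adjective -> {gender code: [declined forms]}
    noun_gender: hypothetical mapping noun -> gender code (1, 2 or 3)
    """
    words = candidate.split()
    if len(words) == 1:
        # Compound such as Kernkraft: no inflection needed.
        return [candidate]
    adjective, noun = words
    gender = noun_gender[noun]
    if adjective in celex_forms:
        forms = celex_forms[adjective][gender]
    else:
        # Fallback for adjectives missing from CELEX: regular endings only.
        forms = sorted({adjective + "e", adjective + STRONG_ENDING[gender]})
    return [form + " " + noun for form in forms]


# Illustrative data; a real run would read these values from the CELEX files.
CELEX_FORMS = {"hoch": {1: ["hohe", "hoher"], 2: ["hohe"], 3: ["hohe", "hohes"]}}
NOUN_GENDER = {"Turm": 1, "Kraft": 2}

print(inflect("Kernkraft", CELEX_FORMS, NOUN_GENDER))
print(inflect("hoch Turm", CELEX_FORMS, NOUN_GENDER))
print(inflect("nuklear Kraft", CELEX_FORMS, NOUN_GENDER))
```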
4.2 Translation selection using Web search

The final stage of the implementation involved submitting the candidate translations generated by CANTRAN and INFLECT to a search engine, in order to select as the best translation the one that receives the greatest number of hits. Carrying out such a search by hand would not be practical, so I used the Google Web API provided by Google to carry out the search automatically. This tool allows direct querying of the Google Web directory from within a program written in the language of your choice. It was again decided that the Google search program would be created in Perl. After programming the translation tool in Perl I felt more confident in using the language, and it would certainly be advantageous for both programs to be written in the same programming language.

The construction of the Google search program was not complicated. Writing it simply involved referencing the correct libraries for the Google Web API and setting up a search that returns the total number of hits for the submitted query. Testing the program raised an issue: the number of results returned by a search using the Web API did not correspond to the number of results when the same search was carried out manually on the Google website. Initially this was thought to be due to an error in the program, but checking it revealed no obvious errors, and the query appeared to be set up correctly. The Frequently Asked Questions on the Google website gave no further insight into the problem. However, the Google Web API discussion groups, also on the Google website, made it apparent that the problem was neither unique to me nor a fault in the program, but a general issue with the tool that had been experienced by a number of users. As this was the case, nothing could be done to overcome the problem. Although the actual numbers differed between searches on the Google website and searches using the API, the relative frequencies of different queries remained the same, meaning that the selection of the best translation should be unaffected. Some users of the discussion groups also suggested that the figures quoted by the API are actually more accurate than those provided by the Web search, although this claim cannot be fully substantiated.

The program reads any number of candidate translations from a file and submits them as queries to the Google Web search. Once all candidate translations have been searched, the program ranks the search results by number of hits.
The top-ranking query is then taken as the correct translation. Where the candidate translation consists of differently inflected forms of the same translation, the results for each form were summed to give an overall figure for each phrase.
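The selection step can be sketched as follows, again in Python for illustration only. No search API is called here: hit_count is a hypothetical function standing in for whatever returns the number of hits for a query, and the figures in FAKE_HITS are invented purely to make the example run. The sketch sums the hits of the inflected variants of each translation and ranks the totals.

```python
def rank_translations(groups, hit_count):
    """Rank translations by total hits.

    groups: mapping from a translation label to its inflected query strings
    hit_count: hypothetical callable returning the number of hits for one query
    """
    totals = {
        label: sum(hit_count(query) for query in queries)
        for label, queries in groups.items()
    }
    # Highest total first; the first entry is taken as the chosen translation.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented hit counts, used only so that the example is runnable.
FAKE_HITS = {
    "Kernkraft": 900,
    "nukleare Kraft": 40,
    "nukleare Macht": 15,
}

GROUPS = {
    "Kernkraft": ["Kernkraft"],
    "nuklear Kraft": ["nukleare Kraft"],
    "nuklear Macht": ["nukleare Macht"],
}

ranking = rank_translations(GROUPS, lambda query: FAKE_HITS.get(query, 0))
print(ranking[0][0])
```

Because only the ordering of the totals matters, a systematic difference between API counts and website counts would leave the chosen translation unchanged as long as the relative frequencies are preserved.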

5 Collocation Finder

5.1 Extraction of adjective-noun phrases from a corpus

For the evaluation of the final translation tool, a number of adjective-noun phrases to be translated had to be selected. These were extracted from the British National Corpus (BNC). The query used to extract all the adjective-noun phrases from the BNC, using the Gsearch tool discussed in the background research chapter, is given below:

    gsearch -M -nc bnc_10 ../demo/GrammarBNC <tag = AJ0> <tag = NN1>

Options:

    -M : to search for more than one adjective-noun sequence per sentence in the BNC
    -nc : to give no context data about individual words
    bnc_10 : the corpus to search, here 10% of the BNC
    ../demo/GrammarBNC : the grammar to use in the search
    tag = AJ0 : to search for, firstly, adjectives, excluding superlatives and comparatives
    tag = NN1 : to search for singular nouns (not proper nouns)

Originally it was intended to search the entire BNC (100 million words), but this was not possible in practice because the output file for the entire corpus was far too large for my disk quota. For the purposes of this project the corpus, at 10 million words, is still large enough to give a representative sample. This is shown by the fact that the search returned 290,700 adjective-noun combinations. This is the actual number of combinations returned (the tokens) and, at this stage, includes duplicated phrases.

The output from the gsearch query was a file containing the collocations but also a lot of formatting data. To leave only the adjective-noun combinations I required, a C++ program was created. This short program simply reads the entire gsearch output from a file and outputs only the actual adjective-noun phrases. Because the text was formatted in such a way that the required lines appeared at line numbers that were multiples of a constant, creating the necessary program was fairly trivial.
Checking that the output was as expected was also straightforward: it only required confirming that the number of lines in the output file corresponded to the number of adjective-noun combinations returned by the gsearch query.
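The filtering step and its line-count check can be sketched like this. The original program was written in C++; this Python version is only an illustration. The text does not give the constant or the layout of the gsearch output, so the step value of 3 and the sample lines below are hypothetical.

```python
def extract_phrases(lines, step):
    """Keep the lines whose 1-based line number is a multiple of step."""
    return [
        line.strip()
        for number, line in enumerate(lines, start=1)
        if number % step == 0
    ]


# Hypothetical output layout: two formatting lines followed by one phrase line.
SAMPLE_OUTPUT = [
    "formatting line",
    "formatting line",
    "prime minister",
    "formatting line",
    "formatting line",
    "local authority",
]

phrases = extract_phrases(SAMPLE_OUTPUT, step=3)

# Sanity check described in the text: one output line per combination reported by the search.
expected_combinations = 2
print(phrases, len(phrases) == expected_combinations)
```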

5.2 Selecting collocations by frequency

Once the list of adjective-noun combinations had been created, it was necessary to order it by frequency to see which were the most common adjective-noun phrases and thus the most likely to be collocations. To do this, the Unix core tools uniq and sort were used. First, the file was sorted alphabetically using sort, and then uniq -c was run to remove all duplicate phrases; the one remaining instance of each phrase was preceded by a number showing the frequency with which it occurred. Following this, the phrases were arranged in frequency order using the command sort -nr, the -n option telling it to sort in numerical order and the -r option reversing the order, i.e. putting the most frequent phrases at the top. After the removal of duplicates, the number of unique adjective-noun combinations was 161,810 (the number of types).

For the purposes of comparison, as well as using the most frequent phrases, the whole candidate translation generation and selection procedure was also carried out with some phrases that occurred less often. It was therefore decided to look also at phrases that occurred in the BNC with a frequency of 2. This was done because I wanted to use phrases that occurred with a low frequency but not those that occurred only once, as many of these would be liable to be obscure phrases for which the dictionary might not provide any suitable translations. In the 10% subsection of the BNC that I used, there are 18,265 adjective-noun phrases that occur with a frequency of 2. Thus, the final list of phrases to be translated (attached as Appendix E) consists of the 50 most frequent phrases and 50 phrases chosen at random from those that occurred twice. Of the 50 most common phrases it was decided that 40 were actually collocations, and of those that occurred twice this figure stood at 22.
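The counting and selection described in this section can be sketched in Python as an illustration of what the sort, uniq -c and sort -nr pipeline produces. The function names and the tiny phrase list are hypothetical, ties are broken alphabetically here (the command-line tools may order ties differently), and the random seed is an assumption added only to make the sample repeatable.

```python
import random
from collections import Counter


def frequency_table(phrases):
    """Count each phrase and order by descending frequency, then alphabetically."""
    counts = Counter(phrases)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def select_test_phrases(table, top_n, low_frequency, sample_size, seed):
    """Return the top_n most frequent phrases and a random sample of low-frequency ones."""
    top = [phrase for phrase, _ in table[:top_n]]
    low = [phrase for phrase, count in table if count == low_frequency]
    rng = random.Random(seed)
    return top, rng.sample(low, min(sample_size, len(low)))


# Tiny invented token list; the real input had 290,700 tokens.
TOKENS = (
    ["prime minister"] * 3
    + ["local authority"] * 2
    + ["green door"] * 2
    + ["odd socket"]
)

table = frequency_table(TOKENS)
top, low_sample = select_test_phrases(table, top_n=1, low_frequency=2, sample_size=1, seed=0)
print(table)
print(top, low_sample)
```

In the project itself the corresponding parameters would be a top_n of 50, a low_frequency of 2 and a sample_size of 50.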
Deciding whether or not these were actually collocations involved analysing them using the definitions and criteria detailed in the background section on collocations. Despite these definitions and criteria, determining whether the phrases were in fact collocations remained a non-trivial and somewhat subjective task. In analysing the effectiveness of the collocation finder we are testing the hypothesis that frequency is an effective way of selecting collocations. Manning and Schuetze (1999) state that the frequency method, when filtered by part of speech (as was done here by selecting only adjective-noun phrases), gives surprisingly good results. This is backed up by my findings: the proportion of collocational phrases in the set of high-frequency phrases (40/50) is significantly higher than the proportion in the lower-frequency phrases (22/50) according to a t-test at the 5% level of significance.
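The significance claim can be checked with a short calculation. The text does not say which form of the t-test was used, so the Python sketch below assumes a two-sample, pooled-variance t-test on 0/1 indicators (collocation or not) for the two sets of 50 phrases; the function name is hypothetical.

```python
import math


def two_sample_t(successes_a, n_a, successes_b, n_b):
    """Pooled-variance t statistic for two samples of 0/1 outcomes."""
    mean_a = successes_a / n_a
    mean_b = successes_b / n_b
    # Unbiased sample variance of a 0/1 variable: p(1 - p) * n / (n - 1).
    var_a = mean_a * (1 - mean_a) * n_a / (n_a - 1)
    var_b = mean_b * (1 - mean_b) * n_b / (n_b - 1)
    degrees_of_freedom = n_a + n_b - 2
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / degrees_of_freedom
    standard_error = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, degrees_of_freedom


t_value, df = two_sample_t(40, 50, 22, 50)
print(round(t_value, 2), df)
```

Under these assumptions the statistic is a little under 4 with 98 degrees of freedom, well above the two-sided 5% critical value of roughly 1.98, which is consistent with the conclusion in the text.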


More information

CHARTES D'ANGLAIS SOMMAIRE. CHARTE NIVEAU A1 Pages 2-4. CHARTE NIVEAU A2 Pages 5-7. CHARTE NIVEAU B1 Pages 8-10. CHARTE NIVEAU B2 Pages 11-14

CHARTES D'ANGLAIS SOMMAIRE. CHARTE NIVEAU A1 Pages 2-4. CHARTE NIVEAU A2 Pages 5-7. CHARTE NIVEAU B1 Pages 8-10. CHARTE NIVEAU B2 Pages 11-14 CHARTES D'ANGLAIS SOMMAIRE CHARTE NIVEAU A1 Pages 2-4 CHARTE NIVEAU A2 Pages 5-7 CHARTE NIVEAU B1 Pages 8-10 CHARTE NIVEAU B2 Pages 11-14 CHARTE NIVEAU C1 Pages 15-17 MAJ, le 11 juin 2014 A1 Skills-based

More information

Guidelines and Procedures for Project Management

Guidelines and Procedures for Project Management Guidelines and Procedures for Project Management Coin-OR Foundation May 17, 2007 Contents 1 Introduction 3 2 Responsibilities 3 3 Contacts and Information 4 4 Definitions 4 5 Establishing a New Project

More information

PATENTS ACT 1977. Whether patent application GB 2383152 A relates to a patentable invention DECISION

PATENTS ACT 1977. Whether patent application GB 2383152 A relates to a patentable invention DECISION BL O/255/05 PATENTS ACT 1977 14 th September 2005 APPLICANT Oracle Corporation ISSUE Whether patent application GB 2383152 A relates to a patentable invention HEARING OFFICER Stephen Probert DECISION Introduction

More information

Special Topics in Computer Science

Special Topics in Computer Science Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS

More information

Albert Pye and Ravensmere Schools Grammar Curriculum

Albert Pye and Ravensmere Schools Grammar Curriculum Albert Pye and Ravensmere Schools Grammar Curriculum Introduction The aim of our schools own grammar curriculum is to ensure that all relevant grammar content is introduced within the primary years in

More information

The Oxford Learner s Dictionary of Academic English

The Oxford Learner s Dictionary of Academic English ISEJ Advertorial The Oxford Learner s Dictionary of Academic English Oxford University Press The Oxford Learner s Dictionary of Academic English (OLDAE) is a brand new learner s dictionary aimed at students

More information

Brill s rule-based PoS tagger

Brill s rule-based PoS tagger Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based

More information

To download the script for the listening go to: http://www.teachingenglish.org.uk/sites/teacheng/files/learning-stylesaudioscript.

To download the script for the listening go to: http://www.teachingenglish.org.uk/sites/teacheng/files/learning-stylesaudioscript. Learning styles Topic: Idioms Aims: - To apply listening skills to an audio extract of non-native speakers - To raise awareness of personal learning styles - To provide concrete learning aids to enable

More information

InfiniteInsight 6.5 sp4

InfiniteInsight 6.5 sp4 End User Documentation Document Version: 1.0 2013-11-19 CUSTOMER InfiniteInsight 6.5 sp4 Toolkit User Guide Table of Contents Table of Contents About this Document 3 Common Steps 4 Selecting a Data Set...

More information

Modern foreign languages

Modern foreign languages Modern foreign languages Programme of study for key stage 3 and attainment targets (This is an extract from The National Curriculum 2007) Crown copyright 2007 Qualifications and Curriculum Authority 2007

More information

Name: Class: Date: 9. The compiler ignores all comments they are there strictly for the convenience of anyone reading the program.

Name: Class: Date: 9. The compiler ignores all comments they are there strictly for the convenience of anyone reading the program. Name: Class: Date: Exam #1 - Prep True/False Indicate whether the statement is true or false. 1. Programming is the process of writing a computer program in a language that the computer can respond to

More information

Multi language e Discovery Three Critical Steps for Litigating in a Global Economy

Multi language e Discovery Three Critical Steps for Litigating in a Global Economy Multi language e Discovery Three Critical Steps for Litigating in a Global Economy 2 3 5 6 7 Introduction e Discovery has become a pressure point in many boardrooms. Companies with international operations

More information

CHAPTER 4 RESULTS. four research questions. The first section demonstrates the effects of the strategy

CHAPTER 4 RESULTS. four research questions. The first section demonstrates the effects of the strategy CHAPTER 4 RESULTS This chapter presents the statistical analysis of the collected data based on the four research questions. The first section demonstrates the effects of the strategy instruction on the

More information

Problems with the current speling.org system

Problems with the current speling.org system Problems with the current speling.org system Jacob Sparre Andersen 22nd May 2005 Abstract We out-line some of the problems with the current speling.org system, as well as some ideas for resolving the problems.

More information

PHP Debugging. Draft: March 19, 2013 2013 Christopher Vickery

PHP Debugging. Draft: March 19, 2013 2013 Christopher Vickery PHP Debugging Draft: March 19, 2013 2013 Christopher Vickery Introduction Debugging is the art of locating errors in your code. There are three types of errors to deal with: 1. Syntax errors: When code

More information

Administrative Support Professionals Competency Framework. The Centre for Learning and Development

Administrative Support Professionals Competency Framework. The Centre for Learning and Development Administrative Support Professionals Competency Framework The Centre for Learning and Development Table of Contents 01. Acknowledgements...3 02. Introduction...4 03. Background...5 04. Competency Assessment

More information

Numbers 101: Cost and Value Over Time

Numbers 101: Cost and Value Over Time The Anderson School at UCLA POL 2000-09 Numbers 101: Cost and Value Over Time Copyright 2000 by Richard P. Rumelt. We use the tool called discounting to compare money amounts received or paid at different

More information

DATA QUALITY DATA BASE QUALITY INFORMATION SYSTEM QUALITY

DATA QUALITY DATA BASE QUALITY INFORMATION SYSTEM QUALITY DATA QUALITY DATA BASE QUALITY INFORMATION SYSTEM QUALITY The content of those documents are the exclusive property of REVER. The aim of those documents is to provide information and should, in no case,

More information

Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013

Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013 Markus Dickinson Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013 1 / 34 Basic text analysis Before any sophisticated analysis, we want ways to get a sense of text data

More information

STYLE AND FORMAT REQUIREMENTS MASTERS OF SCIENCE THESIS

STYLE AND FORMAT REQUIREMENTS MASTERS OF SCIENCE THESIS OFFICE OF GRADUATE STUDIES STYLE AND FORMAT REQUIREMENTS MASTERS OF SCIENCE THESIS The University of Wisconsin-Green Bay graduate programs in Applied Leadership for Teaching and Learning and Environmental

More information

Study Plan for Master of Arts in Applied Linguistics

Study Plan for Master of Arts in Applied Linguistics Study Plan for Master of Arts in Applied Linguistics Master of Arts in Applied Linguistics is awarded by the Faculty of Graduate Studies at Jordan University of Science and Technology (JUST) upon the fulfillment

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

AS-LEVEL German. Unit 2 Speaking Test Mark scheme. 1661 June 2015. Version 1.0 Final Mark Scheme

AS-LEVEL German. Unit 2 Speaking Test Mark scheme. 1661 June 2015. Version 1.0 Final Mark Scheme AS-LEVEL German Unit 2 Speaking Test scheme 1661 June 2015 Version 1.0 Final Scheme schemes are prepared by the Lead Assessment Writer and considered, together with the relevant questions, by a panel of

More information

COURSE OBJECTIVES SPAN 100/101 ELEMENTARY SPANISH LISTENING. SPEAKING/FUNCTIONAl KNOWLEDGE

COURSE OBJECTIVES SPAN 100/101 ELEMENTARY SPANISH LISTENING. SPEAKING/FUNCTIONAl KNOWLEDGE SPAN 100/101 ELEMENTARY SPANISH COURSE OBJECTIVES This Spanish course pays equal attention to developing all four language skills (listening, speaking, reading, and writing), with a special emphasis on

More information

Cambridge Primary English as a Second Language Curriculum Framework

Cambridge Primary English as a Second Language Curriculum Framework Cambridge Primary English as a Second Language Curriculum Framework Contents Introduction Stage 1...2 Stage 2...5 Stage 3...8 Stage 4... 11 Stage 5...14 Stage 6... 17 Welcome to the Cambridge Primary English

More information

Predicting Web Hosting Trend by Analyzing the Wikipedia Article Traffic

Predicting Web Hosting Trend by Analyzing the Wikipedia Article Traffic Predicting Web Hosting Trend by Analyzing the Wikipedia Article Traffic Rocco Pascale Manhattan College September 13, 2013 Abstract Popular search engines, such as Google and Bing, provide the user with

More information

ChildFreq: An Online Tool to Explore Word Frequencies in Child Language

ChildFreq: An Online Tool to Explore Word Frequencies in Child Language LUCS Minor 16, 2010. ISSN 1104-1609. ChildFreq: An Online Tool to Explore Word Frequencies in Child Language Rasmus Bååth Lund University Cognitive Science Kungshuset, Lundagård, 222 22 Lund rasmus.baath@lucs.lu.se

More information

SEO - Access Logs After Excel Fails...

SEO - Access Logs After Excel Fails... Server Logs After Excel Fails @ohgm Prepare for walls of text. About Me Former Senior Technical Consultant @ builtvisible. Now Freelance Technical SEO Consultant. @ohgm on Twitter. ohgm.co.uk for my webzone.

More information

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged

More information

The Real Challenges of Configuration Management

The Real Challenges of Configuration Management The Real Challenges of Configuration Management McCabe & Associates Table of Contents The Real Challenges of CM 3 Introduction 3 Parallel Development 3 Maintaining Multiple Releases 3 Rapid Development

More information

On-line Submission and Testing of Programming Assignments

On-line Submission and Testing of Programming Assignments On-line Submission and Testing of Programming Assignments Mike Joy and Michael Luck, Department of Computer Science, University of Warwick, COVENTRY, CV4 7AL, UK email: {msj,mikeluck}@dcs.warwick.ac.uk

More information

This presentation explains how to monitor memory consumption of DataStage processes during run time.

This presentation explains how to monitor memory consumption of DataStage processes during run time. This presentation explains how to monitor memory consumption of DataStage processes during run time. Page 1 of 9 The objectives of this presentation are to explain why and when it is useful to monitor

More information

Query term suggestion in academic search

Query term suggestion in academic search Query term suggestion in academic search Suzan Verberne 1, Maya Sappelli 1,2, and Wessel Kraaij 2,1 1. Institute for Computing and Information Sciences, Radboud University Nijmegen 2. TNO, Delft Abstract.

More information

Programming Languages CIS 443

Programming Languages CIS 443 Course Objectives Programming Languages CIS 443 0.1 Lexical analysis Syntax Semantics Functional programming Variable lifetime and scoping Parameter passing Object-oriented programming Continuations Exception

More information

Glossary of translation tool types

Glossary of translation tool types Glossary of translation tool types Tool type Description French equivalent Active terminology recognition tools Bilingual concordancers Active terminology recognition (ATR) tools automatically analyze

More information

How To Write A Comprehensive Exam

How To Write A Comprehensive Exam NURSING GRADUATE PROGRAM PhD COMPREHENSIVE EXAMINATION 2012-2013 Revised October 2012 Table of Contents 1. PURPOSE OF THE COMPREHENSIVE EXAMINATION... 3 2. TIMING OF THE EXAMINATION... 3 3. ROLE OF THE

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

What You Don t Know Will Haunt You.

What You Don t Know Will Haunt You. Comprehensive Consulting Solutions, Inc. Business Savvy. IT Smart. Joint Application Design (JAD) A Case Study White Paper Published: June 2002 (with revisions) What You Don t Know Will Haunt You. Contents

More information

Using SQL Queries in Crystal Reports

Using SQL Queries in Crystal Reports PPENDIX Using SQL Queries in Crystal Reports In this appendix Review of SQL Commands PDF 924 n Introduction to SQL PDF 924 PDF 924 ppendix Using SQL Queries in Crystal Reports The SQL Commands feature

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* Jonathan Yamron, James Baker, Paul Bamberg, Haakon Chevalier, Taiko Dietzel, John Elder, Frank Kampmann, Mark Mandel, Linda Manganaro, Todd Margolis,

More information

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive

More information

Data Coding and Entry Lessons Learned

Data Coding and Entry Lessons Learned Chapter 7 Data Coding and Entry Lessons Learned Pércsich Richárd Introduction In this chapter we give an overview of the process of coding and entry of the 1999 pilot test data for the English examination

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

MA in English language teaching Pázmány Péter Catholic University *** List of courses and course descriptions ***

MA in English language teaching Pázmány Péter Catholic University *** List of courses and course descriptions *** MA in English language teaching Pázmány Péter Catholic University *** List of courses and course descriptions *** Code Course title Contact hours per term Number of credits BMNAT10100 Applied linguistics

More information

Analyzing survey text: a brief overview

Analyzing survey text: a brief overview IBM SPSS Text Analytics for Surveys Analyzing survey text: a brief overview Learn how gives you greater insight Contents 1 Introduction 2 The role of text in survey research 2 Approaches to text mining

More information

Piano Accordion vs. Chromatic Button Accordion

Piano Accordion vs. Chromatic Button Accordion Piano Accordion vs. Chromatic Button Accordion Which is best, piano accordion (PA), or five row chromatic button accordion (CBA)? This is a question which is often debated in newsgroups. The question should

More information

Milk, bread and toothpaste : Adapting Data Mining techniques for the analysis of collocation at varying levels of discourse

Milk, bread and toothpaste : Adapting Data Mining techniques for the analysis of collocation at varying levels of discourse Milk, bread and toothpaste : Adapting Data Mining techniques for the analysis of collocation at varying levels of discourse Rob Sanderson, Matthew Brook O Donnell and Clare Llewellyn What happens with

More information

SOCIETY OF ACTUARIES THE AMERICAN ACADEMY OF ACTUARIES RETIREMENT PLAN PREFERENCES SURVEY REPORT OF FINDINGS. January 2004

SOCIETY OF ACTUARIES THE AMERICAN ACADEMY OF ACTUARIES RETIREMENT PLAN PREFERENCES SURVEY REPORT OF FINDINGS. January 2004 SOCIETY OF ACTUARIES THE AMERICAN ACADEMY OF ACTUARIES RETIREMENT PLAN PREFERENCES SURVEY REPORT OF FINDINGS January 2004 Mathew Greenwald & Associates, Inc. TABLE OF CONTENTS INTRODUCTION... 1 SETTING

More information

How to translate VisualPlace

How to translate VisualPlace Translation tips 1 How to translate VisualPlace The international language support in VisualPlace is based on the Rosette library. There are three sections in this guide. It starts with instructions for

More information

Human-Readable BPMN Diagrams

Human-Readable BPMN Diagrams Human-Readable BPMN Diagrams Refactoring OMG s E-Mail Voting Example Thomas Allweyer V 1.1 1 The E-Mail Voting Process Model The Object Management Group (OMG) has published a useful non-normative document

More information

GCSE Music Unit 3 (42703) Guidance

GCSE Music Unit 3 (42703) Guidance GCSE Music Unit 3 (42703) Guidance Performance This unit accounts for 40% of the final assessment for GCSE. Each student is required to perform two separate pieces of music, one demonstrating solo/it skills,

More information

2.2 Assessors shall not be members of Boards or Joint Boards of Examiners and shall not be entitled unless invited to attend their meetings.

2.2 Assessors shall not be members of Boards or Joint Boards of Examiners and shall not be entitled unless invited to attend their meetings. Regulations for the Examination of Master s Level Degrees 1 Appointment of Examiners 1 Definition of Terms Used: Examiners 1.1 Members of Boards of Examiners shall be designated as Examiners, as follows:

More information

Level 4 Certificate in English for Business

Level 4 Certificate in English for Business Level 4 Certificate in English for Business LCCI International Qualifications Syllabus Effective from January 2006 For further information contact us: Tel. +44 (0) 8707 202909 Email. enquiries@ediplc.com

More information

Database Design For Corpus Storage: The ET10-63 Data Model

Database Design For Corpus Storage: The ET10-63 Data Model January 1993 Database Design For Corpus Storage: The ET10-63 Data Model Tony McEnery & Béatrice Daille I. General Presentation Within the ET10-63 project, a French-English bilingual corpus of about 2 million

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Submission guidelines for authors and editors

Submission guidelines for authors and editors Submission guidelines for authors and editors For the benefit of production efficiency and the production of texts of the highest quality and consistency, we urge you to follow the enclosed submission

More information

Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1

Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1 Korpus-Abfrage: Werkzeuge und Sprachen Gastreferat zur Vorlesung Korpuslinguistik mit und für Computerlinguistik Charlotte Merz 3. Dezember 2002 Motivation Lizentiatsarbeit: A Corpus Query Tool for Automatically

More information