Fuzzy Translation of Cross-Lingual Spelling Variants

Ari Pirkola, Jarmo Toivonen*, Heikki Keskustalo, Kari Visala, Kalervo Järvelin

Department of Information Studies, University of Tampere, Finland
*Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
{ari.pirkola, heikki.keskustalo, kari.visala,

ABSTRACT

We will present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first stage, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second stage, the intermediate forms obtained in the first stage are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The target word list contained English words with the correct equivalents for the source words among them. The source words were translated using the two-step fuzzy translation technique, and the results were compared with those of plain fuzzy matching based translation. The combined technique performed better, sometimes considerably better, than fuzzy matching alone.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms, Performance, Experimentation

Keywords
Cross-language retrieval, Fuzzy matching, Transliteration

1. INTRODUCTION

In many documents and requests for information, technical terms and proper names are important text elements. Their correct translation therefore is crucial for good performance of machine translation (MT) and cross-language information retrieval (CLIR) systems. Paradoxically, technical terms and names are not generally found in electronic translation dictionaries utilised by MT and CLIR systems.
Sometimes such expressions are written identically in different languages and no translation is needed. Often, however, they are non-identical but translatable spelling variants, e.g., Chernobyl - Tshernobyl. This kind of similarity allows the use of n-gram or other fuzzy matching (approximate string matching) based translation, or translation through transliteration. Transliteration refers to phonetic translation across languages with different writing systems [3], such as Arabic to English [11].

In this paper we will present a novel translation technique which is close to transliteration but includes no phonetic elements. We call the technique transformation rule based translation (TRT). The technique is suitable for languages that share a writing system, and for any language pair that shares words of the same origin, i.e., cross-lingual spelling variants. A typical case is a technical term derived from Latin or Greek, or a proper name written differently in different languages. In TRT a word in one language (e.g., Spanish embriologia) is translated into a word in another language (e.g., English embryology) based on the regular correspondences between the characters in spelling variants (for instance, in the example i and ia correspond to y). In this paper, we call a systematically collected list of regular correspondences transformation rules.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR '03, July 28 - August 1, 2003, Toronto, Canada. Copyright 2003 ACM /03/0007 $5.00.
In the first phase of the transformation rule generation process, equivalent term pairs in two languages are extracted from a translation dictionary and aligned pairwise. The rules for the equivalent terms are then generated using edit distance. In this study, the rules were generated for five language pairs, with English always the target language and Finnish, French, German, Spanish, and Swedish the source languages. Of the source languages, French and Spanish belong to the Romance language family and German and Swedish to the Germanic language family, while Finnish is a Fenno-Ugric language. We thus have a representative sample of different types of languages based on the Latin alphabet.

TRT is used in combination with fuzzy matching. As a fuzzy matching technique we used n-gram based matching. We have observed that fuzzy matching alone often fails to rank the correct target word as the best match. For example, Chechnya and Tchetchenie are quite different for any approximate string matching technique. However, the combined technique is often very effective in cases like this. The application of the transformation rules renders a source word more similar to its target equivalent, or the source word may originally be close to its equivalent, so that precise translation may be achieved through fuzzy matching.

We thus devised a two-step fuzzy translation method. In the first stage, by using TRT, source words are translated into intermediate forms which often are more similar to, or identical with, their target language equivalents. In the second stage, the intermediate forms are translated through approximate string matching into their target language equivalents.

The technique was evaluated using a source word list and a target word list including the equivalents of the source words. The target word list consisted of the index terms of CLEF's [5] LA Times document collection. The list is very large, containing [...] unique English words in base form. As an evaluation measure we used precision at the rank where all the equivalents of the source words have been retrieved. It will be shown that the two-step fuzzy translation performs better than fuzzy matching alone. An important factor that affects TRT effectiveness is the source language.

We consider the issue in particular from the standpoint of cross-language information retrieval. In CLIR a user may use his or her native language in searching for foreign language documents [4]. In dictionary-based CLIR, queries are translated into the language of the documents through electronic dictionaries. Technical terms and proper names constitute a major problem in dictionary-based CLIR, since usually just the most commonly used technical terms and names are found in translation dictionaries. However, specific non-dictionary nouns and proper names often supply key evidence on the relevance of documents with respect to a query. Their correct translation is therefore essential for retrieval success.

The rest of this paper is organized as follows. Section 2 presents the methodology and Section 3 the findings. Section 4 contains the discussion and conclusions.

2. METHODS AND DATA

2.1 Transformation Rule based Translation

Edit Distance

Edit distance (ED, Levenshtein distance) is a string similarity measure, defined as the minimum cost needed to convert one string into another. Conversion includes the operations of character substitution (sub), insertion (ins), and deletion (del). We use the term transformation as a general term to cover substitution, insertion, and deletion.
For the strings A and B, edit distance is as follows:

$$ED(A, B) = \min\{N_{sub} + N_{ins} + N_{del}\}$$

The equation thus gives the minimum sum of the number of operations needed to convert string A into string B. Edit distance can be computed using the kind of matrix presented in Figure 1. In the matrix, the total ED is cumulated into the right lower corner. The value of the matrix element d[i, j] is computed by choosing the minimal value from the set

$$\{d[i-1, j] + 1,\; d[i, j-1] + 1,\; d[i-1, j-1] + cost\}$$

where cost = 0 if A[i] = B[j], and cost = 1 if A[i] ≠ B[j]. By looking for the minimum cost paths in the matrix, i.e., the paths from the left upper corner to the right lower corner that produce ED, the changes involved in ED can be constructed. Traversing the matrix vertically corresponds to the deletion of a character, traversing horizontally corresponds to insertion, and traversing diagonally corresponds to substitution.

Figure 1. Edit distance matrix

Figure 2 shows the minimum cost paths for converting the word Haag to Hague. In this case the value of ED is three. It can be seen that five paths lead to the ED of three.

Figure 2. Minimum cost paths

Automatic Generation of Rules

The automatic rule generation process consisted of the following main steps: (1) extracting similar terms from a dictionary, (2) selection of proper terms, and (3) generation of transformation rules.

Extracting similar terms

Sufficiently similar source and target language term pairs were identified and extracted from a dictionary for further processing. The similarity was determined using edit distance with a threshold value.
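The recurrence above and the threshold-based extraction of similar term pairs can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `edit_distance` and `similar_pairs`, the three example dictionary entries, and the threshold of 3 are all hypothetical (the paper does not state the threshold value it used).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs for sub, ins, and del."""
    # d[i][j] holds the cost of converting a[:i] into b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters of a
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion (vertical move)
                d[i][j - 1] + 1,         # insertion (horizontal move)
                d[i - 1][j - 1] + cost,  # match or substitution (diagonal move)
            )
    return d[len(a)][len(b)]  # the total ED accumulates in the lower right corner


def similar_pairs(pairs, threshold):
    """Keep (source, target) pairs whose ED is at most `threshold` (hypothetical cut-off)."""
    return [(s, t) for s, t in pairs if edit_distance(s.lower(), t.lower()) <= threshold]


if __name__ == "__main__":
    print(edit_distance("Haag", "Hague"))  # the paper's Figure 2 example, ED = 3
    entries = [("embriologia", "embryology"), ("alergia", "allergy"), ("perro", "dog")]
    print(similar_pairs(entries, threshold=3))
```

Filtering on raw edit distance favours short words; a length-normalised threshold would be an equally plausible reading of "a threshold value", and the text does not say which was used.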

Selection of proper terms

In the next step, all the transformations that produced the minimum ED were searched for each term pair using a recursive algorithm. The algorithm was based on the AllAlignment algorithm described in [1]. From the result set of all transformations, one transformation was selected. The selection was done using the smallest sum of error values. The error values for the transformations were calculated as follows [2]:

0: terms share the same character at the same position
1: consonant - consonant substitution, and vowel - vowel substitution
1: insertion or deletion of a character
2: consonant - vowel substitution, and vowel - consonant substitution

Generation of rules

In this step, only the selected proper terms were examined. In the first phase of the rule generation process, the rules of double letter/single letter insertions/deletions were generated, e.g., ss → s and s → ss. In the second phase the strings were studied from the start to the end, and differences were recorded. For example, the strings donut and doughnut differ in the substring ugh. After a cleaning process the rule on → oughn was obtained. This means that the string ugh is inserted into the source word between the letters o and n to get the target word.

Context Information, Frequency and Confidence Factor

A given rule typically occurs in a certain location of a word, and prior to and after a certain character. This context information was recorded for each rule. Occurrence information on the rules was put in a hash table, and the frequency of the rule was computed. Confidence factor is defined as the frequency of a rule divided by the number of source terms where the source substring of the rule occurs. All of these (context information, frequency, and confidence factor) are utilised when the automatically generated rules are applied.

Sample Rules

A sample of Spanish-to-English rules is presented in Figure 3. The rules are sorted on the basis of frequency.
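The error values used to choose among minimum-ED alignments, and the confidence factor of a rule, can be illustrated with a short Python sketch. It is only an illustration under assumptions: the vowel set, the function names, and the representation of an alignment as (source character, target character) pairs with `None` for a gap are hypothetical, and the two alignments of Haag and Hague are written out by hand rather than produced by the AllAlignment algorithm. The figures 674 and 934 are those quoted in the text for the first Spanish-to-English sample rule.

```python
VOWELS = set("aeiouy")  # assumption: the paper does not list its vowel set


def error_value(src, tgt):
    """Error value of one aligned position; None marks a gap (insertion or deletion)."""
    if src is None or tgt is None:
        return 1  # insertion or deletion of a character
    if src == tgt:
        return 0  # same character at the same position
    if (src in VOWELS) == (tgt in VOWELS):
        return 1  # consonant-consonant or vowel-vowel substitution
    return 2  # consonant-vowel or vowel-consonant substitution


def alignment_error(alignment):
    """Sum of error values over an alignment given as (source_char, target_char) pairs."""
    return sum(error_value(s, t) for s, t in alignment)


def confidence_factor(frequency, source_occurrences):
    """Rule frequency divided by the number of source terms containing the
    rule's source substring, expressed as a percentage."""
    return 100.0 * frequency / source_occurrences


if __name__ == "__main__":
    # Two alignments of Haag and Hague (lower-cased), both with ED = 3.
    with_substitutions = [("h", "h"), ("a", "a"), ("a", "g"), ("g", "u"), (None, "e")]
    with_gaps = [("h", "h"), ("a", "a"), ("a", None), ("g", "g"), (None, "u"), (None, "e")]
    for alignment in (with_substitutions, with_gaps):
        print(alignment_error(alignment))
    # The alignment with the smallest error sum is the one selected.
    selected = min((with_substitutions, with_gaps), key=alignment_error)
    print(selected is with_gaps)
    print(round(confidence_factor(674, 934), 2))
```

Both alignments cost three edit operations, but the one that keeps g aligned with g avoids the two vowel-consonant substitutions and therefore has the lower error sum.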
In the middle location rules the left and right-hand characters of the source and target strings are context characters. The first line, for example, shows that the letter c, after a and prior to i, is replaced by the letter t in the middle of words, with the confidence factor being 72.16% (100% * 674/934). In the beginning and end location rules the right and left-hand characters, respectively, are the context characters.

Source string | Target string | Location
aci | ati | middle
co | c | end
ca | c | end
na | n | end
na | ne | end
es | s | beginning
te | the | middle
hip | hyp | middle
cci | cti | middle
te | t | end
a | al | end
idad | ity | end
to | t | end
do | d | end
[etc.]

Figure 3. A sample of Spanish-to-English rules (the Frequency, No of words, and Confidence factor columns are not recoverable in this transcription)

2.2 Translation Resources

The following translation resources were used in the study for producing the transformation rules:

- Multilingual medical dictionary by Andre Fairchild. In this study the English, German, French, and Spanish portions of the dictionary were used. For each language the number of dictionary entries was [...].
- A Finnish list of medical terms (n=5970) was translated into English using the MOT medical dictionary by Kielikone Plc. to obtain Finnish - English term pairs.
- A Swedish list of medical terms (n=657) was translated into English using the MOT medical dictionary to obtain Swedish - English term pairs.

Using these translation resources we obtained transformation rules for the following language pairs: Finnish-English, French-English, German-English, Spanish-English, and Swedish-English.

The number of term pairs on the basis of which transformation rules were produced ranges from 657 to around [...]. It can be expected that the more terms there are, the better the rules and the more precise the translation results will be.

2.3 Target Word List and Source Words

Test source words were collected as described below, and their spelling variants in the target word list were identified. As a target word list we used the index of CLEF's [5] LA Times collection, which contains [...] words.

The source words were collected in two ways. The first source word list was collected by browsing the LA Times index from beginning to end. A list of 217 English technical terms (medical, biological, and chemical terms) and place names that, on intellectual analysis, were similar in different languages was gathered and translated into Finnish, French, German, Spanish, and Swedish by a research assistant. To ensure that all the words were correctly translated, the translations were checked by native speakers or advanced students of each source language.

The first source word list (217 word tuples in five source languages) was split into two parts: a training word list and a test word list. A random sample of one third of the full list was used as training data. We thus obtained 72 training word tuples and 145 final test word tuples. The partitioning of the data into training and test data allows tuning of the TRT method.

A second source word list was used to test whether the rules generated from medical dictionaries are suitable for terms in utterly different domains. Finnish, French, German, Spanish, and Swedish terms (n=126 word tuples) were picked from dictionaries.
The list included terms in the domains of economics and technology, and a category of miscellaneous terms covering various domains (history, music, etc.). In all, 123 of the 126 English equivalents of the terms were found in the target word list. The remaining 3 equivalents were added to the target list so that all 126 source word tuples were available for the tests. The total number of test words used in the experiments was 5 (languages) * (145 + 126) words = 1355 words.

2.4 N-gram Matching

In n-gram matching words are decomposed into n-grams, i.e., into substrings of length n [6, 7, 9, 10, 12]. As n-gram techniques we used digrams and trigrams, containing two and three characters respectively. In both digrams and trigrams the start and end white spaces were used as constituent characters of n-grams. The degree of similarity between the source and target words w_1 and w_2 was computed on the basis of the number of n-grams that the words have in common and the total number of unique n-grams in the words, as follows [6]:

$$Sim(w_1, w_2) = \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$$

where N_i refers to the set of n-grams derived from the word w_i, with i = 1, 2.

Digram matching has been reported to be an effective fuzzy matching technique in name searching and spelling variant translation. Pfeifer et al. [7] tested various fuzzy matching techniques for surname variants and found that the best single method was digram matching. In our earlier research we tested n-gram based translation of cross-lingual spelling variants and found that digrams formed of non-consecutive characters of words (which we call skipgrams) performed better than digrams made up of consecutive characters [8]. Trigrams performed worse than [...], but sometimes gave better results than [...]. Thus, digrams and trigrams are appropriate fuzzy matching techniques for use with TRT. In this study, however, the aim is to explore whether TRT together with fuzzy matching is a viable method rather than to select the best TRT and fuzzy matching combination.
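The similarity measure of Section 2.4 is the size of the intersection of the two n-gram sets divided by the size of their union. A minimal Python sketch follows; the function names and the choice of exactly one padding space at each end of the word (for both digrams and trigrams) are assumptions, since the text only says that the start and end white spaces were used as constituent characters.

```python
def ngrams(word, n):
    """Set of n-grams of `word`, padded with one leading and one trailing space (assumed)."""
    padded = f" {word.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def sim(w1, w2, n=2):
    """Shared n-grams divided by the total number of unique n-grams in both words."""
    n1, n2 = ngrams(w1, n), ngrams(w2, n)
    return len(n1 & n2) / len(n1 | n2)


def rank(source, targets, n=2):
    """Target words sorted by decreasing similarity to the source word."""
    return sorted(targets, key=lambda t: sim(source, t, n), reverse=True)


if __name__ == "__main__":
    # konvektio and convection share 5 digrams out of 15 unique ones.
    print(sim("konvektio", "convection"))
    print(sim("konvektio", "convection", n=3))
    print(rank("convection", ["connection", "convection", "collection"]))
```

Because the n-grams are collected into sets, a digram that occurs twice in a word (such as on in convection) is counted once, which matches the "unique n-grams" wording of the text.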
2.5 Translation Strategies

Two TRT translation strategies were examined. The first one is called a high confidence factor (HCF) strategy. By using a relatively high confidence factor as a threshold, this strategy seeks to minimise the number of incorrect transformations. Based on the training results, a confidence factor of 50% was used as a threshold. For each source word one intermediate form was produced by applying to the source word all the rules applicable to it (one rule, two rules etc., or no rule). A drawback associated with HCF is that the number of rules that are available is limited.

In HCF the rules were applied to source words in an order determined by (1) the location of the rules in source words, (2) the source string length, and (3) confidence factor. In (1) the application order was as follows: end, beginning, and middle location rules. For example, for the Finnish word konvektio the rules o → on (end), ko → co (beginning), and ekt → ect (middle) were applied in this order to yield the intermediate form convection (which is a correct translation). In (2) and (3) the rules were applied starting from the longest source string and the highest confidence factor value. In the case of competitive rules (i.e., the same character sequence may be transformed using more than one rule) only the first rule in this order was applied to a word. As TRT is a new method we did not have prior knowledge of which order might give the most accurate intermediate forms. The application order of the rules has effects in the case of competitive rules which, however, were not common. The optimization of the application order nevertheless needs further investigation.

The second strategy is called a low confidence factor (LCF) strategy. For each source word all the possible intermediate forms were produced by applying to the source word all the rules applicable to it. However, a threshold confidence factor of 10% was used to filter out unreliable rules.
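The two strategies can be sketched in Python using the three Finnish-to-English rules quoted above. This is a simplified illustration, not the authors' implementation: the rule representation, the function names, and the handling of middle rules (first occurrence that touches neither end of the word) are assumptions; context characters are folded into the rule strings; and confidence factors are omitted, so the rules are simply assumed to have passed the relevant threshold. With only these three rules the sketch yields eight LCF forms for konvektio (including the unchanged source word), whereas the paper's full rule set gave seven.

```python
from itertools import combinations

# (source string, target string, location): the three rules quoted in the text
RULES = [("o", "on", "end"), ("ko", "co", "beginning"), ("ekt", "ect", "middle")]
LOCATION_ORDER = {"end": 0, "beginning": 1, "middle": 2}


def apply_rule(word, rule):
    """Apply one rule at its location; return the word unchanged if it does not match."""
    src, tgt, location = rule
    if location == "end":
        return word[:-len(src)] + tgt if word.endswith(src) else word
    if location == "beginning":
        return tgt + word[len(src):] if word.startswith(src) else word
    # Middle: first occurrence that touches neither the start nor the end of the word.
    i = word.find(src, 1)
    if i != -1 and i + len(src) < len(word):
        return word[:i] + tgt + word[i + len(src):]
    return word


def hcf_form(word, rules):
    """One intermediate form: apply the rules by location (end, beginning, middle),
    longest source string first within a location."""
    ordered = sorted(rules, key=lambda r: (LOCATION_ORDER[r[2]], -len(r[0])))
    for rule in ordered:
        word = apply_rule(word, rule)
    return word


def lcf_forms(word, rules):
    """All intermediate forms: one per subset of the rules; the empty subset
    keeps the original source word in the set."""
    forms = set()
    for k in range(len(rules) + 1):
        for subset in combinations(rules, k):
            forms.add(hcf_form(word, subset))
    return forms


if __name__ == "__main__":
    print(hcf_form("konvektio", RULES))
    print(sorted(lcf_forms("konvektio", RULES)))
```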
For example, for the Finnish word konvektio 7 intermediate forms were obtained, including the forms konvectio, convektion, and convection. In LCF the application order of the rules is irrelevant, as each order yields the same intermediate forms. Each intermediate form of the source word gave one result list. These were combined to yield one ranked result list. The rationale behind LCF is that the set of intermediate forms obtained through TRT is likely to include the correct equivalent of the source word, provided that the rules are good (the original source word was included in the set of intermediate forms). A drawback associated with LCF is that it may give many incorrect transformations.

Both in HCF and LCF the (bad) rules whose frequency was < 50 were removed.

2.6 Evaluation

Each source word possessed one correct equivalent in the target word list. For each word, precision was calculated by considering the position of the correct equivalent (pce) in the ranked result list of n-gram matching, as follows:

$$Precision = \frac{1}{pce}$$

Finally, average precision over all test words was computed.

The calculation of precision was elaborated as follows. Sometimes two or more words share the same SIM value. Therefore the results were evaluated using two evaluation measures: Worst Position Precision and Average Position Precision. In Worst Position Precision, the correct equivalent that shared the same SIM value with other words (incorrect equivalents) was assumed to be the last word among the words with equal SIM value. In Average Position Precision the correct equivalent was assumed to be in the middle of the set of the words with equal SIM value.

3. FINDINGS

For Swedish, transformation rules were produced using 657 term pairs only. The combined TRT and fuzzy matching technique was not useful: it performed as well as or slightly worse than fuzzy matching alone. The Swedish results suggest that the rules should be formed on the basis of thousands rather than hundreds of term pairs.

3.1 HCF Strategy

In the HCF translation strategy the two evaluation measures gave almost the same results. This is due to the small number of matching words having the same SIM value.
We therefore show Average Position Precision results only, in Tables 1-4. For all languages in Tables 1-4, the number of test word types is as follows:

- Medical, biological, and chemical terms, n=90 (called Bio terms in the tables)
- Place names, n=55
- Terms in economics, n=31
- Terms in technology, n=36
- Miscellaneous terms, n=59

There are several clear trends in Tables 1-4:

- The combined TRT and fuzzy matching technique performs better than fuzzy matching alone, but its effectiveness depends on the source language. For Finnish, performance improvements are remarkable (Table 1). Also for German (Table 3) and Spanish (Table 4) the technique is useful, but for French precision is changed only slightly (Table 2). Sometimes it is improved and sometimes decreased.
- In most cases TRT with [...] performs better than TRT with [...].
- The combined technique is useful for all term types. Thus, the results clearly answer the question whether the rules generated from medical dictionaries are suited for all term types (or just medical terms). In fact, the best improvements are achieved in the domain of technology.

Table 5 shows the effects of TRT for technological terms (n=36 terms) without n-gram matching, with a threshold confidence factor of 50%. Positive total transformation means that most individual transformations are correct (i.e., two correct transformations and one incorrect transformation, or one correct transformation, but another rule should be applied to obtain a correct translation). Neutral total transformation means that one rule applied to a word yields a correct transformation and another rule an incorrect transformation. The meaning of negative total transformation is obvious. No change due to transformation is associated with two factors: (a) there is no transformation rule available (given the threshold confidence factor of 50%), and (b) there is no need for transformation since the source and English terms are identical.
The figures in parentheses refer to the number of identical terms shared by a source language and English.

In part, the results can be explained on the basis of the number of identical terms shared by a source language and English (see also Section 4). As can be seen in Table 5, French and English terms are often identical (25/36 identical terms), and the effects of TRT are minor, while Finnish very often uses its own spelling (no identical terms), and TRT is very effective in Finnish-to-English translation. For all the 216 technical terms the percentages of identical terms are as follows: Finnish (0.0%), French (48.8%), German (21.7%), and Spanish (11.1%).

The choice of a threshold confidence factor of 50% for all test languages probably also has clear effects on the results. It is possible that tuning the confidence factor for each language would have given better results, for example, for Spanish. This issue needs further investigation.

3.2 LCF Strategy

As in the HCF translation strategy, in LCF precision was increased through TRT. No major differences were found between the effectiveness of the HCF and LCF strategies. For Finnish and German there were no clear trends regarding the relative effectiveness of the strategies. However, for French and Spanish LCF yielded better results. The LCF results for Spanish (Average Position Precision) are reported in Table 6. As can be seen, the relative improvement percentage due to TRT with respect to baseline is up to 28.0%, while the corresponding HCF figure for Spanish is 9.8% (Table 4). TRT with digrams performs roughly as well as TRT with trigrams.
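The precision values reported in the tables are reciprocal positions of the correct equivalent, averaged over test words, with ties handled as described in Section 2.6. The following Python sketch shows one way to compute them. The function names and the toy SIM values are hypothetical, and the reading of "the middle of the set of the words with equal SIM value" as the midpoint of the tied block is an assumption.

```python
def position_precision(scores, correct, mode="average"):
    """Precision = 1 / pce for one source word.

    `scores` maps each target word to its SIM value; `correct` is the
    correct equivalent; `mode` is "worst" or "average".
    """
    target = scores[correct]
    higher = sum(1 for s in scores.values() if s > target)
    tied = sum(1 for s in scores.values() if s == target)  # includes the correct word
    if mode == "worst":
        pce = higher + tied  # last among the words with equal SIM value
    else:
        pce = higher + (tied + 1) / 2  # midpoint of the tied block (assumed reading)
    return 1 / pce


def mean_precision(results, mode="average"):
    """Average precision over (scores, correct) pairs, one pair per test word."""
    return sum(position_precision(s, c, mode) for s, c in results) / len(results)


if __name__ == "__main__":
    # Toy SIM values: one word ranks higher, three words are tied.
    scores = {"a": 0.9, "b": 0.5, "c": 0.5, "d": 0.5, "e": 0.1}
    print(position_precision(scores, "c", mode="worst"))
    print(position_precision(scores, "c", mode="average"))
```

When no other word shares the SIM value of the correct equivalent, both measures reduce to the plain reciprocal rank, which is consistent with the paper's remark that the two measures gave almost the same results.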

Table 1. Precision of the combined TRT and fuzzy Finnish-to-English matching. High Confidence Factor.
Term type | Digrams | TRT with digrams | Trigrams | TRT with trigrams
Rows: Bio terms, Place names, Economics, Technology, Miscellaneous (values not recoverable in this transcription)

Table 2. Precision of the combined TRT and fuzzy French-to-English matching. High Confidence Factor.
Term type | Digrams | TRT with digrams | Trigrams | TRT with trigrams
Rows: Bio terms, Place names, Economics, Technology, Miscellaneous (values not recoverable in this transcription)

Table 3. Precision of the combined TRT and fuzzy German-to-English matching. High Confidence Factor.
Term type | Digrams | TRT with digrams | Trigrams | TRT with trigrams
Rows: Bio terms, Place names, Economics, Technology, Miscellaneous (values not recoverable in this transcription)

Table 4. Precision of the combined TRT and fuzzy Spanish-to-English matching. High Confidence Factor.
Term type | Digrams | TRT with digrams | Trigrams | TRT with trigrams
Rows: Bio terms, Place names, Economics, Technology, Miscellaneous (values not recoverable in this transcription)

Table 5. Effectiveness of TRT without n-gram matching for technological terms (n=36 terms).
Language | Correct translation | Positive total transformation | No change | Neutral total transformation | Negative total transformation
Finnish | [...] | [...] | [...] (0) | 4 | -
French | 4 | - | 32 (25) | - | -
German | [...] | [...] | [...] (10) | - | 1
Spanish | [...] | [...] | [...] (8) | - | 2

Table 6. Precision of the combined TRT and fuzzy Spanish-to-English matching. Low Confidence Factor.
Term type | Digrams | TRT and digrams | Trigrams | TRT and trigrams
Rows: Bio terms, Place names, Economics, Technology, Miscellaneous (values not recoverable in this transcription)

4. DISCUSSION AND CONCLUSIONS

Technical terms and proper names are often untranslatable due to the limited coverage of translation dictionaries. This has a depressing effect on CLIR performance, as such expressions are often prime keys in queries. In this study we presented a novel fuzzy translation technique based on automatically generated transformation rules and fuzzy matching. Two translation strategies were tested. In the high confidence factor strategy the aim was to minimise the number of incorrect transformations by using a relatively high confidence factor. Each source word yielded one intermediate form. In the low confidence factor strategy the rules were applied extensively. A source word often yielded several intermediate forms. Digram and trigram matching were tested in combination with TRT. The results were encouraging, as both strategies and combination methods performed better than digrams and trigrams alone.

Generally, the effectiveness of fuzzy translation depends on (1) the frequency of identical terms shared by a source and a target language, and (2) the extent of variation in the spelling variants between a source and a target language. In French and English technical terms often are identical, and the potential improvements due to fuzzy translation are limited. On the other hand, fuzzy translation is well suited for language pairs with a high percentage of similar but non-identical terms.
Digrams and trigrams alone often failed to yield precise translations for terms which differed in more than two letters, i.e., where the extent of variation in the spelling variants was relatively high. For example, the correct equivalent allergy of the Spanish term alergia was found at the 27th position in the digram result list, whereas in the combined TRT and digram list it was at the first position, since TRT gave a correct translation. The strengths of the combined technique are marked particularly in cases where the extent of variation is very high, e.g., Chechnya - Tchetchenie. In cases like this fuzzy matching alone is powerless.

The figures below show the percentage of correct equivalents in four position classes in the ranked result list of Fin-Eng/TRT and digram matching (avg. precision 72.0%, Table 1). The distribution is typical of all cases at this level of precision.

Position class | % Correct equivalents
[...]

As shown, 80% of the correct equivalents are within the set of four highest ranked words. In this case, the distribution figures suggest that the TRT based fuzzy translation technique is viable in operational CLIR systems, the noise being acceptable. Moreover, it should be noted that there are several ways to improve this novel technique (see below).

It seems clear that spelling variation does not depend on the domain of terms within a language. We consider this an important finding, suggesting that it is possible to use a dictionary of one specific domain to produce general transformation rules for a language pair. The finding is reasonable, since orthography is a language specific rather than a domain specific phenomenon within a language.

Based on the results of this study our future research will involve the identification of language pairs for which fuzzy translation is effective, the improvement of the rules (for example, utilising rule co-occurrence information), testing the effects of tuning a confidence factor for a specific language pair, selecting the best TRT and fuzzy matching combination, and testing how to apply fuzzy translation in actual CLIR research. Regarding the best combination, we will explore fuzzy matching techniques other than those tested in this study together with TRT. One promising method is LCS (longest common subsequence) and another skipgrams [8]. The actual CLIR research seeks to answer the question of how fuzzy translation should be applied in automatic CLIR query formulation and interactive CLIR to achieve the best possible retrieval performance.

5. ACKNOWLEDGMENTS

The Multilingual Medical Technical Dictionary was provided by Andre Fairchild, of Denver, Colorado, USA. We would like to thank Andre Fairchild for permission to use the dictionary.

ENGTWOL morphological analyser was used for the morphological analysis of the English data. ENGTWOL (Morphological Transducer Lexicon Description of English): Copyright (c) Atro Voutilainen and Juha Heikkilä. TWOL-R (Run-Time Two-Level Program): Copyright (c) Kimmo Koskenniemi and Lingsoft Ltd.

This work was partly financed by the Clarity Information Society Technologies (IST) Programme, Proposal/Contract no: IST [...].

6. REFERENCES

[1] Charras, C. and Lecroq, T. Sequence comparison. Available from: ~lecroq/ seqcomp
[2] Covington, M.A. An algorithm to align words for historical comparison. Computational Linguistics, 22(4).
[3] Knight, K. and Graehl, J. Machine transliteration. Computational Linguistics, 24(4).
[4] Oard, D. and Diekema, A. Cross-language information retrieval. Annual Review of Information Science and Technology (ARIST), 33.
[5] Peters, C. CLEF - Cross-Language Evaluation Forum.
[6] Pfeifer, U., Poersch, T. and Fuhr, N. Retrieval effectiveness of proper name search methods. Information Processing & Management, 32(6).
[7] Pfeifer, U., Poersch, T. and Fuhr, N. Searching proper names in databases. HIM.
[8] Pirkola, A., Keskustalo, H., Leppänen, E., Känsälä, A-P. and Järvelin, K. Targeted s-gram matching: a novel n-gram matching technique for cross- and monolingual word form variants. Information Research, 7(2).
[9] Robertson, A.M. and Willett, P. Applications of n-grams in textual information systems. Journal of Documentation, 54(1).
[10] Salton, G. Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, Mass.: Addison-Wesley.
[11] Stalls, B. and Knight, K. Translating names and technical terms in Arabic text. Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages.
[12] Zobel, J. and Dart, P. Phonetic string matching: lessons from information retrieval. Proc. ACM SIGIR, Zurich, Switzerland.


More information

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

More information

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de

More information

Private Record Linkage with Bloom Filters

Private Record Linkage with Bloom Filters To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,

More information

Search Query and Matching Approach of Information Retrieval in Cloud Computing

Search Query and Matching Approach of Information Retrieval in Cloud Computing International Journal of Advances in Electrical and Electronics Engineering 99 Available online at www.ijaeee.com & www.sestindia.org ISSN: 2319-1112 Search Query and Matching Approach of Information Retrieval

More information

Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

More information

Improving Non-English Web Searching (inews07)

Improving Non-English Web Searching (inews07) SIGIR 2007 WORKSHOP REPORT Improving Non-English Web Searching (inews07) Fotis Lazarinis Technological Educational Institute Mesolonghi, Greece lazarinf@teimes.gr Jesus Vilares Ferro University of A Coruña

More information

An Information Retrieval using weighted Index Terms in Natural Language document collections

An Information Retrieval using weighted Index Terms in Natural Language document collections Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia

More information

Regular Expressions and Automata using Haskell

Regular Expressions and Automata using Haskell Regular Expressions and Automata using Haskell Simon Thompson Computing Laboratory University of Kent at Canterbury January 2000 Contents 1 Introduction 2 2 Regular Expressions 2 3 Matching regular expressions

More information

Comparing Methods to Identify Defect Reports in a Change Management Database

Comparing Methods to Identify Defect Reports in a Change Management Database Comparing Methods to Identify Defect Reports in a Change Management Database Elaine J. Weyuker, Thomas J. Ostrand AT&T Labs - Research 180 Park Avenue Florham Park, NJ 07932 (weyuker,ostrand)@research.att.com

More information

Web Data Extraction: 1 o Semestre 2007/2008

Web Data Extraction: 1 o Semestre 2007/2008 Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008

More information

2-3 Automatic Construction Technology for Parallel Corpora

2-3 Automatic Construction Technology for Parallel Corpora 2-3 Automatic Construction Technology for Parallel Corpora We have aligned Japanese and English news articles and sentences, extracted from the Yomiuri and the Daily Yomiuri newspapers, to make a large

More information

BITS: A Method for Bilingual Text Search over the Web

BITS: A Method for Bilingual Text Search over the Web BITS: A Method for Bilingual Text Search over the Web Xiaoyi Ma, Mark Y. Liberman Linguistic Data Consortium 3615 Market St. Suite 200 Philadelphia, PA 19104, USA {xma,myl}@ldc.upenn.edu Abstract Parallel

More information

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating Window-Based Passage-Level Evidence in Document Retrieval Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

More information

Overview of iclef 2008: search log analysis for Multilingual Image Retrieval

Overview of iclef 2008: search log analysis for Multilingual Image Retrieval Overview of iclef 2008: search log analysis for Multilingual Image Retrieval Julio Gonzalo Paul Clough Jussi Karlgren UNED U. Sheffield SICS Spain United Kingdom Sweden julio@lsi.uned.es p.d.clough@sheffield.ac.uk

More information

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS

More information

A Comparative Study of Online Translation Services for Cross Language Information Retrieval

A Comparative Study of Online Translation Services for Cross Language Information Retrieval A Comparative Study of Online Translation Services for Cross Language Information Retrieval Ali Hosseinzadeh Vahid, Piyush Arora, Qun Liu, Gareth J. F. Jones ADAPT Centre / CNGL School of Computing Dublin

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

2 SYSTEM DESCRIPTION TECHNIQUES

2 SYSTEM DESCRIPTION TECHNIQUES 2 SYSTEM DESCRIPTION TECHNIQUES 2.1 INTRODUCTION Graphical representation of any process is always better and more meaningful than its representation in words. Moreover, it is very difficult to arrange

More information

Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection

Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection Jian Qu, Nguyen Le Minh, Akira Shimazu School of Information Science, JAIST Ishikawa, Japan 923-1292

More information

Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC

Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC Paper 073-29 Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC ABSTRACT Version 9 of SAS software has added functions which can efficiently

More information

Improved Single and Multiple Approximate String Matching

Improved Single and Multiple Approximate String Matching Improved Single and Multiple Approximate String Matching Kimmo Fredriksson Department of Computer Science, University of Joensuu, Finland Gonzalo Navarro Department of Computer Science, University of Chile

More information

New Hash Function Construction for Textual and Geometric Data Retrieval

New Hash Function Construction for Textual and Geometric Data Retrieval Latest Trends on Computers, Vol., pp.483-489, ISBN 978-96-474-3-4, ISSN 79-45, CSCC conference, Corfu, Greece, New Hash Function Construction for Textual and Geometric Data Retrieval Václav Skala, Jan

More information

Phonetic Models for Generating Spelling Variants

Phonetic Models for Generating Spelling Variants Phonetic Models for Generating Spelling Variants Rahul Bhagat and Eduard Hovy Information Sciences Institute University Of Southern California 4676 Admiralty Way, Marina Del Rey, CA 90292-6695 {rahul,

More information

Comparative Analysis on the Armenian and Korean Languages

Comparative Analysis on the Armenian and Korean Languages Comparative Analysis on the Armenian and Korean Languages Syuzanna Mejlumyan Yerevan State Linguistic University Abstract It has been five years since the Korean language has been taught at Yerevan State

More information

EuropeanaConnect Multilinguality Survey

EuropeanaConnect Multilinguality Survey EuropeanaConnect Multilinguality Survey Nicola Ferro & Vivien Petras Workshop at ICSD 2009 Trento, Italy 9 September 2009 Background EuropeanaConnect Task 2.1 User studies & multilingual resources use:

More information

A MULTILINGUAL AND LOCATION EVALUATION OF SEARCH ENGINES FOR WEBSITES AND SEARCHED FOR KEYWORDS

A MULTILINGUAL AND LOCATION EVALUATION OF SEARCH ENGINES FOR WEBSITES AND SEARCHED FOR KEYWORDS A MULTILINGUAL AND LOCATION EVALUATION OF SEARCH ENGINES FOR WEBSITES AND SEARCHED FOR KEYWORDS Anas AlSobh Ahmed Al Oroud Mohammed N. Al-Kabi Izzat AlSmadi Yarmouk University Jordan ABSTRACT Search engines

More information

Problems with the current speling.org system

Problems with the current speling.org system Problems with the current speling.org system Jacob Sparre Andersen 22nd May 2005 Abstract We out-line some of the problems with the current speling.org system, as well as some ideas for resolving the problems.

More information

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Project for Michael Pitts Course TCSS 702A University of Washington Tacoma Institute of Technology ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Under supervision of : Dr. Senjuti

More information

REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION

REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION Pilar Rey del Castillo May 2013 Introduction The exploitation of the vast amount of data originated from ICT tools and referring to a big variety

More information

Concepts of digital forensics

Concepts of digital forensics Chapter 3 Concepts of digital forensics Digital forensics is a branch of forensic science concerned with the use of digital information (produced, stored and transmitted by computers) as source of evidence

More information

A Joint Sequence Translation Model with Integrated Reordering

A Joint Sequence Translation Model with Integrated Reordering A Joint Sequence Translation Model with Integrated Reordering Nadir Durrani, Helmut Schmid and Alexander Fraser Institute for Natural Language Processing University of Stuttgart Introduction Generation

More information

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute

More information

Fast Sequential Summation Algorithms Using Augmented Data Structures

Fast Sequential Summation Algorithms Using Augmented Data Structures Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer

More information

Using COTS Search Engines and Custom Query Strategies at CLEF

Using COTS Search Engines and Custom Query Strategies at CLEF Using COTS Search Engines and Custom Query Strategies at CLEF David Nadeau, Mario Jarmasz, Caroline Barrière, George Foster, and Claude St-Jacques Language Technologies Research Centre Interactive Language

More information

Ontology-Based Multilingual Information Retrieval

Ontology-Based Multilingual Information Retrieval Ontology-Based Multilingual Information Retrieval Jacques Guyot * Saïd Radhouani *,** Gilles Falquet * * Centre universitaire d informatique 24, rue Général-Dufour, CH-1211 Genève 4, Switzerland ** Laboratoire

More information

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics

More information

OPTIMIZING CONTENT FOR TRANSLATION ACROLINX AND VISTATEC

OPTIMIZING CONTENT FOR TRANSLATION ACROLINX AND VISTATEC OPTIMIZING CONTENT FOR TRANSLATION ACROLINX AND VISTATEC We ll look at these questions. Why does translation cost so much? Why is it hard to keep content consistent? Why is it hard for an organization

More information

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction Regression Testing of Database Applications Bassel Daou, Ramzi A. Haraty, Nash at Mansour Lebanese American University P.O. Box 13-5053 Beirut, Lebanon Email: rharaty, nmansour@lau.edu.lb Keywords: Regression

More information

Five Pronunciation Games for Brazil

Five Pronunciation Games for Brazil Five Pronunciation Games for Brazil Mark Hancock with Ricardo Sili I presented a workshop called 'Pronunciation Games for Brazil' with Ricardo Sili at the 13th BRAZ-TESOL National Convention. This article

More information

SINAI at WEPS-3: Online Reputation Management

SINAI at WEPS-3: Online Reputation Management SINAI at WEPS-3: Online Reputation Management M.A. García-Cumbreras, M. García-Vega F. Martínez-Santiago and J.M. Peréa-Ortega University of Jaén. Departamento de Informática Grupo Sistemas Inteligentes

More information

Multistep Dynamic Expert Sourcing

Multistep Dynamic Expert Sourcing +33 1 69 33 59 59 MULTISTEP DYNAMIC EXPERT SOURCING 1 A Novel Approach for Open Innovation Platforms Multistep Dynamic Expert Sourcing Albert Meige & Boris Golden August 2010 X- Technologies Ecole Polytechnique

More information

QUT Digital Repository: http://eprints.qut.edu.au/

QUT Digital Repository: http://eprints.qut.edu.au/ QUT Digital Repository: http://eprints.qut.edu.au/ Lu, Chengye and Xu, Yue and Geva, Shlomo (2008) Web-Based Query Translation for English-Chinese CLIR. Computational Linguistics and Chinese Language Processing

More information

Symbol Tables. Introduction

Symbol Tables. Introduction Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

More information

LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task

LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task Jacinto Mata, Mariano Crespo, Manuel J. Maña Dpto. de Tecnologías de la Información. Universidad de Huelva Ctra. Huelva - Palos de la Frontera s/n.

More information

PartJoin: An Efficient Storage and Query Execution for Data Warehouses

PartJoin: An Efficient Storage and Query Execution for Data Warehouses PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE ladjel@imerir.com 2

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this

More information

7 Gaussian Elimination and LU Factorization

7 Gaussian Elimination and LU Factorization 7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method

More information

Finding Content in File-Sharing Networks When You Can t Even Spell

Finding Content in File-Sharing Networks When You Can t Even Spell Finding Content in File-Sharing Networks When You Can t Even Spell Matei A. Zaharia, Amit Chandel, Stefan Saroiu, and Srinivasan Keshav University of Waterloo and University of Toronto Abstract: The query

More information

CHAPTER 2 DATABASE MANAGEMENT SYSTEM AND SECURITY

CHAPTER 2 DATABASE MANAGEMENT SYSTEM AND SECURITY CHAPTER 2 DATABASE MANAGEMENT SYSTEM AND SECURITY 2.1 Introduction In this chapter, I am going to introduce Database Management Systems (DBMS) and the Structured Query Language (SQL), its syntax and usage.

More information

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt>

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt> Intelligent Log Analyzer André Restivo 9th January 2003 Abstract Server Administrators often have to analyze server logs to find if something is wrong with their machines.

More information

4 Pitch and range in language and music

4 Pitch and range in language and music 4 Pitch and range in language and music 4.1 Average and range of pitch in spoken language and song 4.1.1 Average and range of pitch in language Fant (1956) determined the average values for fundamental

More information

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89 by Joseph Collison Copyright 2000 by Joseph Collison All rights reserved Reproduction or translation of any part of this work beyond that permitted by Sections

More information

HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND IGNORING SPELLING ERROR IN SEARCH KEY BASED ON DOUBLE METAPHONE ALGORITHM

HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND IGNORING SPELLING ERROR IN SEARCH KEY BASED ON DOUBLE METAPHONE ALGORITHM HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND IGNORING SPELLING ERROR IN SEARCH KEY BASED ON DOUBLE METAPHONE ALGORITHM Md. Palash Uddin 1, Ashfaque Ahmed 2, Md. Delowar Hossain

More information

Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables

Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables 1 M.Naveena, 2 S.Sangeetha 1 M.E-CSE, 2 AP-CSE V.S.B. Engineering College, Karur, Tamilnadu, India. 1 naveenaskrn@gmail.com,

More information

Information Retrieval Systems in XML Based Database A review

Information Retrieval Systems in XML Based Database A review Information Retrieval Systems in XML Based Database A review Preeti Pandey 1, L.S.Maurya 2 Research Scholar, IT Department, SRMSCET, Bareilly, India 1 Associate Professor, IT Department, SRMSCET, Bareilly,

More information

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines , 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing

More information

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Samarjit Chakraborty Computer Engineering and Networks Laboratory Swiss Federal Institute of Technology (ETH) Zürich March

More information

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment 2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

More information

University of Chicago at NTCIR4 CLIR: Multi-Scale Query Expansion

University of Chicago at NTCIR4 CLIR: Multi-Scale Query Expansion University of Chicago at NTCIR4 CLIR: Multi-Scale Query Expansion Gina-Anne Levow University of Chicago 1100 E. 58th St, Chicago, IL 60637, USA levow@cs.uchicago.edu Abstract Pseudo-relevance feedback,

More information

Reading 13 : Finite State Automata and Regular Expressions

Reading 13 : Finite State Automata and Regular Expressions CS/Math 24: Introduction to Discrete Mathematics Fall 25 Reading 3 : Finite State Automata and Regular Expressions Instructors: Beck Hasti, Gautam Prakriya In this reading we study a mathematical model

More information

Integration of a Multilingual Keyword Extractor in a Document Management System

Integration of a Multilingual Keyword Extractor in a Document Management System Integration of a Multilingual Keyword Extractor in a Document Management System Andrea Agili *, Marco Fabbri *, Alessandro Panunzi +, Manuel Zini * * DrWolf s.r.l., + Dipartimento di Italianistica - Università

More information

Data Warehousing. Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de. Winter 2014/15. Jens Teubner Data Warehousing Winter 2014/15 1

Data Warehousing. Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de. Winter 2014/15. Jens Teubner Data Warehousing Winter 2014/15 1 Jens Teubner Data Warehousing Winter 2014/15 1 Data Warehousing Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Winter 2014/15 Jens Teubner Data Warehousing Winter 2014/15 152 Part VI ETL Process

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

A Natural Language Query Processor for Database Interface

A Natural Language Query Processor for Database Interface A Natural Language Query Processor for Database Interface Mrs.Vidya Dhamdhere Lecturer department of Computer Engineering Department G.H.Raisoni college of Engg.(Pune University) vidya.dhamdhere@gmail.com

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Computer Aided Document Indexing System

Computer Aided Document Indexing System Computer Aided Document Indexing System Mladen Kolar, Igor Vukmirović, Bojana Dalbelo Bašić, Jan Šnajder Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 0000 Zagreb, Croatia

More information

A Comparison of Dictionary Implementations

A Comparison of Dictionary Implementations A Comparison of Dictionary Implementations Mark P Neyer April 10, 2009 1 Introduction A common problem in computer science is the representation of a mapping between two sets. A mapping f : A B is a function

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Query Recommendation employing Query Logs in Search Optimization

Query Recommendation employing Query Logs in Search Optimization 1917 Query Recommendation employing Query Logs in Search Optimization Neha Singh Department of Computer Science, Shri Siddhi Vinayak Group of Institutions, Bareilly Email: singh26.neha@gmail.com Dr Manish

More information

Automated Multilingual Text Analysis in the Europe Media Monitor (EMM) Ralf Steinberger. European Commission Joint Research Centre (JRC)

Automated Multilingual Text Analysis in the Europe Media Monitor (EMM) Ralf Steinberger. European Commission Joint Research Centre (JRC) Automated Multilingual Text Analysis in the Europe Media Monitor (EMM) Ralf Steinberger European Commission Joint Research Centre (JRC) https://ec.europa.eu/jrc/en/research-topic/internet-surveillance-systems

More information

Cross-Lingual Concern Analysis from Multilingual Weblog Articles

Cross-Lingual Concern Analysis from Multilingual Weblog Articles Cross-Lingual Concern Analysis from Multilingual Weblog Articles Tomohiro Fukuhara RACE (Research into Artifacts), The University of Tokyo 5-1-5 Kashiwanoha, Kashiwa, Chiba JAPAN http://www.race.u-tokyo.ac.jp/~fukuhara/

More information

LSI TRANSLATION PLUG-IN FOR RELATIVITY. within

LSI TRANSLATION PLUG-IN FOR RELATIVITY. within within LSI Translation Plug-in (LTP) for Relativity is a free plug-in that allows the Relativity user to access the STS system 201 Broadway, Cambridge, MA 02139 Contact: Mark Ettinger Tel: 800-654-5006

More information

Electronic Document Management Using Inverted Files System

Electronic Document Management Using Inverted Files System EPJ Web of Conferences 68, 0 00 04 (2014) DOI: 10.1051/ epjconf/ 20146800004 C Owned by the authors, published by EDP Sciences, 2014 Electronic Document Management Using Inverted Files System Derwin Suhartono,

More information

INTERNATIONAL COMPARISONS OF PART-TIME WORK

INTERNATIONAL COMPARISONS OF PART-TIME WORK OECD Economic Studies No. 29, 1997/II INTERNATIONAL COMPARISONS OF PART-TIME WORK Georges Lemaitre, Pascal Marianna and Alois van Bastelaer TABLE OF CONTENTS Introduction... 140 International definitions

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base

Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base 32 Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base Brant N. Kay Brian C. Rineer SAS Institute Inc. SAS Institute Inc. 100 SAS Campus Drive 100 SAS Campus Drive

More information

Automatic Text Processing: Cross-Lingual. Text Categorization
