An Iterative approach to extract dictionaries from Wikipedia for under-resourced languages
Rohit Bharadwaj G, SIEL, LTRC, IIIT Hyderabad, [email protected]
Niket Tandon, Databases and Information Systems, Max-Planck-Institut für Informatik, [email protected]
Vasudeva Varma, SIEL, LTRC, IIIT Hyderabad, [email protected]

Abstract

The problem of extracting bilingual dictionaries from Wikipedia is well known and well researched. Given the structured and rich multilingual content of Wikipedia, a language-independent approach is necessary for extracting dictionaries for various languages, more so for under-resourced languages. In our attempt to mine dictionaries for under-resourced languages, we developed an iterative approach that constructs a parallel corpus for building a dictionary, for which we consider several kinds of Wikipedia article information: title, infobox information, category, article text and the dictionaries already built at each phase. The average precision over various datasets is encouraging, with a maximum precision of 76.7%, performing better than existing systems. As no language-specific resources are used, our method is applicable to any pair of languages, with special focus on under-resourced languages, and hence breaks the language barrier.

1 Introduction

The World-Wide Web (W3) was developed to be a pool of human knowledge, which would allow collaborators in remote sites to share their ideas and all aspects of a common project (Wardrip-Fruin and Montfort, 2003). At present, however, it is far from achieving this vision. For someone who reads only German, it is a German-wide web, and a Hindi reader sees it as a Hindi-wide web. To bridge the gap between the information available and the languages known, Cross Language Information Access (CLIA) systems are vital, all the more so now that large amounts of content in many languages are generated daily in the form of news articles, blogs, etc. A similar argument can be made for Wikipedia articles.

Figure 1: Number of Wikipedia pages (y-axis) with and without an inter-language link (ILL) to English, per language (x-axis)

The statistics in Figure 1 show the growth of Wikipedia in each language irrespective of the presence of an English counterpart, and we can conclude that a cross-lingual system is necessary to process this rich language content. The first step in leveraging this rich multilingual resource is to build bilingual dictionaries for further processing. For resource-rich languages like French, we can afford to use various language resources like POS taggers, chunkers and other prior information for extracting dictionaries, but for under-resourced languages like Telugu, we need a language-independent approach for building the dictionary. Though the problem of extracting dictionaries from Wikipedia has been well researched, language-independent extraction of dictionaries is of prime importance for under-resourced languages. Our work differs from existing techniques in the following ways. First, existing techniques exploit each kind of Wikipedia information individually and combine the results at the end to extract dictionaries; we instead use the already built dictionary iteratively to improve precision and coverage.

Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Macmillan Publishers, India.
Second, infoboxes are a vital source of information, mostly covering the proper nouns related to the title; in order to increase the coverage of proper nouns, we harness the infoboxes of the articles. Third, little work has been done on mining Wikipedia to build an English-Hindi dictionary, Hindi being an under-resourced language. As all 272 languages present in Wikipedia cannot have the same resources, we aim to build bilingual dictionaries using the richly linked structure of Wikipedia documents instead of language-specific resources. Stop words are determined based on word frequency in Wikipedia. We use the English Wikipedia as base and build English-Hindi and English-Telugu dictionaries. This paper discusses building the English-Hindi bilingual dictionary; the same methodology is also applied to build the English-Telugu bilingual dictionary.

2 Related Work

Dictionary building can be classified into two approaches, manual and automatic. For manual construction of dictionaries, there are various attempts like the JMdict/EDICT project (Breen, 2004). However, a large amount of resources is required to construct and maintain manually built dictionaries. Also, dictionaries built manually are ineffective for much of the recent vocabulary that is added to a language every day. In order to reduce the manual effort and keep dictionaries up to date, much of the focus is on automatic extraction of dictionaries from parallel or comparable corpora. The manually created language-specific rules, which formed the basis for automatic dictionary extraction in the initial stages, were later replaced by statistical models. Initial works like (Brown et al., 1990) and (Kay and Roscheisen, 1993) were in the direction of statistical machine translation, although their models can be used not just for translation but also for dictionary building. The major requirement for using statistical methods is the availability of bilingual parallel corpora, which again is limited for under-resourced languages. The coverage of dictionaries built using statistical translation models is also low. Though they work well for high-frequency terms, they fail when terms that are not present in the corpus are encountered, like technical jargon. Factors like sentence structure, grammatical differences, availability of language resources and the amount of parallel corpus available further hamper the recall and coverage of the dictionaries extracted. (Fung and McKeown, 1997) showed that a parallel corpus is not sufficient for good accuracy, as a large amount of text is added or omitted for the sake of conveying the meaning clearly.

After parallel corpora, a few attempts have been made in the direction of building bilingual dictionaries using comparable corpora. (Sadat et al., 2003) use comparable corpora along with linguistic resources for dictionary building. They perform bi-directional translation to obtain translation candidates that are later re-ranked by various methods using the WWW, comparable corpora and an interactive mode for phrasal translation. (Fung and McKeown, 1997) show various ways of translating technical terms using noisy parallel corpora. A few works that use Wikipedia to construct dictionaries are described below. (Tyers and Pienaar, 2008) built a bilingual dictionary using only the inter-language links (ILL). They collected a source word list and used the ILL for each word in the word list to find the corresponding translation. We follow a similar approach for building a dictionary from titles.
(Erdmann et al., 2008) extracted bilingual terminology from Wikipedia using ILL, redirect pages and link text to obtain translation candidates, which were ranked using the number of backward links. Though they achieved good results, they manually fixed the weights assigned to each category. Later, a classifier was employed to determine these weights (Erdmann et al., 2009). Along with ILL, we exploit the infobox, Wiki categories and text of the inter-language-linked articles to increase the size of the dictionary. The motivation behind using the infobox is the fact that it contains the factoid information of the Wikipedia article, and hence most of the query terms associated with the topic can be translated. Categories and the first paragraph of the Wikipedia article also form prospective query terms that can be associated with the topic; hence we build the dictionary using them as well. Though we consider ILL links, redirect text and anchor text are not used, because we consider only the first few lines
of the article, which generally contain the redirect text. We ignore anchor text as we restrict ourselves to the first paragraph of the article instead of the entire text. (Adafre and de Rijke, 2006) extracted near-parallel sentences from cross-language-linked articles. They achieved the task using an available dictionary and the link structure of Wikipedia. They built a cross-lingual lexicon using the ILL links present across the articles. Using this dictionary, they translated sentences in one language into the other and measured the similarity using the Jaccard similarity coefficient. They compare the two approaches and conclude that both give almost the same similarity scores in terms of precision and recall. We are more interested in the link structure, as it harnesses Wikipedia to find near-parallel sentences that can be used to build the dictionary. We use a similar method for extracting parallel sentences across the languages, albeit the dictionary we use is an enlarged version which includes not only the ILL but also the infobox and category information. Our methods are compared to statistical machine translation at various stages because of its language independence, which is our main aim.

3 Proposed Method

We follow an iterative approach to the task of mining dictionaries. Though we start at the same step of using ILL titles as a parallel corpus, like most of the works cited in Section 2, we then move into the category information, then the infobox information and then the actual text of these articles to increase the coverage of the dictionaries. Figure 2 depicts the iterative nature of our approach.

Figure 2: Iterative nature of the approach

Figure 3 shows a scenario that can occur when building the dictionary from infobox values or text, where we already have prior mappings between a few word pairs obtained from the previous steps.

Figure 3: An example of mappings

The example is an intermediate scenario of the model. Given two parallel sentences, we have the mappings of "truth" and "ahimsa" with the help of the previous dictionaries. The sentences under "Remaining words" form the parallel corpus and are used to build the dictionary. The words under "New mappings" constitute the dictionary to be formed. In the following subsections, we explain the preprocessing steps that are followed and our approach to building the dictionary from each information type of the Wikipedia article.

3.1 Preprocessing

We remove the stop words present in the lines generated for building the parallel corpus, using term frequency across Wikipedia. Also, the text taken from English articles is converted to lower case to make the lexicons case-insensitive, since this does not affect our dictionaries.

3.2 Dictionary Building

The dictionary building process varies for each of the title, category, infobox and text of the article. The common step is to generate a parallel corpus, use the previously built dictionaries to eliminate words, and calculate a score for each pair of words. The process involved in generating the parallel text and building the dictionary for each type of text (titles, categories, infobox, article text) is described in the following sections.
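To make this common step concrete, the following minimal Python sketch (our own illustration, not the original implementation; the function and variable names such as build_dictionary and prior_dict are assumptions) removes stop words chosen by corpus frequency, lower-cases the English side, drops words already covered by previously built dictionaries, and scores the remaining pairs by co-occurrence, i.e. the score defined as Formula 1 in Section 3.2.1.

from collections import defaultdict

def preprocess(tokens, stop_words, lowercase=False):
    # Drop stop words (chosen by term frequency across Wikipedia);
    # optionally lower-case (done for the English side only).
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stop_words]

def build_dictionary(parallel_pairs, prior_dict, en_stop, hi_stop, top_k=5):
    # parallel_pairs: list of (english_tokens, hindi_tokens) pairs drawn from
    #   titles, infobox values, categories or near-parallel sentences.
    # prior_dict: English word -> list of (hindi_word, score) mappings built in
    #   earlier iterations, used to eliminate already translated words.
    cooc = defaultdict(lambda: defaultdict(int))   # co-occurrence counts
    en_count = defaultdict(int)                    # occurrence counts of English words
    for en_tokens, hi_tokens in parallel_pairs:
        en_tokens = preprocess(en_tokens, en_stop, lowercase=True)
        hi_tokens = preprocess(hi_tokens, hi_stop)
        # eliminate word pairs already covered by previously built dictionaries
        known_hi = {h for e in en_tokens for h, _s in prior_dict.get(e, [])}
        en_tokens = [e for e in en_tokens if e not in prior_dict]
        hi_tokens = [h for h in hi_tokens if h not in known_hi]
        for e in en_tokens:
            en_count[e] += 1
            for h in hi_tokens:
                cooc[e][h] += 1
    # Formula 1 (Section 3.2.1): score(e, h) = cooc(e, h) / count(e);
    # keep the top_k Hindi candidates per English word.
    dictionary = {}
    for e, candidates in cooc.items():
        scored = sorted(((c / en_count[e], h) for h, c in candidates.items()),
                        reverse=True)
        dictionary[e] = [(h, score) for score, h in scored[:top_k]]
    return dictionary

Under this sketch, top_k would be 5 for titles (dictionary 1), 10 for infobox values and categories (dictionaries 2 and 3), and 2 for article text (dictionary 4), following Sections 3.2.1 to 3.2.4.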
3.2.1 Titles

Each article in the English Wikipedia is checked for a cross-lingual link to a Hindi article. All such articles are entered into a map whose key-value pairs are the titles of the articles connected by the cross-lingual link. The same is then done for the Hindi Wikipedia, and the map is updated with new title pairs. After performing the preprocessing described in Section 3.1 on the map, we form the parallel corpus for building the dictionary. The score of each word pair is calculated by Formula 1:

score(w_E^i, w_H^j) = \frac{|W_E^i \cap W_H^j|}{|W_E^i|}    (1)

where w_E^i is the i-th word in the English word list, w_H^j is the j-th word in the Hindi word list, |W_E^i \cap W_H^j| is the number of times w_E^i and w_H^j co-occur in the parallel corpus, and |W_E^i| is the number of occurrences of w_E^i in the corpus. After analysing the top-scored words, we found that the top 5 words contain the translation for the majority of words. Hence the top-5 scored Hindi words, along with their scores, are indexed to form dictionary 1.

3.2.2 Infobox

The prior step in building dictionaries from infoboxes is infobox extraction. Though the infoboxes of English articles have a specific format, the infoboxes of under-resourced languages (e.g. Hindi and Telugu in our case) are not so well defined. We identified patterns for matching infoboxes in the articles and extracted them. Building a dictionary from infoboxes is a two-step process. The first step is to create a map of keys across the languages so that the values can then be mapped to create the dictionary. In this first step, the articles from the map constructed in Section 3.2.1 are considered and the infobox of each article is extracted. Since Wikipedia restricts the way infobox keys are represented, a map of keys is constructed across the languages. Given that the keys are mapped, the corresponding values are mapped next; this forms the second step. We reduce the number of words in each value pair by removing the highly scored word pairs of dictionary 1. After this step, the preprocessing described in Section 3.1 is performed and the remaining value pairs form the parallel file. Formula 1 is then applied. An analysis similar to that in Section 3.2.1 is done to find the threshold, and the top 10 words are indexed, which forms dictionary 2.

3.2.3 Categories

The process used for titles is repeated here, except that instead of the title pairs of cross-language-linked articles, we consider the categories of the articles. The only other difference is the number of words that are indexed: after performing the analysis, the threshold for categories is 10. Hence the top-10 words are used to build the index, and the index forms dictionary 3.

3.2.4 Parallel Text

The initial lines of a Wikipedia article (which generally summarize the entire article) are considered for the articles linked by a cross-lingual link. We take the lines up to the first heading, i.e. intuitively, we consider the abstract of the article. The parallel corpus is generated from near-parallel sentences extracted from these introduction lines of the cross-language-linked articles. The similarity between each line of the English text and each line of the Hindi text is calculated using the Jaccard similarity coefficient, using the previously built dictionary 1, dictionary 2 and dictionary 3. The candidate English and Hindi sentences for which the similarity coefficient is maximum are considered to be near parallel and are used to build the parallel file.
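A minimal sketch of this near-parallel sentence selection is given below. It assumes lines are already tokenized and that each dictionary maps an English word to scored Hindi candidates; the helper names (jaccard, translate_tokens, align_near_parallel) are hypothetical and not taken from the paper.

from typing import Dict, List, Tuple

Dictionary = Dict[str, List[Tuple[str, float]]]

def jaccard(a: set, b: set) -> float:
    # Jaccard similarity coefficient between two token sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

def translate_tokens(tokens: List[str], dictionaries: List[Dictionary]) -> set:
    # Map English tokens to candidate Hindi words using dictionaries 1-3.
    out = set()
    for token in tokens:
        for d in dictionaries:
            out.update(hindi for hindi, _score in d.get(token, []))
    return out

def align_near_parallel(en_lines: List[List[str]],
                        hi_lines: List[List[str]],
                        dictionaries: List[Dictionary]):
    # For each English line of the abstract, pick the Hindi line with the
    # highest Jaccard similarity; that pair is treated as near parallel.
    pairs = []
    for en in en_lines:
        translated = translate_tokens(en, dictionaries)
        scored = [(jaccard(translated, set(hi)), hi) for hi in hi_lines]
        if scored:
            best_score, best_hi = max(scored, key=lambda x: x[0])
            if best_score > 0:
                pairs.append((en, best_hi))
    return pairs

Each selected pair is then passed through the same elimination, preprocessing and Formula 1 scoring steps as the other information types, as described next.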
We remove the already existing mappings between the pair of lines by using the dictionaries, and perform the preprocessing described in Section 3.1 before populating the parallel file. Formula 1 is again used to calculate the score for each word pair. On analysis, the top 2 scored words contain the translation, and hence the top 2 words with their scores are indexed to form dictionary 4.

4 Results

4.1 Dataset

Three sets of 300 English words each are generated from existing English-Hindi and English-Telugu dictionaries. These 300 words contain a mix of most frequently used to less frequently used words, where word frequency is determined from Hindi and Telugu news corpora. The words are POS-tagged to allow a tag-based analysis.

4.2 Evaluation

Precision and recall are calculated for the dictionaries built.
Precision, which measures accuracy, is the ratio of the number of words that are correctly mapped to the total number of words for which a mapping exists in the test set:

Precision = \frac{ExtractedCorrectMappings}{AllExtractedMappings}    (2)

Recall, which measures coverage, is the fraction of the test set for which a mapping exists:

Recall = \frac{ExtractedCorrectMappings}{CorrectMappingsUsingAvailableDictionary}    (3)

The evaluation is both manual and automatic, because one English word can map to various words in Hindi. Similarly, a Hindi word can have various spellings (due to different characters), and as we are not using language-specific resources, root words and different word forms (based on sense, POS tags, etc.) are not extracted. Hence manual evaluation is required to judge the correctness of the results. For manual evaluation, the results are marked either right or wrong by native speakers, whereas in automatic evaluation we check whether the result word is the same as that returned by any other available dictionary for the language pair.

4.3 Empirical Results

The precision and recall for the three test sets are listed in Table 1 (Hindi) and Table 2 (Telugu).

Table 1: Manual and Automatic Evaluations (Hindi); precision and recall of the automated, manual and title-based evaluations for Sets 1-3

Table 2: Manual and Automatic Evaluations (Telugu); precision and recall of the automated, manual and title-based evaluations for Sets 1-3

As mentioned in Section 4.2, manual evaluation has been carried out to check the precision; the corresponding values are also listed in Tables 1 and 2. The precision when evaluated manually is high compared to the automatic evaluation. This can be attributed to a few factors: 1. various context-based translations for a single English word; 2. a different word form being returned than the one present in the dictionary; 3. different spellings (different characters) for the same word.

The results for the same datasets with the baseline dictionary, built using only the titles, are also listed in Tables 1 and 2. The recall is low when using titles alone, and the precision achieved by manual evaluation is very high compared to that of the baseline dictionary. The F-measures for each dataset using the three methods are listed in Tables 3 and 4. The principal goal of using Wikipedia is to cover a large terminology, and hence recall is as important as precision; in such cases, our method outperforms the baseline dictionary. In the case of proper nouns, where existing MT methods fail due to the unavailability of parallel corpora, our method gives encouraging results, as listed in Table 5.

Table 5: Accuracy for proper nouns (Hindi); precision, recall and F-measure (F1)

The coverage of the dictionaries can be estimated by considering the number of words present in the dictionary. Figure 4 details the number of unique words (y-axis) that are freshly added from each information type (x-axis).

Figure 4: Coverage of dictionaries (y-axis) with respect to each information type (x-axis)

The maximum word count for bilingual pairs in (Tyers and Pienaar, 2008) is 4913, which is far smaller than in our system. Though the number of articles and the cross-language links affect the coverage of the dictionaries, our system performs considerably well in extracting bilingual word pairs.
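As an aside on how such numbers can be obtained, the sketch below computes the precision and recall of Formulas 2 and 3 and the F-measures reported in Tables 3 and 4; the per-word counting and all names here are our own simplifying assumptions, not the exact evaluation script.

def evaluate(extracted, gold):
    # extracted: English test word -> list of extracted Hindi words
    # gold: English test word -> set of accepted Hindi translations
    #       (an available dictionary for automatic evaluation,
    #        or native-speaker judgements for manual evaluation)
    correct = sum(1 for en, hindi_words in extracted.items()
                  if any(h in gold.get(en, set()) for h in hindi_words))
    precision = correct / len(extracted) if extracted else 0.0   # Formula 2
    recall = correct / len(gold) if gold else 0.0                # Formula 3
    return precision, recall

def f_measure(precision, recall, beta=1.0):
    # F_beta; beta = 0.5, 1 and 2 give the F0.5, F1 and F2 values of Tables 3 and 4.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)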
It is found that the dictionaries generated from titles and infoboxes are more accurate than the dictionaries constructed using categories and text. Another interesting observation is that the word pairs generated from categories and text, though not accurate, are related to the query word. This feature can be exploited in CLIA systems, where query expansion in the target language plays a vital role. A summary of the results is given in Tables 3 and 4.

Table 3: F-Measure Accuracy (Hindi); F0.5, F1 and F2 for the baseline dictionary, automatic evaluation and manual evaluation on Sets 1-3

Table 4: F-Measure Accuracy (Telugu); F0.5, F1 and F2 for the baseline dictionary, automatic evaluation and manual evaluation on Sets 1-3

4.4 Comparison with other existing dictionary building systems

Table 6 compares our method with (Erdmann et al., 2008), who built an English-Japanese dictionary from Wikipedia using various approaches. The results considered from their work are the precision and recall their system attained for high- and low-frequency words. For high-frequency words, they tuned their system to achieve high precision, whereas for low-frequency words their system concentrated on achieving high recall. Our results are the average precision and recall over the three datasets. These statistics show that our method performs on par with existing methods for extracting multilingual dictionaries from Wikipedia automatically.

Table 6: Comparison with other existing approaches; precision and recall for the existing high-precision setting, the existing high-recall setting and our approach (Hindi)

5 Conclusion and future work

We have described the automatic construction of a bilingual dictionary over Wikipedia by an iterative method. Overall, the construction process consists of identifying near-comparable corpora from the title, infobox, category and text of the articles linked by ILL. Our focus has been low-resourced languages. The system is extensible, as it is language-independent and no external language resource has been used. We are working on applying the system to query building in CLIA systems. Before moving beyond Wikipedia, we want to consider the entire text of the Wiki article and other structural information like headings and anchor text to enhance the coverage of the dictionary. Further, multi-word translation is also envisioned. Our work will be of particular importance to under-resourced languages for which comparable corpora are not easily obtainable.

References

S.F. Adafre and M. de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In NEW TEXT: Wikis and blogs and other dynamic text sources, page 62.
J.W. Breen. 2004. JMdict: A Japanese-Multilingual Dictionary. In COLING Multilingual Linguistic Resources Workshop.

P.F. Brown, J. Cocke, S.A.D. Pietra, V.J.D. Pietra, F. Jelinek, J.D. Lafferty, R.L. Mercer, and P.S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2).

M. Erdmann, K. Nakayama, T. Hara, and S. Nishio. 2008. An approach for extracting bilingual terminology from Wikipedia. In Database Systems for Advanced Applications. Springer.

M. Erdmann, K. Nakayama, T. Hara, and S. Nishio. 2009. Improving the extraction of bilingual terminology from Wikipedia. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 5(4):1-17.

P. Fung and K. McKeown. 1997. A technical word- and term-translation aid using noisy parallel corpora across language groups. Machine Translation, 12(1).

M. Kay and M. Roscheisen. 1993. Text-translation alignment. Computational Linguistics, 19(1).

F. Sadat, M. Yoshikawa, and S. Uemura. 2003. Bilingual terminology acquisition from comparable corpora and phrasal translation to cross-language information retrieval. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 2. Association for Computational Linguistics.

F.M. Tyers and J.A. Pienaar. 2008. Extracting bilingual word pairs from Wikipedia. In Collaboration: interoperability between people in the creation of language resources for less-resourced languages, page 19.

N. Wardrip-Fruin and N. Montfort. 2003. The New Media Reader. The MIT Press.