Wikipedia and Web document based Query Translation and Expansion for Cross-language IR
|
|
|
- Pamela Pope
- 10 years ago
- Views:
Transcription
1 Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia {l4.tang,s.geva, yue.xu}@qut.edu.au 2 Department of Computer Science, University of Otago, Dunedin, New Zealand [email protected] ABSTRACT In this paper we describe our approaches to retrieving crosslingual documents for question answering in the NTCIR ACLIA-IR4QA task. A few Chinese indexing techniques were used in our experiments. We mainly focused on using external recourses: web documents and Wikipedia for the key phrase identification, translation and query expansion. The evaluation shows encouraging results of our system. Categories and Subject Descriptors H.3.1 Content Analysis and Indexing Indexing methods, Linguistic processing; H.3.3 Information Search and Retrieval Query formulation, Search process. General Terms Algorithms, Design, Experimentation. Keywords NTCIR, Wikipedia, NGMI, Dual Indexing. 1. Introduction In NTCIR-8, we participated in the Advanced Cross-lingual Information Access (ACLIA) IR4QA task, and submitted runs for the English to Chinese CLIR task (both English-to- Simplified Chinese and English-to-Traditional Chinese). IR4QA is an embedded component of an overall cross-lingual question answering (CLQA) system [1, 2]. The answer for a question could be either the result of an End-to-End QA system, or a combination of different QA systems and IR systems. In both cases, the candidate answers for the question are extracted from documents retrieved in a CLIR stage. CLIR plays a very important role in an overall question answering system because the relevancy of retrieved passages or documents determines the accuracy of the answer. A simple approach to achieving CLIR is to translate the query into the language of the documents and to use a mono-lingual IR system. However, for this it is essential to identify the key phrases in the question and to correctly translate them. A good way to achieve high accuracy for query translation is to apply an English POS tagger[3] to the query, to extract the key phrases, and then to translate them. An alternative is to use a statistical machine translation toolkit[4] to obtain bi-lingual phrase pairs and to translate on a phrase by phrase basis. Nowadays, dictionary-based and statistical machine translation can achieve very high accuracy levels when translating general text. However, the complex phrases and possible ambiguity present in a question tax general purpose machine translation approaches. Out of vocabulary terms are particularly a problem. Question Example 1: Question and three translations Google Translate 1 (traditional Chinese) Google Translate (simplified Chinese) correct translation What is the relationship between the movie "Riding Alone for Thousands of Miles" and ZHANG Yimou 什 麼 是 電 影 的 關 係 單 騎 千 里 和 張 藝 謀 之 间 有 什 么 电 影 利 民 为 千 里 单 独 的 关 系 和 张 艺 谋 张 艺 谋 与 电 影 千 里 走 单 骑 是 什 么 关 系 In example 1, neither of the Google translations for ZHANG Yimou s movie "Riding Alone for Thousands of Miles" ( 千 里 走 单 骑 ) is correct. For the traditional Chinese version of the movie name, the translation is partly correct, and it is reasonable to expect it will be an effective query. But for the simplified Chinese version, the translated movie title is nonsense and is full of noise words. In this example, "Riding Alone for Thousands of Miles" is an out- of-vocabulary phrase. Rather than using the traditional machine translation methods described above, we adopted a far simpler strategy. We use a web search engine (Google) to locate a related Wikipedia entry for the question, and then use the title of Chinese page for translation. Query expansion is done using Chinese text (called clue text) encountered in the English results. A detailed description of the approach is presented in section Chinese Document Indexing More than the case with western languages, the methods used in IR indexing of Chinese text vary between research groups. 1
2 IR System Unigrams, bigrams and words are all common tokens used when indexing Chinese text. Different word segmentation algorithms might be used as might different ranking functions. The performance of various IR systems combining different models can vary substantially [5, 6]. In our experiments for IR4QA, we used n-gram mutual information (NGMI) [7] to segment Chinese text. NGMI is an unsupervised n-gram word segmentation approach. It is derived from character-based mutual information, but can additionally recognize words longer than two characters. In order to find the most suitable segmentation strategy for our CLIR system, unigram indexing and dual indexing (unigrams and NGMI segmentation) were used in different experimental runs. 3. QUERY PROCESSING 3.1 Question Analysis independent. Table 1 presents two examples of the term-chunks after stop words have been removed. Table 1. Example term chunks after stop word removal Topic ID ACLIA2-CT-0001 ACLIA2-CT-0024 Terms Chunk so-called steroid give short descriptions, isolation, Heping Branch, Taipei city hospital caused, SARS, 24 th April, 2003 Question? Remove Stop words Example 2: English-to-Chinese NTCIR8- ACLIA topics topic id ACLIA2-CS-0002 question(en) What is the relationship between the movie "Riding Alone for Thousands of Miles" and ZHANG Yimou? question(zh) 千 里 走 单 骑 和 张 艺 谋 是 什 么 关 系? narrative(en) The user wants to know the relationship between the movie "Riding Alone for Thousands of Miles" and ZHANG Yimou. narrative(zh) 使 用 者 想 知 道 千 里 走 单 骑 和 张 艺 谋 之 间 有 何 关 系 Four-stage keyphrase translation and expansion Google search 1 Google search 2 Chinese collected from search results 1 2 Term chunks English Wikipedia Chinese Wikipedia Title of Chinese Wikipedia page 3 Chinese Query Terms 4 Google online translation Translations topic id question(en) question(zh) narrative(en) narrative(zh) ACLIA2-CT-0024 Give short descriptions to the isolation of the Heping Branch of Taipei city hospital caused by SARS on 24th of April [ 請 簡 述 2003 年 4 月 24 日, 和 平 醫 院 因 SARS 封 院 的 始 末 The analyst would like to know the truth that Taipei City Hospital - Heping Branch had to be isolated from others because of group SARS cases. 使 用 者 想 知 道 2003 年 4 月 24 日, 和 平 醫 院 爆 發 集 體 感 染 SARS 病 例 而 封 院 的 經 過 Example 2 shows two queries from the English-to-Chinese NTCIR8-ALCIA topic set. They are complex questions and it is important to be able to precisely translate them. Commonly, topics are classified into different domains and receive special processing accordingly. The quality of the query can be improved by providing additional terms for query expansion or removing noise words. For example, recognising a biography question can help in query expansion by appending the query with the term born ; or removing the noise phrase 是 什 么 关 系 (what is the relation between) for relationship questions. In NTCIR 8 ACLIA there are a few pre-defined question types: DEFINITION, BIOGRAPHY, RELATIONSHIP, EVENT, WHY, PERSON, ORGANIZATION, LOCATION and DATE. Rather than using complex natural language processing, we employ a naive question processing method that simply breaks the sentence down into term chunks by removing stop words. The stop word list contains 313 words extracted from Medline[8]. The benefit of this method is that it is question 3.2 CLIR System Architecture We adopted a simple all-in-one strategy for query key term recognition, translation, query expansion, and to address the out- of-vocabulary problem. It is depicted in Figure 1. A Google search for the best English Wikipedia pages is done using the English question. The Chinese equivalent page is found by following the language link. Finally the title of the Chinese Wikipedia page is used as the translation of the query. If this fails we use Google Translate. This approach relies on following observations: Indices Relevant documents for QA processing Chinese corpus Figure 1. The design of CLIR system.. The Chinese Wikipedia has over 100,000 entries describing various events, people, organizations, locations, and facts. Most importantly, there are links between English articles and their corresponding Chinese counterparts. When people post information on the Internet, they often provide a translation (where necessary) in the same web documents. These pages contain bi-lingual phrase pairs.
3 A web search engines such as Google identify Wikipedia entries and bi-lingual web documents that are closely related to a query. In summary, we use three different sources for query translation in our strategy. They are listed in Table 2. Table 2: Translation sources Name Method Google bi-lingual web documents Wikipedia multilingual encyclopedia Google Translate machine translation [9] 3.3 The Query Translation Algorithm The Chinese translation of the English question is generated from the following steps: 1. Remove stop words from the English question to generate term chunks. 2. Search Google using the term chunks with options set to return only the Chinese and English page. If there is a Wikipedia page at the top 10 then go to step 4, else go to step 3. In both cases Chinese text present in the English results list is collected this clue text is used for query expansion. 3. Calculate the frequency of words appearing in the title of the results from step 2. Choose the top ranking ones with a frequency above a threshold. The threshold is set to 3 to begin with then linearly decreased by one until at least one term is found. Search Google using these words, but restrict the search to the English Wikipedia. The highest ranking result containing any term from the question is considered the correct result. 4. Retrieve the English Wikipedia page. 5. Retrieve the corresponding Chinese Wikipedia page by following the languages link from the English page. If there is no Chinese link found, then jump to step Extract the title of the Chinese Wikipedia page, and use it as the query. 7. Process the clue text if there is any for query expansion. Either the whole or some frequent words of the clue text could be used as expansion terms. Append these additional terms to the Chinese query. The frequent words are chosen from the NGMI segmented terms in the clue text. The frequency of those words has to be above a threshold. The threshold is set to 5 to begin with, then linearly decreased by one until at least one term is found. 8. If no Chinese query terms have been found then use the translation from Google Translate on the term chunks.. Otherwise, the results of this machine translation are optionally appended to the query. 9. Search the corpus using either: single character segmentation; or single character segmentation plus NGMI segmented terms. The resulting query can be thought-of as: Chinese query = The title of the Chinese Wikipedia document found by searching Google for the English Wikipedia document and following the languages link + Chinese clue text collected from Google search (all or high frequency words) + Result from Google Translate (optional) 4. Weighting Model A slightly modified BM25 ranking function was used for document ordering. When calculating the inverse document frequency, we use: IDF(q i ) = log N n Where N is the number of documents in the corpus, and n is the document frequency of query term q i. The retrieval status value of a document d with respect to query q(q 1,, q m ) is calculated as: rsv q, d = m tf q i,d k i=0 tf q i,d + k 1 1 b + b len d avgdl IDF q i Where tf(q i, d) is the term frequency of term q i in document d; len d is the length of document d in words and avgdl is the mean document length. When indexing using the dual indexing strategy only the actual words (not the unigrams) are counted towards the document length. The values of tunable parameters k 1 and b used in our experiments are: 0.9 and 0.4 respectively. 5. Experiments We used combinations of different segmentation approaches and query translation approaches to generate five different runs for each English-to-Chinese task. The detail of the differences between these runs is given in Table 7 and Table 8. The 01 and 02 runs used blind translation from Google and Wikipedia with translation from the online translation service (if Wikipedia and web document based translation failed). The 03, 04, and 05 runs combine the translation results from three different sources: web documents, Wikipedia, and machine translation. We are also interested in investigating how segmentation affects cross-lingual IR. To test this we employed two different indexing and query segmentation strategies in our runs. The 01 and 03 runs used only single Chinese character indexing and searching. The 02, 04, and 05 runs used dual indexing and searched the documents using single characters and words. In our experiments web documents form an important data source for query expansion. As the source language of the web documents may affect the results, we created two additional runs to examine this. Run EN-CS 05 searched only in the cn domain; while run EN-CT 05 search only in the tw domain. Overall, the 01, 02, 03, 04 runs share the same Google search results and clue text, but run 05 is different. For different experimental run groups, the statistics of the number of discovered Wikipedia pages, collected clue text and total topics in which web search failed are given in Table 3 and Table 4. From those two tables we can see that a large portion, almost 9/10, of the questions found their related Wikipedia pages. Using bilingual search on Google contributes about 1/3 of the clue text. It also can be seen that domain search is a good way to increase the chance of finding more clue text as the amount of clue text nearly triples for the EN-CS runs, and quadruples for the EN-CT runs. However, it reduces the possibility of (1) (2)
4 finding the relevant Wikipedia page because of the restrictive search. Table 3. The statistics of the EN-CS runs. #FAIL is the number of total topics for which the web search could not obtain either a Chinese Wikipedia page or Chinese clue text. Run # EN # ZH # Clue # FAIL Wiki Wiki Text 01, 02, 03, Table 4. The statistics of the EN-CT runs. #FAIL is the number of total topics for which the web search could not obtain either a Chinese Wikipedia page or Chinese clue text. Run # EN # ZH # Clue # FAIL Wiki Wiki text 01, 02, 03, IR4QA Results AND DISCUSSION The official evaluation results (before bug fixes) of our submissions on the two tasks are given in Table 5 and Table 6. It can be seen that the distributions of the mean average precision (MAP), mean Q (MQ), and mean ndcg (MnDCG) are similar for both the EN-CS and EN-CT tasks. Runs 01 and 02 that use just web documents and the Wikipedia (where possible) achieve a relatively low precision. This, however, does indicate that web documents and the Wikipedia may be good resources for query translation. The runs using NGMI word dual indexing (runs 02, 04, and 05) bettered the equivalent character based indexing in all cases. In both cases the runs that restricted the web search to a particular domain (run 05) performed worse than the equivalent run that did not site restrict. As can be observed from Table 3 and Table 4, the probability of finding clue text increases, and the clue text collected in Google results gets larger due to the specified Chinese domain search. However, as a side-effect, noise words increase, and the hope for finding a relevant Wikipedia page become slim. Therefore, the performance of runs with site restriction could deteriorate if the increased noise words are not taken care of. It is also interesting to see that the QUTIS-EN-CS runs, QUTIS-EN-CS-01-T particularly, contribute the highest number of unique relevant documents in all CS submissions according to the official NTCIR assessment results[10]. This could be largely because our Wikipedia and web document based query translation and expansion approach contributes extra unexpected good query terms. Overall, our runs did not perform well when compared to the best run submitted to the task and so care must be taken when drawing conclusions. 7. CONCLUSIONS AND FUTURE WORK We implemented a simple strategy for query translation and expansion for cross-lingual question answering. This strategy relies on an external resource such as web documents and the Wikipedia to tackle the out-of-vocabulary problem. The performance of our system and highest number of unique relevant documents found in EN-CS experimental runs are encouraging. In future work we will use logical and contextual analysis to look for better translations of out-of-vocabulary phrases. We will also continue our work in using the Wikipedia as a method of translation and query expansion. Table 5: Official EN-CS IR4QA results BEFORE bug fixes RUN ID MAP MQ MnDCG EN-CS best QUTIS-EN-CS-01-T QUTIS-EN-CS-02-T QUTIS-EN-CS-03-T QUTIS-EN-CS-04-T QUTIS-EN-CS-05-T Table 6: Official EN-CT IR4QA results BEFORE bug fixes RUN ID MAP MQ MnDCG EN-CT best QUTIS-EN-CT-01-T QUTIS-EN-CT-02-T QUTIS-EN-CT-03-T QUTIS-EN-CT-04-T QUTIS-EN-CT-05-T References [1] T. Mitamura, et al., "Overview of the NTCIR-7 ACLIA Tasks: Advanced Cross-Lingual Information Access," in Proceedings of NTCIR-7, 2008, pp [2] T. Mitamura, et al., "Overview of the NTCIR-8 ACLIA Tasks: Advanced Cross-Lingual Information Access," in Proceedings of NTCIR-8, to appear, [3] L. Shi, et al., "RALI Experiments in IR4QA at NTCIR-7," in NTCIR-7, 2008, pp [4] W. Luo, et al., "ICT-Crossn: The System of Cross-lingual Information Retrieval of ICT in NTCIR-7," 2008, pp [5] W. P. L. Robert and K. L. Kwok, "A comparison of Chinese document indexing strategies and retrieval models," ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, pp , [6] A. Chen, et al., "Chinese text retrieval without using a dictionary," in SIGIR '97: Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, pp [7] L.-X. Tang, et al., "Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information," presented at the 14th Australasian Document Computing Symposium (ADCS 2009), University of New South Wales, Sydney, [8] U.S. National Library of Medicine. MEDLINE /PubMed Baseline Repository. Available: [9] Wikipedia, "Google Translate," ed: [10] T. Sakai, et al., "Overview of NTCIR-8 ACLIA IR4QA," in Proceedings of NTCIR-8, to appear, 2010.
5 Table 7: EN-CS Runs Description RUNID Index Units Translation Query Units IR Model QUTIS-EN-CT-01-T unigram 1.Web search + 2. [google translate, if 1 fails] unigram BM25 QUTIS-EN-CT-02-T unigram + word 1.web search + 2. [google translate, if 1 fails] unigram + word BM25 QUTIS-EN-CT-03-T unigram web search + google translate unigram BM25 QUTIS-EN-CT-04-T unigram + word web search + google translate unigram + word BM25 QUTIS-EN-CT-05-T unigram + word web search (site: tw) + google translate unigram + word BM25 Table 8: EN-CT Runs Description RUNID Index Units Translation Query Units IR Model QUTIS-EN-CS-01-T unigram 1.web search + 2. [google translate, if 1 fails] unigram BM25 QUTIS-EN-CS-02-T unigram + word 1.web search + 2. [google translate, if 1 fails] unigram + word BM25 QUTIS-EN-CS-03-T unigram web search + google translate unigram BM25 QUTIS-EN-CS-04-T unigram + word web search + google translate unigram + word BM25 QUTIS-EN-CS-05-T unigram + word web search (site:cn) + google translate unigram + word BM25
Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines
, 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing
QUT Digital Repository: http://eprints.qut.edu.au/
QUT Digital Repository: http://eprints.qut.edu.au/ Lu, Chengye and Xu, Yue and Geva, Shlomo (2008) Web-Based Query Translation for English-Chinese CLIR. Computational Linguistics and Chinese Language Processing
Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure
Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure Fuminori Kimura Faculty of Culture and Information Science, Doshisha University 1 3 Miyakodani Tatara, Kyoutanabe-shi,
Micro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval
Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection
Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection Jian Qu, Nguyen Le Minh, Akira Shimazu School of Information Science, JAIST Ishikawa, Japan 923-1292
TREC 2003 Question Answering Track at CAS-ICT
TREC 2003 Question Answering Track at CAS-ICT Yi Chang, Hongbo Xu, Shuo Bai Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China [email protected] http://www.ict.ac.cn/
A Study of Using an Out-Of-Box Commercial MT System for Query Translation in CLIR
A Study of Using an Out-Of-Box Commercial MT System for Query Translation in CLIR Dan Wu School of Information Management Wuhan University, Hubei, China [email protected] Daqing He School of Information
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
An Information Retrieval using weighted Index Terms in Natural Language document collections
Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia
Customizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects
Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Mohammad Farahmand, Abu Bakar MD Sultan, Masrah Azrifah Azmi Murad, Fatimah Sidi [email protected]
Finding Advertising Keywords on Web Pages. Contextual Ads 101
Finding Advertising Keywords on Web Pages Scott Wen-tau Yih Joshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University Contextual Ads 101 Publisher s website Digital Camera Review The
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
Word Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute
Query term suggestion in academic search
Query term suggestion in academic search Suzan Verberne 1, Maya Sappelli 1,2, and Wessel Kraaij 2,1 1. Institute for Computing and Information Sciences, Radboud University Nijmegen 2. TNO, Delft Abstract.
An Unsupervised Approach to Domain-Specific Term Extraction
An Unsupervised Approach to Domain-Specific Term Extraction Su Nam Kim, Timothy Baldwin CSSE NICTA VRL University of Melbourne VIC 3010 Australia [email protected], [email protected] Min-Yen Kan School of
Cross-Lingual Concern Analysis from Multilingual Weblog Articles
Cross-Lingual Concern Analysis from Multilingual Weblog Articles Tomohiro Fukuhara RACE (Research into Artifacts), The University of Tokyo 5-1-5 Kashiwanoha, Kashiwa, Chiba JAPAN http://www.race.u-tokyo.ac.jp/~fukuhara/
The University of Amsterdam s Question Answering System at QA@CLEF 2007
The University of Amsterdam s Question Answering System at QA@CLEF 2007 Valentin Jijkoun, Katja Hofmann, David Ahn, Mahboob Alam Khalid, Joris van Rantwijk, Maarten de Rijke, and Erik Tjong Kim Sang ISLA,
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 [email protected] 2 [email protected] Abstract A vast amount of assorted
Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value
, pp. 397-408 http://dx.doi.org/10.14257/ijmue.2014.9.11.38 Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value Mohannad Al-Mousa 1
A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks
A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,
THUTR: A Translation Retrieval System
THUTR: A Translation Retrieval System Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for
The University of Lisbon at CLEF 2006 Ad-Hoc Task
The University of Lisbon at CLEF 2006 Ad-Hoc Task Nuno Cardoso, Mário J. Silva and Bruno Martins Faculty of Sciences, University of Lisbon {ncardoso,mjs,bmartins}@xldb.di.fc.ul.pt Abstract This paper reports
Exploiting Bilingual Translation for Question Retrieval in Community-Based Question Answering
Exploiting Bilingual Translation for Question Retrieval in Community-Based Question Answering Guangyou Zhou, Kang Liu and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese
Network Big Data: Facing and Tackling the Complexities Xiaolong Jin
Network Big Data: Facing and Tackling the Complexities Xiaolong Jin CAS Key Laboratory of Network Data Science & Technology Institute of Computing Technology Chinese Academy of Sciences (CAS) 2015-08-10
CENG 734 Advanced Topics in Bioinformatics
CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the
Sentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)
The Development of Multimedia-Multilingual Storage, Retrieval and Delivery for E-Organization (STREDEO PROJECT) Asanee Kawtrakul, Kajornsak Julavittayanukool, Mukda Suktarachan, Patcharee Varasrai, Nathavit
On the Fly Query Segmentation Using Snippets
On the Fly Query Segmentation Using Snippets David J. Brenes 1, Daniel Gayo-Avello 2 and Rodrigo Garcia 3 1 Simplelogica S.L. [email protected] 2 University of Oviedo [email protected] 3 University
LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task
LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task Jacinto Mata, Mariano Crespo, Manuel J. Maña Dpto. de Tecnologías de la Información. Universidad de Huelva Ctra. Huelva - Palos de la Frontera s/n.
Optimization of Internet Search based on Noun Phrases and Clustering Techniques
Optimization of Internet Search based on Noun Phrases and Clustering Techniques R. Subhashini Research Scholar, Sathyabama University, Chennai-119, India V. Jawahar Senthil Kumar Assistant Professor, Anna
Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies
Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1
An Iterative Link-based Method for Parallel Web Page Mining
An Iterative Link-based Method for Parallel Web Page Mining Le Liu 1, Yu Hong 1, Jun Lu 2, Jun Lang 2, Heng Ji 3, Jianmin Yao 1 1 School of Computer Science & Technology, Soochow University, Suzhou, 215006,
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
Document Image Retrieval using Signatures as Queries
Document Image Retrieval using Signatures as Queries Sargur N. Srihari, Shravya Shetty, Siyuan Chen, Harish Srinivasan, Chen Huang CEDAR, University at Buffalo(SUNY) Amherst, New York 14228 Gady Agam and
Analysis of Tweets for Prediction of Indian Stock Markets
Analysis of Tweets for Prediction of Indian Stock Markets Phillip Tichaona Sumbureru Department of Computer Science and Engineering, JNTU College of Engineering Hyderabad, Kukatpally, Hyderabad-500 085,
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking Anne-Laure Ligozat LIMSI-CNRS/ENSIIE rue John von Neumann 91400 Orsay, France [email protected] Cyril Grouin LIMSI-CNRS rue John von Neumann 91400
Automatic Text Processing: Cross-Lingual. Text Categorization
Automatic Text Processing: Cross-Lingual Text Categorization Dipartimento di Ingegneria dell Informazione Università degli Studi di Siena Dottorato di Ricerca in Ingegneria dell Informazone XVII ciclo
Building a Spanish MMTx by using Automatic Translation and Biomedical Ontologies
Building a Spanish MMTx by using Automatic Translation and Biomedical Ontologies Francisco Carrero 1, José Carlos Cortizo 1,2, José María Gómez 3 1 Universidad Europea de Madrid, C/Tajo s/n, Villaviciosa
Monitoring Web Browsing Habits of User Using Web Log Analysis and Role-Based Web Accessing Control. Phudinan Singkhamfu, Parinya Suwanasrikham
Monitoring Web Browsing Habits of User Using Web Log Analysis and Role-Based Web Accessing Control Phudinan Singkhamfu, Parinya Suwanasrikham Chiang Mai University, Thailand 0659 The Asian Conference on
Finding Advertising Keywords on Web Pages
Finding Advertising Keywords on Web Pages Wen-tau Yih Microsoft Research 1 Microsoft Way Redmond, WA 98052 [email protected] Joshua Goodman Microsoft Research 1 Microsoft Way Redmond, WA 98052 [email protected]
How To Write A Summary Of A Review
PRODUCT REVIEW RANKING SUMMARIZATION N.P.Vadivukkarasi, Research Scholar, Department of Computer Science, Kongu Arts and Science College, Erode. Dr. B. Jayanthi M.C.A., M.Phil., Ph.D., Associate Professor,
AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS
AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS Alok Ranjan Pal 1, 3, Anirban Kundu 2, 3, Abhay Singh 1, Raj Shekhar 1, Kunal Sinha 1 1 College of Engineering and Management,
Korean-Chinese Person Name Translation for Cross Language Information Retrieval
Korean-Chinese Person Name Translation for Cross Language Information Retrieval Yu-Chun Wang ab, Yi-Hsun Lee a, Chu-Cheng Lin ac, Richard Tzong-Han Tsai d*, Wen-Lian Hsu a a Institute of Information Science,
GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns
GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns Stamatina Thomaidou 1,2, Konstantinos Leymonis 1,2, Michalis Vazirgiannis 1,2,3 Presented by: Fragkiskos Malliaros 2 1 : Athens
Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment
2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. [email protected]
Finding Negative Key Phrases for Internet Advertising Campaigns using Wikipedia
Finding Negative Key Phrases for Internet Advertising Campaigns using Wikipedia Martin Scaiano University of Ottawa [email protected] Diana Inkpen University of Ottawa [email protected] Abstract
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 5, Sep-Oct 2015
RESEARCH ARTICLE Multi Document Utility Presentation Using Sentiment Analysis Mayur S. Dhote [1], Prof. S. S. Sonawane [2] Department of Computer Science and Engineering PICT, Savitribai Phule Pune University
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge White Paper October 2002 I. Translation and Localization New Challenges Businesses are beginning to encounter
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5
So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
Incorporating Window-Based Passage-Level Evidence in Document Retrieval
Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological
Pre-processing of Bilingual Corpora for Mandarin-English EBMT
Pre-processing of Bilingual Corpora for Mandarin-English EBMT Ying Zhang, Ralf Brown, Robert Frederking, Alon Lavie Language Technologies Institute, Carnegie Mellon University NSH, 5000 Forbes Ave. Pittsburgh,
Chapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
Towards a Visually Enhanced Medical Search Engine
Towards a Visually Enhanced Medical Search Engine Lavish Lalwani 1,2, Guido Zuccon 1, Mohamed Sharaf 2, Anthony Nguyen 1 1 The Australian e-health Research Centre, Brisbane, Queensland, Australia; 2 The
HPI in-memory-based database system in Task 2b of BioASQ
CLEF 2014 Conference and Labs of the Evaluation Forum BioASQ workshop HPI in-memory-based database system in Task 2b of BioASQ Mariana Neves September 16th, 2014 Outline 2 Overview of participation Architecture
SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande
A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students
69 A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students Sarathorn Munpru, Srinakharinwirot University, Thailand Pornpol Wuttikrikunlaya, Srinakharinwirot University,
Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
Domain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
Applying Machine Learning to Stock Market Trading Bryce Taylor
Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,
Medical Information-Retrieval Systems. Dong Peng Medical Informatics Group
Medical Information-Retrieval Systems Dong Peng Medical Informatics Group Outline Evolution of medical Information-Retrieval (IR). The information retrieval process. The trend of medical information retrieval
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
