Hybrid Approach to English-Hindi Name Entity Transliteration
|
|
|
- Hugo Francis
- 9 years ago
- Views:
Transcription
1 Hybrid Approach to English-Hindi Name Entity Transliteration Shruti Mathur Department of Computer Engineering Government Women Engineering College Ajmer, India Varun Prakash Saxena Department of Computer Engineering Government Women Engineering College Ajmer, India Abstract Machine translation (MT) research in Indian languages is still in its infancy. Not much work has been done in proper transliteration of name entities in this domain. In this paper we address this issue. We have used English-Hindi language pair for our experiments and have used a hybrid approach. At first we have processed English words using a rule based approach which extracts individual phonemes from the words and then we have applied statistical approach which converts the English into its equivalent Hindi phoneme and in turn the corresponding Hindi word. Through this approach we have attained 83.40% accuracy. Keywords Name Entity, Machine Transliteration, Hybrid Approach, Phoneme Identification. I. INTRODUCTION Research in machine Translation (MT) is sixty years old. Still we do not have an MT engine which can provide a good translation. One of the problems for this is improper translation/transliteration of name entities (NEs). Most of the MT engines are not able to address this issue and thus provide poor quality translations. Name entities are basically nouns in a text which are categorized by predefined categories like person name, organization name, location, date and time etc. Most of the MT engines are unable to translate these texts properly. To understand this, let us consider some of the translations produced by Google MT Engine. English Sentence: Ram is studying in Malviya National Institute of Technology. Hindi Translation: र म य गक क म लव य न शनल इ ट य ट म पढ़ रह ह Observations: Ram and Malviya National Institute of Technology are name entities. Only Ram was correctly transliterated. Malviya National Institute of Technology should have been transliterated as म लव य न शनल इ ट य ट ऑफ ट न ल ज or म लव य र य य गक स थ न. English Sentence: Mary May is going to meet Susan June. Hindi Translation: म र मई स स न ज न क प र करन ज रह ह Observations: Mary May and Susan June are names of people which are not transliterated properly. English Sentence: Amber Fort was built by Raja Man Singh in 1592 AD in Jaipur. Hindi Translation: ए बर कल जयप र म 1592 ई. म र ज आदम स ह न बनव य थ. Observations: Amber, Raja Man Singh, 1592 AD and Jaipur are NEs. Amber and Raja Man Singh have not been translated/transliterated properly. By looking at these examples we can clearly understand the need for proper mechanism to handle name entities as without it quality of translation would get affected. In this paper we shall be addressing this issue. We shall be devising a mechanism to implement a proper transliteration mechanism which will handle NEs and would help in improving the quality of machine translated output. The rest of the paper is organized as follows: section II gives an outline of the work done to handle name entities and English-Hindi Machine Transliteration. Section II describes is approach. Section IV shows evaluation and results of the study and Section V concludes the paper. II. LITERATURE SURVERY Many researchers have studied the use of machine transliteration for research in name entity translation or recognition. Babych and Hartley [1] implemented an automatic name entity recognition system which was done on the outputs generated by five different machine translation systems. They incorporated GATE s information extraction module in their systems concluded that combining IE technology with machine translation has a great potential for improving the overall output quality. Al-Onaizan and Knight [2] developed an algorithm for translation Arabic-English name entity phrases. They used both monolingual and bilingual resources and compared their results with the results produced by human translators and some commercial MT systems. They showed that their system had better correlation
2 with human translators than any other system. They achieved an accuracy of 84%. Hassan et al. [3] performed a similar study which was done on extracted translation pairs. They showed that by using their approach the performance of a named entity translation system improves. Jiang et al. [4] used transliteration with web mining in translation of name entities. They used a maximum entropy based approach to train a classifier on pronunciation similarity, bilingual context and co-occurrence. This classifier was used to rank the candidate translations produced. Yeh et al. [5] proposed a pattern matching method for finding name entity s translation online. They developed an algorithm which automatically generated and weighted pattern which were used to search for name entities from bilingual corpus. In an Indian context, Joshi and Mathur [6] proposed a phonetic mapping based algorithm for English-Hindi transliteration system which created a mapping table and a set of rules for transliteration of text. Joshi et al. [7] also proposed a predictive approach of for English-Hindi transliteration. Here instead of generated a single output they provided a list of possible text that can be selected by the user for correct transliteration. They looked at the partial text and tried to provide possible complete list as the suggestive list. Bhalla et al. [8] who used these two approaches for transliterating person and location name entities. Sharma et al. [9] trained a statistical machine translation system which could successfully translate English- Hindi name entities. They used Moses and Phrasal for this purpose. Moore [10] trained a classifier for English-Hindi transliteration using CRF based approach. They showed that using this approach we can successfully translate name entities with 85.79% accuracy and concluded that CRFs are best suited for processing Indian languages. Kharpa et al. [11] proposed a compositional machine transliteration where several transliteration approaches were combined to improve the accuracy. Their experiments showed the benefits of compositional methodology using some state of the art machine transliteration approaches. Agrawal and Singla [12] used three pronged approach in translating name entities. They used an aligner which generated English equivalents for Chinese name entities, a language model which improved the readability and a ranker which selected the best weighted translations. Ameta et al. [13] developed a transliteration system for Guajarati-Hindi language pair and used it in their Gujarati-Hindi translation engine which could effectively translate Gujarati name entities into Hindi. Bhalla et al. [14] used the Moses toolkit for generating translations for English- Punjabi name entities and claimed an accuracy of 88%. III. A. Experimental Setup PROPOSED SYSTEM In order to implement a transliteration system for English- Hindi, we first analyzed the spellings of different words and found that for Indian languages, most people use different spellings for the same words and all these spellings are taken to be correct. For example, let us consider the word भ रत. For this Hindi word, different people would use the following spellings: Bharat, Bharath, Bhaaratha, and Bhaarat. This is a non-exhaustive list and there can be many more variations to this word. Since this is a general phenomenon and cannot be captured by a rule based approach to transliteration. We tried to apply a hybrid approach to this. First we defined rules to capture phonemes of the English words. We identified that a word can be divided in seven different phonemes which are a group of vowels (V) and consonants (C). Table I shows these phonemes. Once this was done, we collected the text from the web. We generally used news sites for this. We collected 10,000 sentences from these sites. In order to identify name entities we used Stanford s NER [15] tool for name entity extraction. In all we extracted 42,371 name entities. Table II shows the statistics of these name entities. After extraction of name entities, we applied our phonetic algorithm onto them and extracted different phonemes. Then each English phonemes was transliterated into Hindi, thus this created a knowledgebase of English and Hindi phonemes. We then used the ngram probability calculation [16] to generate the probabilities on this English-Hindi phoneme knowledgebase. We used equation 1 to compute the probabilities. Prob(Hindi) = (, ) ( ) Here, Prob(Hindi) was the probability of the Hindi phoneme, while Count(English, Hindi) was the count of the number of times a combination of a English and Hindi phonemes were seen in the knowledge base and Count(English) was the total count of the occurrence of a particular English phoneme in the knowledgebase. Table III shows the snapshot of this knowledge base. TABLE I PHONEMES WITH EXAMPLES S.No. Combination 1. V 2. CV 3. VC 4. CVC 5. CCVC 6. CVCC 7. VCC TABLE II STATISTICS OF NAME ENTITIES IN TRAINING CORPUS S.No. Name Entity Count 1. Person 17, Location 11, Organization 8, Date 1, Time 1, Misc 1,398 Total 42,371 (1)
3 TABLE III SNAPSHOT OF ENGLISH-HINDI PROBABILITY KB English Phoneme Hindi Phoneme Probability a अ bh भ i ई ra र ro र shi श In this sentences Ram and Bhopal are name entities and are identified by NER module. These name entities are then passed onto the phonification module which separately generates phonemes for each word. Ram is split into two phonemes and Bhopal is split into three phonemes. This is then passed onto the conversion module which then transforms these English phonemes into Hindi phonemes and in turn combines them to form the complete word as र म and भ प ल. B. Methodology Since our major goal was to improve the accuracy in translating the name entities, we used the Stanford NER tool on English sentences and extracted the name entities. We used six class name entities as shown in table II. After extraction of English name entities, we applied our phonification algorithm and extracted the individual phonemes. Once this was done, we generated the equivalent Hindi phoneme for each English phoneme. Once all the phonemes for an English word were translated then we combined them to from an equivalent Hindi word. This is shown using the following algorithm and is shown in figure I. Input: English Phoneme List Output: Hindi Word Conversion Algorithm 1. Input the English Phoneme List as phoneme. 2. Read English-Hindi Probability KB as KB 3. phlen = phoneme.length 4. count = 1 5. repeat steps 5 to 8 till count <= phlen 6. generate list of English-Hindi phonemes for phoneme[count] with their respectively probability 7. hinpho[count] = max_prob(phoneme[count]) 8. count += 1 9. combine hinpho to form Hindi word as hword 10. return hword The following examples explain the working of the entire system: English Sentence: Ram is going to Bhopal Stanford NER Output: Ram/Person; Bhopal/Location Phonification Module: Ram => Ra m Bhopal => Bho pa l Conversion Module: Ra => र m => म Bho => भ pa => प l => ल र म भ प ल Fig. I English-Hindi Name Entity Transliteration System English Sentence: Ramesh is going to Delhi to meet Suresh. NER Module: Ramesh/Person; Delhi/Location; Suresh/Person Phonification Module: Ramesh => Ra me sh Conversion Module: Delhi => De lhi Suresh => Su re sh Ra => र me => म sh => श De => द lhi => लह रम श द लह
4 Su => स re => स sh => श स र श In this sentences Ramesh, Suresh and Delhi are name entities. These name entities are then passed onto the phonification module which generates their phonemes. Ramesh is split into three phonemes, Suresh is also split into three phonemes and Delhi is split into two phonemes. This is then passed onto the conversion module which then transforms these English phonemes into Hindi phonemes and in turn combines them to form the complete word as रम श, स र श and द लह. IV. EVALUATION Our basic objective was to develop a mechanism which can effectively handle name entities. This would act a sub-module in any machine translation system and in turn would improve the quality of translation. But in order to incorporate this into any system, we first need to test and verify its accuracy. Thus in order to check the accuracy of the system, we collected 1000 sentences which were not part of the training corpus. These sentences had 9234 name entities. The statistics of this corpus is shown in table IV TABLE IV STATISTICS OF NAME ENTITIES IN TEST CORPUS S.No. Name Entity Count 1. Person 5, Location 2, Organization 1, Date Time Misc 53 Total 9,234 We applied this test corpus onto our system and calculated the accuracy according to standard precision, recall and f- measure. This was done by comparing the results of generated by a human translator who manually transliterated these name entities. They are calculated as per the following equation: Precision (P) = Recall (R) = F Measure = Here, system output is the no. of name entities generated by our system. Reference output is the no. of name entities generated by the human translator and correct is the no. of correct matches between our system s and the human translator are output. (2) (3) (4) In order to provide a level playing field, we gave the output of the NER module to the human translator as well. Table V shows the results of this evaluation. TABLE V Summary of Evaluation Total Name Entities 9,234 System Generated Name 9,180 Entities Human Generated Name 9,234 Entities Correct Name Entities 7,679 Precision Recall F-Measure We attained an accuracy of 83.40%. The reason for this low score was that, our system could very well transliterate the name entities of type Person, Location, Date and Time but most of the name entities of type Organization are not transliterated, they are actually translated. For example, National Institute of Technology is translated into र य य गक स थ न whereas our system gave an output as न शनल इ ट य ट ऑफ ट कन ल ज. According to human translator s reference, this was considered as wrong output. Table VI shows entity wise outputs of our system. TABLE VI ENTITY WISE ANALYSIS OF EVALUATION S.No. Name Entity Count System Correct Output 1. Person 5,263 5,263 4, Location 2,770 2,770 2, Organization 1,108 1, Date Time Misc Total 9,234 9,180 7,679 V. CONCLUSION In this paper we have shown the implementation of a Name Entity Transliteration system for English-Hindi language pair. The system was developed on a hybrid model where English name entities were processing using a set of rules and the conversion of English-Hindi was done statistically. The system did fairly well with all name entities, except for Organization. This was because most organization names change were they are converted from English to Hindi. Thus an immediate future enhancement of this system is to incorporate a translation mechanism to handle this kind of name entities and improve the accuracy of the system. Moreover, some more rules are needed to be added to phonification module, so that better transliterations can be produced.
5 REFERENCES [1] B. Babych, A. Hartley, Improving Machine Translation Quality with Automatic Named Entity Recognition, Proceedings of the 7th International EAMT workshop on MT and other Language Technology Tools, 1-8, 2003/4/13. [2] Y. Al- Onaizan, K. Knight, Translating Name Entities Using Monolingual and Bilingual Resources. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pp , [3] A. Hassan, H. Fahmy and H. Hassan, Improving Named Entity Translation by Exploiting Comparable and Parallel Corpora Preoceedings of AMML07, [4] L. Jiang, M. Zhou, L. Chein, and C. Niu, Name Entity Translation with Web Mining and Transliteration, Proceedings of 20 th International Joint Conference on Artificail Intelligence, pp , Morgan Kaufmann, [5] A. Yeh, A. Morgan, M. Colosimo, and L. Hirchman, BioCreAtIvE Task 1A: gene mention finding evaluation BMC Bioinformatics, [6] N. Joshi and I. Mathur, Input Scheme for Hindi Using Phonetic Mapping. In Proceedings of the National Conference on ICT: Theory, Practice and Applications, [7] N. Joshi, I. Mathur and S. Mathur, Frequency Based Predictive Input System for Hindi. In Proceedings of the International Conference and Workshop on Emerging Trends in Technology, ACM, pp , [8] D. Bhalla, N. Joshi and I. Mathur, Rule Based Transliteration Scheme for English to Punjabi. International Journal of Natural Language Computing, Vol 2, No. 2, pp 67-73, [9] S. Sharma, N. Bora and M. Halder, English-Hindi Transliteration Using Statistical Machine Translation in Different Notation [10] R. Moore, Learning Translations of Name Entity Phrases from Parallel Corpus. Proceedings of EACL, [11] M Khapra, P Bhattacharya, A Kumaran, Compositional Machine Transliteration. ACM Transactions of Asian Language Information Processing, Vol 9, 2010 [12] N. Agrawal and A. Singla, Using named entity recognition to improve machine translation. Technical report, Standford University, Natural Language Processing [13] J. Ameta, N. Joshi and I. Mathur, Improving the Quality of Gujarati- Hindi Machine Translation Through Part-of-Speech Tagging and Stemmer Assisted Transliteration, International Journal of Natural Language Computing, Vol. 2, No. 3, pp , [14] D. Bhalla, N. Joshi and I. Mathur, Imporving the Quality of MT Output Using Novel Name Entity Translation Scheme. Proceedings of 2013 International Conference on Advances in Computing, Communications and Informatics, [15] J.R. Finkel, T. Grenger and C. Manning, Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43nd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp , [16] D. Jurafsky and J. Martin, Speech and Language Processing, Pearson, 1999.
Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines
, 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing
English to Arabic Transliteration for Information Retrieval: A Statistical Approach
English to Arabic Transliteration for Information Retrieval: A Statistical Approach Nasreen AbdulJaleel and Leah S. Larkey Center for Intelligent Information Retrieval Computer Science, University of Massachusetts
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
Discovering suffixes: A Case Study for Marathi Language
Discovering suffixes: A Case Study for Marathi Language Mudassar M. Majgaonker Comviva Technologies Limited Gurgaon, India Abstract Suffix stripping is a pre-processing step required in a number of natural
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University
PPInterFinder A Web Server for Mining Human Protein Protein Interaction
PPInterFinder A Web Server for Mining Human Protein Protein Interaction Kalpana Raja, Suresh Subramani, Jeyakumar Natarajan Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar
Turker-Assisted Paraphrasing for English-Arabic Machine Translation
Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 [email protected] 2 [email protected] Abstract A vast amount of assorted
SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande
A Survey on Product Aspect Ranking
A Survey on Product Aspect Ranking Charushila Patil 1, Prof. P. M. Chawan 2, Priyamvada Chauhan 3, Sonali Wankhede 4 M. Tech Student, Department of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra,
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering
Identifying Focus, Techniques and Domain of Scientific Papers
Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of
31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2
Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,
BILINGUAL TRANSLATION SYSTEM
BILINGUAL TRANSLATION SYSTEM (FOR ENGLISH AND TAMIL) Dr. S. Saraswathi Associate Professor M. Anusiya P. Kanivadhana S. Sathiya Abstract--- The project aims in developing Bilingual Translation System for
CENG 734 Advanced Topics in Bioinformatics
CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the
A Comparative Approach to Search Engine Ranking Strategies
26 A Comparative Approach to Search Engine Ranking Strategies Dharminder Singh 1, Ashwani Sethi 2 Guru Gobind Singh Collage of Engineering & Technology Guru Kashi University Talwandi Sabo, Bathinda, Punjab
Monitoring Web Browsing Habits of User Using Web Log Analysis and Role-Based Web Accessing Control. Phudinan Singkhamfu, Parinya Suwanasrikham
Monitoring Web Browsing Habits of User Using Web Log Analysis and Role-Based Web Accessing Control Phudinan Singkhamfu, Parinya Suwanasrikham Chiang Mai University, Thailand 0659 The Asian Conference on
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1
THUTR: A Translation Retrieval System
THUTR: A Translation Retrieval System Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia Sungchul Kim POSTECH Pohang, South Korea [email protected] Abstract In this paper we propose a method to automatically
Building A Smart Academic Advising System Using Association Rule Mining
Building A Smart Academic Advising System Using Association Rule Mining Raed Shatnawi +962795285056 [email protected] Qutaibah Althebyan +962796536277 [email protected] Baraq Ghalib & Mohammed
An Iterative approach to extract dictionaries from Wikipedia for under-resourced languages
An Iterative approach to extract dictionaries from Wikipedia for under-resourced languages Rohit Bharadwaj G SIEL, LTRC IIIT Hyd [email protected] Niket Tandon Databases and Information Systems
AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS
AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS Alok Ranjan Pal 1, 3, Anirban Kundu 2, 3, Abhay Singh 1, Raj Shekhar 1, Kunal Sinha 1 1 College of Engineering and Management,
SMSFR: SMS-Based FAQ Retrieval System
SMSFR: SMS-Based FAQ Retrieval System Partha Pakray, 1 Santanu Pal, 1 Soujanya Poria, 1 Sivaji Bandyopadhyay, 1 Alexander Gelbukh 2 1 Computer Science and Engineering Department, Jadavpur University, Kolkata,
Mining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet
CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,
Customizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
Getting Off to a Good Start: Best Practices for Terminology
Getting Off to a Good Start: Best Practices for Terminology Technologies for term bases, term extraction and term checks Angelika Zerfass, [email protected] Tools in the Terminology Life Cycle Extraction
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
Protein-protein Interaction Passage Extraction Using the Interaction Pattern Kernel Approach for the BioCreative 2015 BioC Track
Protein-protein Interaction Passage Extraction Using the Interaction Pattern Kernel Approach for the BioCreative 2015 BioC Track Yung-Chun Chang 1,2, Yu-Chen Su 3, Chun-Han Chu 1, Chien Chin Chen 2 and
ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH
Journal of Computer Science 9 (7): 922-927, 2013 ISSN: 1549-3636 2013 doi:10.3844/jcssp.2013.922.927 Published Online 9 (7) 2013 (http://www.thescipub.com/jcs.toc) ARABIC PERSON NAMES RECOGNITION BY USING
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Artificial Neural Network and Non-Linear Regression: A Comparative Study
International Journal of Scientific and Research Publications, Volume 2, Issue 12, December 2012 1 Artificial Neural Network and Non-Linear Regression: A Comparative Study Shraddha Srivastava 1, *, K.C.
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5
Overview of MT techniques. Malek Boualem (FT)
Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,
Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment
2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. [email protected]
End-to-End Sentiment Analysis of Twitter Data
End-to-End Sentiment Analysis of Twitter Data Apoor v Agarwal 1 Jasneet Singh Sabharwal 2 (1) Columbia University, NY, U.S.A. (2) Guru Gobind Singh Indraprastha University, New Delhi, India [email protected],
A Statistical Text Mining Method for Patent Analysis
A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, [email protected] Abstract Most text data from diverse document databases are unsuitable for analytical
Sentiment analysis: towards a tool for analysing real-time students feedback
Sentiment analysis: towards a tool for analysing real-time students feedback Nabeela Altrabsheh Email: [email protected] Mihaela Cocea Email: [email protected] Sanaz Fallahkhair Email:
A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students
69 A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students Sarathorn Munpru, Srinakharinwirot University, Thailand Pornpol Wuttikrikunlaya, Srinakharinwirot University,
Prototype software framework for causal text mining
Prototype software framework for causal text mining L.A. de Vries School of Management & Governance University of Twente The Netherlands +31 (0) 648951497 [email protected] ABSTRACT Consulting
International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET
DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF- SPEECH TAGGING AND STEMMER ASSISTED TRANSLITERATION
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF- SPEECH TAGGING AND STEMMER ASSISTED TRANSLITERATION Juhi Ameta 1, Nisheeth Joshi 2 and Iti Mathur 3 1 Department of Computer
Thirukkural - A Text-to-Speech Synthesis System
Thirukkural - A Text-to-Speech Synthesis System G. L. Jayavardhana Rama, A. G. Ramakrishnan, M Vijay Venkatesh, R. Murali Shankar Department of Electrical Engg, Indian Institute of Science, Bangalore 560012,
Automated Content Analysis of Discussion Transcripts
Automated Content Analysis of Discussion Transcripts Vitomir Kovanović [email protected] Dragan Gašević [email protected] School of Informatics, University of Edinburgh Edinburgh, United Kingdom [email protected]
Domain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
Mother Tongue Influence on Spoken English
Mother Tongue Influence on Spoken English Shruti Pal Central Institute of Education (India) [email protected] Abstract Pronunciation is not a major problem in a language classroom until it hinders
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
Ontology construction on a cloud computing platform
Ontology construction on a cloud computing platform Exposé for a Bachelor's thesis in Computer science - Knowledge management in bioinformatics Tobias Heintz 1 Motivation 1.1 Introduction PhenomicDB is
PULLING OUT OPINION TARGETS AND OPINION WORDS FROM REVIEWS BASED ON THE WORD ALIGNMENT MODEL AND USING TOPICAL WORD TRIGGER MODEL
Journal homepage: www.mjret.in ISSN:2348-6953 PULLING OUT OPINION TARGETS AND OPINION WORDS FROM REVIEWS BASED ON THE WORD ALIGNMENT MODEL AND USING TOPICAL WORD TRIGGER MODEL Utkarsha Vibhute, Prof. Soumitra
How To Write A Summary Of A Review
PRODUCT REVIEW RANKING SUMMARIZATION N.P.Vadivukkarasi, Research Scholar, Department of Computer Science, Kongu Arts and Science College, Erode. Dr. B. Jayanthi M.C.A., M.Phil., Ph.D., Associate Professor,
B-bleaching: Agile Overtraining Avoidance in the WiSARD Weightless Neural Classifier
B-bleaching: Agile Overtraining Avoidance in the WiSARD Weightless Neural Classifier Danilo S. Carvalho 1,HugoC.C.Carneiro 1,FelipeM.G.França 1, Priscila M. V. Lima 2 1- Universidade Federal do Rio de
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India
Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection
Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection Jian Qu, Nguyen Le Minh, Akira Shimazu School of Information Science, JAIST Ishikawa, Japan 923-1292
An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System
An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System Thiruumeni P G, Anand Kumar M Computational Engineering & Networking, Amrita Vishwa Vidyapeetham, Coimbatore,
A Comparative Analysis of Speech Recognition Platforms
Communications of the IIMA Volume 9 Issue 3 Article 2 2009 A Comparative Analysis of Speech Recognition Platforms Ore A. Iona College Follow this and additional works at: http://scholarworks.lib.csusb.edu/ciima
American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access
QUT Digital Repository: http://eprints.qut.edu.au/
QUT Digital Repository: http://eprints.qut.edu.au/ Lu, Chengye and Xu, Yue and Geva, Shlomo (2008) Web-Based Query Translation for English-Chinese CLIR. Computational Linguistics and Chinese Language Processing
POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition
POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics
Firewall Verification and Redundancy Checking are Equivalent
Firewall Verification and Redundancy Checking are Equivalent H. B. Acharya University of Texas at Austin [email protected] M. G. Gouda National Science Foundation University of Texas at Austin [email protected]
CS 6740 / INFO 6300. Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage
CS 6740 / INFO 6300 Advanced d Language Technologies Graduate-level introduction to technologies for the computational treatment of information in humanlanguage form, covering natural-language processing
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
Taxonomies in Practice Welcome to the second decade of online taxonomy construction
Building a Taxonomy for Auto-classification by Wendi Pohs EDITOR S SUMMARY Taxonomies have expanded from browsing aids to the foundation for automatic classification. Early auto-classification methods
Movie Classification Using k-means and Hierarchical Clustering
Movie Classification Using k-means and Hierarchical Clustering An analysis of clustering algorithms on movie scripts Dharak Shah DA-IICT, Gandhinagar Gujarat, India [email protected] Saheb Motiani
Learning Translation Rules from Bilingual English Filipino Corpus
Proceedings of PACLIC 19, the 19 th Asia-Pacific Conference on Language, Information and Computation. Learning Translation s from Bilingual English Filipino Corpus Michelle Wendy Tan, Raymond Joseph Ang,
KNOWLEDGE-BASED IN MEDICAL DECISION SUPPORT SYSTEM BASED ON SUBJECTIVE INTELLIGENCE
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 22/2013, ISSN 1642-6037 medical diagnosis, ontology, subjective intelligence, reasoning, fuzzy rules Hamido FUJITA 1 KNOWLEDGE-BASED IN MEDICAL DECISION
TEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE
TEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE Sangam P. Borkar M.E. (Electronics)Dissertation Guided by Prof. S. P. Patil Head of Electronics Department Rajarambapu Institute of Technology Sakharale,
TREC 2003 Question Answering Track at CAS-ICT
TREC 2003 Question Answering Track at CAS-ICT Yi Chang, Hongbo Xu, Shuo Bai Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China [email protected] http://www.ict.ac.cn/
Language and Computation
Language and Computation week 13, Thursday, April 24 Tamás Biró Yale University [email protected] http://www.birot.hu/courses/2014-lc/ Tamás Biró, Yale U., Language and Computation p. 1 Practical matters
An Iterative Link-based Method for Parallel Web Page Mining
An Iterative Link-based Method for Parallel Web Page Mining Le Liu 1, Yu Hong 1, Jun Lu 2, Jun Lang 2, Heng Ji 3, Jianmin Yao 1 1 School of Computer Science & Technology, Soochow University, Suzhou, 215006,
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing
A Rule-Based Short Query Intent Identification System
A Rule-Based Short Query Intent Identification System Arijit De 1, Sunil Kumar Kopparapu 2 TCS Innovation Labs-Mumbai Tata Consultancy Services Pokhran Road No. 2, Thane West, Maharashtra 461, India 1
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India [email protected]
