Towards Unsupervised Word Error Correction in Textual Big Data
Joao Paulo Carvalho and Sérgio Curto
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, Portugal
[email protected], [email protected]

Keywords: Fuzzy Text Preprocessing, Medical text reports, Natural Language Processing, Word similarity, MIMIC II.

Abstract: Large unedited technical textual databases might contain information that cannot be properly extracted using Natural Language Processing (NLP) tools due to the many word errors they contain. A good example is the MIMIC II database, where medical text reports are a direct representation of experts' views on real-time observable data. Such reports contain valuable information that could improve predictive medical decision-making models based on physiological data, but they have never been used with that goal so far. In this paper we propose a fuzzy-based semi-automatic method to specifically address the large number of word errors contained in such databases, which will allow the direct application of NLP techniques, such as Bag-of-Words, to the textual data.

1 INTRODUCTION

Since the invention of written language, textual information contained in documents has been the most commonly used form of expressing human knowledge. As such, textual information should be an important source for automatic knowledge representation. When the information contained in the texts has been properly edited, Natural Language Processing (NLP) tools can be used (with more or less success) to process it. However, in the case of unedited texts, such tools might not be reliable, since one of the most common NLP approaches consists in the use of the so-called Bag-of-Words model. This model essentially relies on word counts to extract information. Therefore, any word error, whether resulting from a typo, a wrong transcription, or a cultural error, results in a model error (a misclassification, a miscount, etc.), which can have more or less serious consequences for knowledge representation.

In the present age of Big Data, this problem becomes very relevant, as exemplified by the MIMIC II database, a very large database of ICU patients admitted to the Beth Israel Deaconess Medical Center, which will be used here as a case study. The MIMIC II database includes both physiological and text data; however, the information contained in the physicians' and nurses' text notes has never been used in any of the several existing models based on the database (Cismondi et al., 2012; Fialho et al., 2012; 2013), despite the fact that such notes contain rich information that is a direct representation of experts' views on the real-time observable data. This textual information has not been used before because of its size and the particularities of the documents:

- The reports are not structured as typical written text: sentences are short, have many abbreviations and a reduced number of function words, and most of the words are specific and relevant within the context;
- The reports have a large number of medical technical terms and specific technical abbreviations;
- There are many numerical values associated with physiological variable readings;
- There are many different ways of expressing/representing the same information,
  e.g., dates (6/23/14; June, ...; etc.), time (10:14PM; 22:04; 2204, etc.), etc.;
- The text contains a huge number of typographical and other word errors due to the way the texts were collected (real-time transcriptions from recordings, poor Optical Character Recognition of manuscripts, etc.);
- The text contains many other artifacts, such as misplaced control characters that break sentences into paragraphs, escape sequences, etc.
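Since the end goal is Bag-of-Words style processing (see the Introduction above), a toy example helps make the impact of such word errors concrete: in a plain word-count representation, every misspelling becomes a separate feature and the count of the intended term is silently reduced. The snippet below is only an illustration; the note fragments are invented.

```python
from collections import Counter

# Toy Bag-of-Words: plain word counts over three (invented) note fragments.
notes = [
    "abdomen soft and non tender",
    "abdomin soft, non tender",    # misspelled variant of "abdomen"
    "abdaomen distended",          # another misspelled variant
]

tokens = [w.strip(",.").lower() for note in notes for w in note.split()]
bag = Counter(tokens)

print(bag["abdomen"])   # 1 -> the intended term is under-counted
print(bag["abdomin"])   # 1 -> each misspelling becomes a spurious feature
print(bag["abdaomen"])  # 1
```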
These particularities prevent the effective use of common Natural Language Processing (NLP) techniques and hinder their use as automatic information providers, as shown in (Carvalho and Curto, 2014). Here we will focus on improving the process of automatic word error detection and correction in such databases.

Error detection and correction is something that everyone has become familiar with since the advent of word processors. Nowadays, most texts written on computers, tablets or smartphones are automatically corrected in real time (i.e., as they are being generated). However, everyone knows that, despite the large recent advances, not all corrections are proper, especially when less common words (e.g. technical terms, named entities, etc.) are being used. Also, most word correction tools are limited to one or two errors per word. The capability of humans to adapt very fast to new situations allows them to detect most unwanted corrections as they are proposed, and therefore to react immediately. So, the problem of word error detection and consequent correction is basically non-existent when performed in real time (and as long as the vocabulary used is well known). However, if the texts have not been properly corrected while they were created, then we are facing a complex and expensive task that must usually be done manually or, even if done automatically, demands significant human intervention. This is especially relevant in unedited technical text. In the case of Big Data text databases, this task must somehow be automated, since the size of the database would make the cost of manual offline text editing unbearably expensive.

In this paper we propose a fuzzy-based semi-automatic method to specifically address the large number of word errors contained in Big Data unedited textual data, focusing specifically on the MIMIC II database.

2 THE MIMIC II DATABASE

The developed work uses data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC II) database (Saeed, 2002). This is a large database of ICU patients admitted to the Beth Israel Deaconess Medical Center, collected from 2001 to 2006, and de-identified by removal of all Protected Health Information. The MIMIC II database currently comprises 26,655 patients, of which 19,075 are adults (>15 years old at time of admission). It includes high-frequency sampled data from bedside monitors, clinical data (laboratory tests, physicians' and nurses' notes, imaging reports, medications and other input/output events related to the patient) and demographic data. From the available data, and for this particular problem, we are mainly interested in the physicians' and nurses' notes.

The MIMIC II text database contains a total of 156 million words with 3 or more characters, and distinct words. Of these distinct words, only (12%) appear in known word lists: appear on the SIL list of English known words (which contains distinct words) (SIL, 2014), and 429 appear on additional lists containing medical terms not common in English. The remaining words are simply unknown to dictionaries, and most are the result of typing or cultural errors!
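The coverage figures above amount to checking the corpus vocabulary against the known word lists. A minimal sketch of such a check follows; the file names are placeholders, and the tokenization (alphabetic tokens with 3 or more characters) only approximates the preprocessing described later in the paper.

```python
import re
from collections import Counter

def load_word_list(path: str) -> set:
    # One word per line, e.g. the SIL English list or a medical terms list.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def vocabulary_coverage(corpus_text: str, known_words: set):
    # Alphabetic tokens with 3 or more characters, counted over the corpus.
    counts = Counter(re.findall(r"[A-Za-z]{3,}", corpus_text.lower()))
    known = {w for w in counts if w in known_words}
    unknown = set(counts) - known
    return counts, known, unknown

# Example usage (paths are placeholders):
# kwl = load_word_list("sil_english.txt") | load_word_list("medical_terms.txt")
# counts, known, unknown = vocabulary_coverage(open("notes.txt").read(), kwl)
# print(len(known) / len(counts))  # fraction of distinct words found in known lists
```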
As an example of the extent of such errors, here is a non-exhaustive list of the different misspelled variants of the word "abdomen" found in the MIMIC II database: abadomen, abdaomen, abndomen, badomen, abdaomen, abdeomen, abdcomen, abdemon, abdeom, abdoem, abdmoen, abdemon, abdiomen, abdman, abdmen, abdme, abddmen, abbomen, abdmn, abdme, abdmonen, abdonem, abdoben, abdodmen, abdoemen, abdoem, abdoem, abdomin. It should be noted that these errors are not isolated; e.g., the incorrect form "abdomin" appears 1968 times in the database.

3 RELATED WORK

A typographical error, colloquially known as a typo, is a mistake made in the typing process. Most typographical errors consist of the substitution, transposition, duplication or omission of a small number of characters. Damerau (1964) considered that a simple error consists of only one of these operations. Nowadays, many other types of errors can be found in text databases:

- Errors associated with smaller keyboards, which became relevant due to their effect on the number of word typos;
- Errors due to the widespread use of blogs, microblogs, instant messaging, etc.;
- Errors associated with real-time voice transcriptions;
- Errors associated with poor Optical Character Recognition when digitizing manuscripts;
- etc.

One must also mention linguistic errors, which are mostly due to
lack of culture and/or education, and are usually the result of phonetic similarities.

As described previously, automatic word error correction is an expensive task when performed offline, and there is currently no reliable automatic solution for it. The best-performing methods are those that aim to find the word that is most likely to be correct in a given context. Such methods are based on probability theory, essentially estimating which word is most likely to follow the previous one (Jurafsky and Martin, 2009). However, they demand many resources and are not the most adequate for texts containing many technical terms, many errors, and a very large vocabulary.

Independently of the degree of automation, any word correction tool depends on a good similarity function to find the most likely correct word. Current research on string similarity offers a panoply of measures that can be used in this context, such as the ones based on edit distances (Levenshtein, 1966) or on the length of the longest common subsequence of the strings. However, most of the existing measures have their own drawbacks. For instance, some do not take into consideration linguistically driven misspellings, others ignore the phonetics of the string or the mistakes resulting from the input device. Moreover, the majority of the existing measures do not have strong discriminative power, and therefore it is difficult to evaluate whether a proposed suggestion is reasonable or not, which is a core issue in unsupervised spelling correction. Here we will use the Fuzzy Uke Word Similarity (FUWS) (Carvalho, Carola and Tomé, 2006a; 2006b; Carvalho and Coheur, 2013). This word similarity function combines the most interesting characteristics of the two main philosophies in word and string matching and, by integrating specific fuzzy-based expert knowledge concerning typographical errors, can achieve good discrimination.

4 TOWARDS AUTOMATIC WORD CORRECTION IN TEXTUAL BIG DATA

Given the above considerations concerning the extent of the word errors present in the MIMIC II text database, and the impact of such errors on any kind of text analysis, we propose a semi-automatic procedure to detect and correct typographical and other word errors in the MIMIC II text corpus. The procedure improves and details the approach presented in (Carvalho and Curto, 2014), and can be extended to other Big Data textual databases given the appropriate resources.

4.1 Resources: Known Words List (KWL)

The proposed procedure needs an extensive known-words list and one or more technical word lists related to the subject of the textual database. We assume that, despite our best efforts, the technical word lists might not be complete. The ordered set of all words contained in these lists forms what we refer to as the known-words list (KWL). It is also necessary to use a proper word similarity function, preferably a fast one (due to the size of the target databases) that achieves good discrimination, i.e., has both good precision and good recall, so that both false positives and false negatives are minimized.

4.2 Automatic Word Correction Steps

4.2.1 Corpus Word List (CWL)

The first step consists of creating a list containing all the different words present in the corpus, counting the frequency of each word, and ordering the list. Words with fewer than 3 characters or more than 15 characters, and/or containing numerals, are removed. The removal of short words is due to the difficulty (or impossibility) of properly detecting and correcting such words.
This is not an important issue since, in NLP, such words are often dismissed as they usually contain more noise than useful information. Words with more than 15 characters are usually codes, concatenated words, or sometimes chemical compounds, etc., and cannot or should not be corrected. Words containing numerals are removed since they usually consist of tokens that, as before, should not (and cannot) be corrected.

4.2.2 High Threshold Filtering

In the second step we look for very close matches between the CWL and the KWL in order to filter minor typos and to aggregate occurrences of very similar words. This is accomplished by selecting a very high threshold when testing for word similarity. When using the FUWS, the similarity threshold
should be above 0.9; note that in (Carvalho and Coheur, 2013), 0.67 is proposed as the standard similarity threshold. Here we want to guarantee that no false positives are generated, hence the much higher threshold. It should be noted that this value will have more impact on errors occurring in longer words (more than 8 characters) than in shorter ones. After ordering the resulting list by word frequency, we obtain what we refer to as the filtered corpus word list (f-cwl). The f-cwl contains both known and unknown words and will be used as the corpus in the remaining procedure.

4.2.3 High Frequency (HFL) and Low Frequency (LFL) Word Lists

The next step consists in generating two different word lists based on the frequency of the words in the f-cwl: the High Frequency Word List (HFL) and the Low Frequency Word List (LFL).

The HFL will contain all the words in the f-cwl whose frequency is higher than a given high threshold ht, and will be used as the known-words list in the final word correction step. The reasoning behind the creation of the HFL is based on the assumption that words that occur very frequently in the f-cwl should no longer contain errors (since common errors have been filtered in the previous step). As such, very frequent words in the f-cwl that are not present in the KWL should be considered new words rather than word errors, and should be used to correct errors that occur less frequently (this is consistent with the previous assumption that the available technical term lists are probably not complete).

The LFL will consist of the unknown words of the f-cwl whose frequency is lower than a given low threshold lt. It is important to note that words present in the KWL are removed from the LFL. The LFL will contain the words that should eventually be corrected in the final word correction step. The LFL should consist mainly of:
a) Very specific technical words;
b) Unknown abbreviations;
c) Unknown named entities (either individuals or organizations) and special non-numerical codes;
d) Words containing typing and/or other errors;
e) Tokens formed by an undue lack of spacing between proper words.

Of these five cases, only the word errors should be corrected. Note that errors occurring in some unknown named entities or in some abbreviations might also be corrected, as long as the correct form is present in the HFL. The definition of the ht and lt values is obviously an important issue. Ideally they should be expressed as a percentage of the number of distinct words in the f-cwl, or of the joint word occurrence. Up to now the thresholds have been found empirically, and there is no indication yet as to whether they can be generalized to other datasets.

4.2.4 LFL Correction

The final word correction step consists in attempting to correct the words in the LFL, using the HFL as the known-words list. Common sense would dictate using the normal word similarity threshold (in the case of the FUWS, the value would be 0.67). However, as will be shown in the case study, the best results were obtained using a value about 10% lower.

4.3 Correction Results

After the application of the previous steps we obtain a word list that is used to replace the appropriate occurrences in the original textual database. As discussed in 4.2.3, not all unknown words are expected to (or should) be replaced, only the ones resulting from typing or other word errors.

5 CASE STUDY AND RESULTS: MIMIC II DATABASE WORD ERROR CORRECTION

In this section we apply the previously presented procedure to the MIMIC II textual database.
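Before going through the case-study figures, the first two steps of the procedure (Sections 4.2.1 and 4.2.2) can be summarized by the sketch below. It is only a rough reference implementation: the `similarity` argument stands in for FUWS (which is not reproduced here), and the naive scan over the KWL would need indexing to be practical at this scale.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

def build_cwl(text: str) -> Counter:
    """Step 4.2.1: corpus word list. Keeps purely alphabetic words of 3 to 15
    characters; tokens with numerals, or too short/long, are dropped."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(t for t in tokens if 3 <= len(t) <= 15)

def high_threshold_filter(cwl: Counter, kwl: set, similarity, threshold: float = 0.9) -> Counter:
    """Step 4.2.2: merge a corpus word into a known word only when the match is
    almost exact, producing the f-cwl."""
    f_cwl = Counter()
    for word, count in cwl.items():
        if word in kwl:
            f_cwl[word] += count
            continue
        best, best_sim = None, 0.0
        for known in kwl:  # naive scan; index by length/prefix in practice
            s = similarity(word, known)
            if s > best_sim:
                best, best_sim = known, s
        f_cwl[best if best_sim >= threshold else word] += count
    return f_cwl

# Example with difflib's ratio as the stand-in similarity:
# sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
# f_cwl = high_threshold_filter(build_cwl(open("notes.txt").read()), kwl, sim)
```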
To build the KWL we used the SIL English word list (SIL, 2014) and three medical term lists publicly available online (mtherald, 2014; Heymans, 2014; e-medtools, 2014). As a word similarity function we used the above-mentioned FUWS, since it is fast, combines the most interesting characteristics of the two main philosophies in word and string matching and, by integrating specific fuzzy-based expert knowledge concerning typographical errors, can achieve the intended good discrimination.
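FUWS itself is defined in (Carvalho and Coheur, 2013) and is not reproduced here. For readers who want a runnable stand-in, a normalized Damerau-Levenshtein similarity such as the one sketched below covers the simple typo operations mentioned in Section 3 (substitutions, insertions, deletions and adjacent transpositions); its scores and discriminative power are, of course, not those of FUWS.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    substitutions, insertions, deletions and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical words, lower as edits accumulate."""
    if not a and not b:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / max(len(a), len(b))

print(similarity("abdomen", "abdomin"))  # one substitution  -> ~0.857
print(similarity("abdomen", "abdmoen"))  # one transposition -> ~0.857
```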
The original database contains distinct words. After executing Step 1, the resulting CWL contains distinct words (corresponding to a joint occurrence of words). For the high threshold filtering operation, an empirical FUWS threshold of was chosen after several tests. This operation affected distinct words, which ended up being combined into 7805 distinct words. The number of errors is estimated (by sampling the results) to be much lower than 1%. Note that this reduction in the number of distinct words is quite significant if one considers that, in English, the vocabulary necessary to properly understand academic textbooks ranges from 5000 to words; we managed to reduce a similar number of words by using the high threshold filtering operation alone. After this step, the f-cwl size is .

In order to create the HFL and the LFL, and to apply the LFL correction, it was necessary to define the high and low thresholds, and also to choose the FUWS threshold. Several empirical tests were performed in order to find appropriate values. Even if the numbers are not yet fully optimized, words occurring more than 1400 times in the f-cwl appear to be good correction candidates for words occurring fewer than 800 times when using a FUWS threshold of 0.6. Therefore we are currently using lt=800 and ht=1400. The HFL consists of 6153 distinct words (with a joint occurrence of words), i.e., the HFL contains only the top 2.43% most frequent words, and yet covers 96.73% of the total number of words. The LFL has distinct words (79%), corresponding to a total of only words of the database (0.8%).

The correction of the LFL using the HFL and a FUWS threshold of 0.6 resulted in the correction of distinct words (out of the ), reducing them to 5920 distinct words. That is, the automatic procedure found different words containing typing errors that corresponded to only 5920 words. Those words have a joint occurrence of 59% of the LFL. As expected, the uncorrected words fall mainly into the categories indicated in Section 4.2.3. However, not all the proposed corrections are correct. An estimation based on sampling the 25% most frequent corrections indicates the number of false positives to be around 5%. Overall this results in a very low number of errors when compared to the number of different words and joint occurrences. In the end, only 0.02% of the MIMIC II words are estimated to be incorrectly replaced, and only 0.48% are left uncorrected. Most of the unknown words that were not proposed for correction correspond to cases that should not be replaced or are indeed very difficult to correct without additional lengthy preprocessing.
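As a rough illustration of how the HFL/LFL split and the LFL correction (Sections 4.2.3 and 4.2.4) fit together with the reported values (ht=1400, lt=800, similarity threshold 0.6), here is a minimal sketch. The `similarity` argument again stands in for FUWS (difflib's ratio is used in the toy example), and the word counts are invented, so the output is only indicative.

```python
from collections import Counter
from difflib import SequenceMatcher

def split_and_correct(f_cwl: Counter, kwl: set, similarity,
                      ht: int = 1400, lt: int = 800, sim_threshold: float = 0.6):
    """Build the HFL and LFL from the f-cwl and propose corrections for LFL words.
    The default thresholds are the empirical MIMIC II values and are not claimed
    to generalize to other corpora."""
    hfl = [w for w, c in f_cwl.items() if c > ht]                   # trusted, very frequent words
    lfl = [w for w, c in f_cwl.items() if c < lt and w not in kwl]  # rare unknown words to correct

    corrections = {}
    for word in lfl:
        best, best_sim = None, 0.0
        for candidate in hfl:
            s = similarity(word, candidate)
            if s > best_sim:
                best, best_sim = candidate, s
        if best is not None and best_sim >= sim_threshold:
            corrections[word] = best
    return hfl, lfl, corrections

# Toy example with difflib's ratio standing in for FUWS:
def sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

f_cwl = Counter({"abdomen": 50000, "gtts": 2000, "abdomin": 700, "stg": 200})
hfl, lfl, fixes = split_and_correct(f_cwl, kwl={"abdomen"}, similarity=sim)
print(fixes)  # {'abdomin': 'abdomen'} with this stand-in similarity
```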
Following are some examples of the observed errors, using the format "unknown word → proposed correction (# occurrences) (FUWS similarity); comments":

Abbreviations:
- stg → gtts (209) (0.7500); stg may mean superior temporal gyrus
- ptsd → ptbd (317) (0.7500); ptsd may mean post-traumatic stress disorder

Prefix variation not present in the known words:
- untolerated → tolerated (1) (0.7955); incorrect use of the prefix un-; the proper correction would be "not tolerated"
- noincreased → increased (2) (0.7500); the correct substitution would be "not increased"

Term aliasing due to the lack of spacing between words:
- remainslow → remains (1) (0.6750); should be "remains low"
- withinthe → within (5) (0.6389); should be "within the"
- parentsup → parents (1) (0.7500); should be "parent support"

Correct word is not the most similar:
- weerk → were (1) (0.8125); the correct substitution is probably "week", but it only has a 0.7 similarity

Other special cases:
- gmother → mother (14) (0.8214); should be "grandmother"
- anormal → normal (12) (0.8214); the correct substitution would be "abnormal"

6 CONCLUSIONS AND FUTURE WORK

In this paper we propose and describe a novel semi-automatic procedure to detect and correct errors in
unedited textual Big Data, based on a fuzzy word similarity function. The procedure is being applied to the MIMIC II database with very encouraging results.

The largest obstacle to the automation, generalization and application of the method to other databases is its parameterization. Up to now, the definition of the three thresholds used in the proposed procedure (lt, ht and the FUWS threshold) has been made empirically. A fully automatic procedure would be achieved if the values found so far could be directly applied to other databases. However, that is not necessarily a likely situation, and we will not have an answer until the procedure is tested on other textual Big Data. Another option towards a more automated process consists in using the obtained values as a starting point for an optimization procedure. Since only three parameters (and most likely only two) are involved, evolutionary algorithms could certainly be used towards this goal.

ACKNOWLEDGEMENTS

This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under project PTDC/EMS-SIS/3220/2012 and project PEst-OE/EEI/LA0021/2013.

REFERENCES

Carvalho, J.P., Carola, M., Tomé, J.A., 2006a. Using rule-based fuzzy cognitive maps to model dynamic cell behavior in Voronoi based cellular automata. Proc. of the 2006 IEEE International Conference on Fuzzy Systems, Vancouver, Canada.

Carvalho, J.P., Carola, M., Tomé, J.A., 2006b. Forest Fire Modelling using Rule-Based Fuzzy Cognitive Maps and Voronoi Based Cellular Automata. Proc. of the 25th International Conference of the North American Fuzzy Information Processing Society, NAFIPS 2006, Montreal, Canada.

Carvalho, J.P., Coheur, L., 2013. Introducing UWS - A Fuzzy Based Word Similarity Function with Good Discrimination Capability: Preliminary results. Proc. of FUZZ-IEEE 2013, Hyderabad, India.

Carvalho, J.P., Curto, S., 2014. Fuzzy Preprocessing of Medical Text Annotations of Intensive Care Units Patients. Proc. of the IEEE 2014 Conference on Norbert Wiener in the 21st Century, Boston, USA.

Cismondi, F., Horn, A.L., Fialho, A.S., Vieira, S.M., Reti, S.R., Sousa, J.M., Finkelstein, S., 2012. Fuzzy multi-criteria decision making to improve survival prediction of ICU septic shock patients. Expert Systems with Applications, vol. 39, no. 16.

Damerau, F.J., 1964. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, March 1964.

e-medtools, 2014. Last accessed May 2014.

Fialho, A.S., Cismondi, F., Vieira, S.M., Reti, S.R., Sousa, J.M.C., Finkelstein, S.N., 2012. Data mining using clinical physiology at discharge to predict ICU readmissions. Expert Systems with Applications, vol. 39, no. 18, December 2012.

Fialho, A.S., Kaymak, U., Cismondi, F., Vieira, S.M., Reti, S.R., Sousa, J.M.C., Finkelstein, S.N., 2013. Predicting intensive care unit readmissions using probabilistic fuzzy systems. Proc. of FUZZ-IEEE 2013, Hyderabad, India.

Heymans, 2014. Last accessed May 2014.

Jurafsky, D., Martin, J., 2009. Speech and Language Processing - An introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition, Prentice-Hall.

Levenshtein, V., 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10.

mtherald, 2014. Last accessed May 2014.

Saeed, M., Lieu, C., Mark, R., 2002. MIMIC II: a massive temporal ICU database to support research in intelligent patient monitoring. Computers in Cardiology, vol. 29.

SIL, 2014. Last accessed February 2014.
