
Use of Text Miner to Automatically Abstract Patient Information from the Pharmacy Order Database

Patricia B. Cerrito 1,2, Antonio Badia 3, John C. Cerrito 4, James Cox 5
1 Department of Mathematics, University of Louisville, 2 Jewish Hospital Center for Advanced Medicine, 3 Department of Computer Science and Computer Engineering, University of Louisville, 4 Professional Pharmacy Consultant, 5 SAS Institute, Inc.

ABSTRACT

Objective: To demonstrate how text mining can be used to extract patient diagnosis information from a pharmacy order database when treatments are ordered without a corresponding diagnosis recorded in the patient chart, thereby increasing the accuracy of data extracted for billing and external reporting.

Method: The medication orders for patients undergoing open heart surgery were examined. All medication orders for one patient were compressed into one text string. The strings were clustered and the results were compared to the diagnoses recorded by manual extraction in a clinical outcomes database.

Results: Out of 1495 patients in the database for open-heart surgery, 122 patients had diabetes medications; 8 (7%) of these were not identified as diabetic in the clinical database. There were 349 patients identified in the outcomes database as having diabetes without any medications.

Conclusion: Automatic extraction of patient information from the pharmacy order database can enhance or replace manual extraction from patient charts.

BACKGROUND

Medical providers have been resistant to the use of electronic records. 2, 3 They still rely on manual extraction from paper files. 4, 5 There is a general lack of interest among healthcare practitioners in converting to electronic medical records. 6, 7 Hospitals have brought computer records into some aspects of their operation, primarily billing and external reporting. The pharmacy generally uses an electronic database to process orders.
However, various databases are purchased as needs become apparent, with no real attempt at integration across the various operational systems (Figure 1). 8 Each of the services described in Figure 1 usually has some type of database support, with the exception of clinical services.

Figure 1. Diagram of possible integration needed for electronic record-keeping (reproduced from T. Benson) 8

INTRODUCTION

A considerable proportion of patient medical records remain in paper form across the practice of healthcare. Extracting information from these files must be done manually, usually from progress notes entered by the physician. If the patient record is long and complex, the probability of a missed risk factor increases. Information concerning patient risk factors and co-morbidities now serves multiple uses. In particular, the information is used for billing purposes, and missed factors can reduce charges, resulting in under-payments to the healthcare provider. The information is also used to create external report cards ranking hospital performance. 1 A hospital that under-reports patient risk factors is at a disadvantage and will rank lower regardless of performance.

One of the most complete databases typically used in medical care is the pharmacy order database. This is one part of the healthcare industry that adopted electronic records fairly early, presumably spurred by the retail industry, which can now boast of nationwide integrated systems. The hospital pharmacy database contains every drug given to every patient in an inpatient setting. This database can be exploited to extract patient information automatically, improving overall accuracy. If a clinical database containing patient outcomes is also available in electronic form, the pharmacy order database can be cross-referenced to the patient record via patient identification numbers.
In order to investigate the relationship between prescribing and diagnosis, the pharmacy order database must be compressed so that the observational unit is the patient rather than the individual order; otherwise, there can be a large number of orders per patient. One way to compress the data is to create one text string containing all medications for a patient. Text mining is then needed to analyze this text string.

The purpose of this paper is to examine one pharmacy database linked to a clinical database containing patient risk factors, complications, and outcomes. The database was restricted to patients undergoing open heart surgery during the year 2001. The database contains information on approximately 1500 patients. The pharmacy order database contained 84,000 different orders for these 1500 patients, or an average of 522. medication orders per patient. Results demonstrate the feasibility of text mining to improve upon or substitute for manual abstraction from patient charts.

Unfortunately, integration across these systems is nearly impossible without scrapping all systems and starting over, or without upgrading to a data warehouse. Hospitals are reluctant to budget the necessary startup costs. Therefore, most hospital records remain on paper with manual extraction. Because of this reluctance, there remains a need to compare and contrast the different databases to supplement manual extraction of data. In the absence of a data warehouse, data integration is possible but more difficult. In this paper, data are merged between patient administration services and diagnostic and therapeutic services to examine the relationships between the databases. At the same time, the redundancy in electronic information, together with the different needs each database serves, means that most of the patient record can be recovered in electronic format without additional extraction. Redundancy also allows results to be audited to determine whether the manual extraction is accurate.
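The compression step described in this paper (one text string holding all medications ordered for a patient) can be illustrated with a short sketch. This is not the authors' routine, which was written in C++; it is a minimal Python illustration under stated assumptions. The function name `compress_orders`, the `(patient_id, medication)` row layout, the separator, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def compress_orders(orders, separator="; "):
    """Collapse (patient_id, medication) order rows into one string per patient.

    The row layout and the separator are assumptions made for illustration.
    """
    meds_by_patient = defaultdict(list)
    for patient_id, medication in orders:
        # Normalize case and whitespace so the same drug always yields the same term.
        meds_by_patient[patient_id].append(medication.strip().lower())
    # One "document" per patient: all medications joined on a single string.
    return {pid: separator.join(meds) for pid, meds in meds_by_patient.items()}


# Made-up order rows; the observational unit changes from order to patient.
orders = [
    ("P001", "Insulin"),
    ("P002", "Cefazolin"),
    ("P001", "Metformin"),
    ("P001", "Aspirin"),
]
documents = compress_orders(orders)
print(documents["P001"])
```

Each value in `documents` plays the role of one text document, so the number of documents equals the number of patients rather than the number of orders.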
Government regulations in the HIPAA (Health Insurance Portability and Accountability Act) may even slow the changeover from paper to electronic format since the government requirements are extremely burdensome but only apply to electronic patient records: 9 HIPAA's Administrative Simplification (AS) provisions are designed to help the healthcare industry reduce administrative costs and combat fraud by standardizing the formats, content, security, and privacy of electronically transmitted healthcare information.

Therefore, there is real incentive to retain paper records in the medical profession. 10 Because much of the information in the patient records is in text format, it is easy to miss information, so inter-rater reliability between extractors remains low. It is known that the quality of manual extraction is generally poor. There are certain potential errors that are well documented: 1

The models are limited by the following factors: 1) cases may have been coded incorrectly or incompletely; 2) the models can only account for risk factors that are coded into the billing data; if a particular risk factor was not coded into the billing data, such as a patient's socioeconomic status and health behavior, then it was not accounted for with these models.

Each patient's records were compressed into a single record with all medications concatenated in one string, with separators between them. The concatenation was implemented as a series of routines in C++, but the programming can also be translated into SAS code. This kind of concatenation is not routinely available in relational databases. SAS Text Miner was used to create a list of words contained within the documents (Figure 2).

Figure 2. Screen Shot of Text Miner Settings Screen

Manual abstraction is distant from the course of care, often occurring days or months after the patient has been discharged. Therefore, the concern for accuracy relates to secondary outcomes such as billing and external reporting. The cost of errors is generally indirect and difficult to quantify, and so it is often ignored. Converting from paper to electronic records should reduce many of the errors that occur with manual abstraction. Nevertheless, since the point of care is a hurried and stressful environment, errors can still occur, although the actual error rate is rarely documented. In particular, electronic medical records do not always accurately reflect the actual medications received by a patient.
11 One study demonstrated a 26% discrepancy rate between an electronic medical record and the pharmacy order database. There was no comparable study between paper files and the pharmacy database. The highest rate of error was for cardiovascular medications. 11 Therefore, there is a need to routinely compare the pharmacy order database to the data extracted manually into the clinical outcomes database. There is also a need to compare the pharmacy orders to the medication information listed in the patient record (which remains in paper format).

METHOD

Text analysis has become more sophisticated than simply looking at word frequencies. Word groups can be identified as one word, numbers can be included in or excluded from the analysis, and words with slightly different endings can be made equivalent. Different weightings of the frequencies can be used. Common words such as "the" or "and" can be eliminated from the analysis. The focus shifts from the most commonly used words to those that have the most discriminating potential. The basics of text analysis are 12
1. Coding: determining the basic unit of analysis, and counting how many times each word appears.
2. Categorizing: creating meaningful categories to which the unit of analysis (for example, terms signifying "cooperation" and terms signifying "competition") can be assigned.
3. Classifying: verifying that the units of analysis can be easily and unambiguously assigned to the appropriate categories.
4. Comparing: comparing the categories in terms of numbers of members in each category.
5. Concluding: drawing theoretical conclusions about the content in its context.

SAS Enterprise Miner now has a text mining component, called Text Miner, that can be used to analyze text information. It can be used both to cluster documents, using either a hierarchical or an expectation maximization procedure, and to categorize documents based upon keywords or groups of keywords contained within a document field.
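The coding and weighting ideas described above can be sketched in a few lines. This is a minimal Python illustration, not the procedure Text Miner uses internally. The stop list, the sample documents, and the choice of inverse document frequency as the weighting are assumptions made here for illustration; the optional `start_list` argument mimics the idea of keeping only terms on a chosen list.

```python
import math
import re
from collections import Counter

# Illustrative stop list only; a real stop list would be far longer.
STOP_WORDS = {"the", "and", "or", "of"}


def tokenize(text, stop_words=STOP_WORDS, start_list=None):
    """Split text into lowercase alphabetic terms, dropping stop words.

    If start_list is given, only terms on that list are kept.
    """
    terms = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]
    if start_list is not None:
        terms = [t for t in terms if t in start_list]
    return terms


def term_weights(documents):
    """Return {doc_id: {term: weight}} with weight = count * log(N / doc_freq).

    Terms that appear in every document get weight 0, so the emphasis moves
    from the most common terms to the most discriminating ones.
    """
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(counts)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    return {
        doc_id: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_counts.items()}
        for doc_id, term_counts in counts.items()
    }


# Made-up patient documents (one medication string per patient).
docs = {
    "P001": "insulin; metformin; aspirin",
    "P002": "cefazolin; aspirin; the vancomycin",
    "P003": "aspirin; insulin; insulin",
}
weights = term_weights(docs)
print(weights["P001"])
```

In this toy data, aspirin occurs in every document and therefore carries no weight, while a term such as metformin, which occurs in only one document, is weighted most heavily.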
The list was alphabetized and medications used to treat diabetes were flagged; all other words were left unflagged. The flagged list was used to cluster the documents using the expectation maximization algorithm (Figure 3). An option to allow nonclustering of specific documents was checked, so that some outlier documents were not assigned to a cluster.

Figure 3. Screen Shot of Text Miner Cluster Settings

The pharmacy order database contains 84,000 records for approximately 1500 patients. In order to create a table suitable for text mining, it was necessary to compress all records with the same patient identification number.

Figure 4. Screen Shot of Text Miner Output

Most of the patient observations were not clustered (1373 total) because they did not contain any of the medications in the flagged list. The remaining 122 observations were grouped into 1 of 6 clusters depending upon the actual medication prescribed to the patients. These patients were all identified as having diabetes. Each of the clusters contained some patients with diabetes. Patients clustered in the groups identified by insulin were IDDM (insulin-dependent diabetes); all others were NIDDM (non-insulin-dependent diabetes).

The clustering mechanism was also used to investigate patterns of physician prescribing. The investigation was restricted to the patients with diabetes. All medications in the word list were used for the analysis. In addition, the number of antibiotics prescribed for patients with diabetes was compared to the number prescribed for all other patients. This was done by creating a Start List containing only antibiotic medications (Text Miner would then ignore all terms not on the Start List). Clusters based upon two antibiotics were separated from clusters based upon only one antibiotic. This was done separately for patients with diabetes and patients without diabetes. The patients identified by the clinical database as having diabetes were included with the 122 identified by the pharmacy order database. Figure 4 demonstrates the output from text mining of the 122 patients to investigate antibiotic orders.

RESULTS

Of the 122 patients in the pharmacy order database taking medications for diabetes, 8 (7%) are not coded in the clinical database as having diabetes. Conversely, 349 of the 1495 patients (23%) are listed in the clinical database as having diabetes without any order for diabetes medication. The 8 indicate underreporting of patients with diabetes when manual extraction from patient records is used. The 349 either have their diabetes under control through diet, or they are not currently undergoing treatment for diabetes.
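The comparison between the two databases amounts to set operations on patient identification numbers. The following is a minimal Python sketch of that cross-reference, not the procedure used in the study. The flagged medication subset, the function names, and the sample patients are hypothetical.

```python
# Illustrative subset of a flagged diabetes medication list (an assumption).
DIABETES_MEDS = {"metformin", "glipizide", "rosiglitazone", "acarbose"}


def flag_patients(documents, flagged_terms, separator=";"):
    """Return the ids of patients whose medication string has any flagged term."""
    flagged = set()
    for patient_id, text in documents.items():
        terms = {t.strip() for t in text.lower().split(separator)}
        if terms & flagged_terms:
            flagged.add(patient_id)
    return flagged


def cross_reference(pharmacy_flagged, clinical_flagged):
    """Split patients by which database identifies them as diabetic."""
    return {
        "both": pharmacy_flagged & clinical_flagged,
        # Medication ordered but no diagnosis coded: possible under-reporting.
        "pharmacy_only": pharmacy_flagged - clinical_flagged,
        # Diagnosis coded but no medication ordered.
        "clinical_only": clinical_flagged - pharmacy_flagged,
    }


# Made-up data: one medication string per patient, plus the clinical coding.
documents = {
    "P001": "aspirin; metformin",
    "P002": "cefazolin; aspirin",
    "P003": "glipizide; heparin",
    "P004": "vancomycin",
}
clinical_diabetes = {"P001", "P004"}

pharmacy_diabetes = flag_patients(documents, DIABETES_MEDS)
result = cross_reference(pharmacy_diabetes, clinical_diabetes)
share_missed = len(result["pharmacy_only"]) / len(pharmacy_diabetes)
print(result, share_missed)
```

The `pharmacy_only` group corresponds to the patients with diabetes medications who were not coded as diabetic, and the `clinical_only` group corresponds to those coded as diabetic without any diabetes medication order.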
It is not known what criteria were used to identify these patients as having diabetes. It was determined that patients with diabetes were more likely to be prescribed two different antibiotics while the remaining population of patients was likely to receive only one. Of the patients with diabetes, 9% were prescribed two or more antibiotics. Overall, only 46% were prescribed two or more antibiotics. For the patients with diabetes, the following clusters were found (5 patients were not clustered) using expectation maximization (Table 1). In the table, CHF represents congestive heart failure and COPD represents Chronic obstructive pulmonary disease.

Table 1. Clusters of the patients with diabetes using Text Miner.

Cluster 1 (frequency 13)
Medications: Abbojec, cefazolin, magnesium hydroxide, acetaminophen, insulin, room, magnesium hydroxide, magnesium, iv, vancomycin, acarbose, albuterol, clonidine, dopamine, mupirocin, sulfate/ipratropium, sodium
Diagnosis patterns: COPD, mild hypertension, severe infection

Cluster 2 (frequency 9)
Medications: Atropine sulfate, atropine, benazepril, cephalexin, metronidazole, sulindac, trazodone, insulin, aspirin, indapamide, torsemide, nitroglycerin, levothyroxine, alprazolam, lisinopril
Diagnosis patterns: CHF, medium hypertension, fluid retention, hypothyroid

Cluster 3 (frequency 16)
Medications: Bacitracin, bromide, citrate, fentanyl, pancuronium, propofol, aspirin, insulin, chloride, spironolactone, potassium chloride, vancomycin, morphine, albuterol, amiodarone
Diagnosis patterns: CHF, arrhythmia

Cluster 4 (frequency 31)
Medications: Acetylcysteine, aspirin, atorvastatin, dinitrate, glargine, glimepiride, hydrochlorothiazide, metoclopramide, lispro, maleate, digoxin, diltiazem, isosorbide, rosiglitazone, bisulfate, clopidogrel, fenofibrate, metformin, mononitrate
Diagnosis patterns: IDDM, CHF, obesity, hyperlipidemia

Cluster 5 (frequency 21)
Medications: Cefotaxime, magnesium, nifedipine, paroxetine, insulin, enoxaparin, heparin, porcine, famotidine, docusate, rosiglitazone, eptifibatide, rofecoxib, acarbose, aspirin, insulin, glipizide, nitroglycerin, atenolol
Diagnosis patterns: Hypertension, depression, osteo-arthritis

Cluster 6 (frequency 11)
Medications: Chlorpropamide, repaglinide, tartrate, metoprolol, hydrox/al, hydrox/simeth, amlodipine, besylate, zolpidem, lorazepam, aspirin, insulin, lansoprazole, docusate, losartan, acetaminophen, oxycodone, glipizide, morphine
Diagnosis patterns: Hypertension

Cluster 7 (frequency 16)
Medications: Atenolol, insulin, potassium, midazolam, acetaminophen, lidocaine, oxycodone, losartan, dopamine, mupirocin, glipizide, mononitrate, eptifibatide, fenofibrate, torsemide, atenolol
Diagnosis patterns: Hypertension, hyperlipidemia

Words such as Abbojec, room, and sulfate appear in Table 1 because the word items were separated by spaces and the text parsing did not pick up word groups.
Insulin appears in all of the groups. This is because all open heart surgery patients with diabetes are put on an insulin drip protocol prior to surgery to help reduce the risk of infection. Only the patients in group 4 are actually IDDM. Note that oxycodone and morphine are in the last two clusters, indicating greater pain severity than in the first four clusters, where the medication of choice is acetaminophen. Note also that the patients in the first three clusters tend to have more severe co-morbidities (CHF, COPD) than patients in the last four clusters (hypertension, hyperlipidemia). The breakdown for CHF in the clinical database is given in Table 2.

Table 2. Number and proportion of patients with CHF in each cluster.

Extraction   1         2         3         4          5         6         7
CHF          3 (43%)   4 (57%)   4 (31%)   8 (35%)    7 (44%)   1 (17%)   3 (25%)
No CHF       4 (57%)   3 (43%)   9 (69%)   15 (65%)   9 (56%)   5 (83%)   9 (75%)

The medications given in clusters 2, 3, and 4 strongly suggest that the diagnosis of CHF is under-reported, since most of the patients in these clusters have been prescribed medication to treat CHF. Similarly, only 1 patient was identified in the clinical database as having a wound infection, yet the patients are receiving multiple antibiotics (Figure 5).

Figure 5. Number of Patients Receiving Antibiotic Treatment (horizontal axis: Number of Different Antibiotics Prescribed, from 1 to 6 or more)

It is not known whether the antibiotics are prescribed because of unreported infections or whether they are prescribed to prevent infections in the patients with diabetes. A partial explanation is that patients are admitted with elevated white blood cell counts (Table 3).

Table 3. Number and proportion of patients in each cluster with elevated white blood counts.

Extraction        1         2         3          4          5          6         7
Elevated WBC      7 (58%)   7 (78%)   11 (85%)   17 (63%)   14 (67%)   7 (70%)   10 (83%)
No Elevated WBC   5 (42%)   2 (22%)   2 (15%)    10 (37%)   7 (33%)    3 (30%)   2 (17%)

An elevated white blood count suggests a non-specific infection, although it can be elevated for other reasons. Interestingly, the group with the most potent antibiotics (cluster 1) actually has the lowest proportion of patients with an elevated white blood count.

DISCUSSION

The pharmacy order database can be used as a check on manual extraction of patient risk factors as indicated by comorbid diagnoses. Conversely, the database can be used to examine the treatment for the co-morbidities to ensure that the patients are receiving optimal care. Therefore, data and text mining can be used to investigate electronic databases of patient information and can improve the overall quality of billing and external reporting. Once the pharmacy order database has been compressed, other investigations can be made (such as the comparisons to white blood cell counts) to determine the extent of under-reporting of patient risk factors when manual extraction is used. Moreover, the pharmacy order database can be combined with information from the various databases outlined in Figure 1.

Another potential use of text mining is for HIPAA compliance. Companies such as Emergint, Inc.
(Louisville, KY) purchase patient information from hospitals, remove all possible patient identifying factors, and sell the HIPAA compliant databases to pharmaceutical companies. Information contained within physician and nursing notes currently cannot be used with these databases because the notes have the potential to contain confidential patient information. Text mining can be used to extract categorical information from the progress notes so that valuable information is not lost to the compliance requirements. Table 4 summarizes the potential analyses using the pharmacy database.

Table 4. Potential Analyses of the Pharmacy Database

Examination of geographic location of physicians to determine if there are geographic variations in prescribing practices for similar patient diagnoses.

Examination of physician specialty to determine if there are variations in prescribing practices.

Creation of lists of medications related to specific diagnoses, flagging the lists in the database to recover a patient base with particular diagnoses that can be targeted with mailers (for example, diabetes and glucose monitors). This can also be used to improve counseling activities and to flag issues of polypharmacy.

Association rules to determine medication combinations that are generally used. For example, how many patients use a statin in combination with an ACE inhibitor?

When new drugs are introduced on the market, the longitudinal shift from an older to a newer drug can be identified to improve inventory stocking.

To determine the average (and total distribution) time that a customer spends on a particular type of medication, whether for an acute or chronic illness, and to determine whether patients discontinue medications for chronic illnesses. Reminders can then be sent to individual customers.

To determine whether changes in insurance carriers result in shifts of medications for chronic

REFERENCES

1. Healthgrades.com. Hospital Report Cards Methodology. Healthgrades.com.
Available at: http://www.healthgrades.com/public/index.cfm?fuseaction=mod&modtype=content&modact=hrc_methodology. Accessed 2001.
2. Anonymous. Medical Assistant. University of Illinois. Available at: http://www.arentfox.com/quickguide/businesslines/healthsht/healthrelatedarticles/ocri002p1.pdf, 2002.
3. Appleby C. Web-o-matic isn't automatic-yet. Internet technology hasn't broken the barrier between doctors & computers. Hospitals & Health Networks. 199;1(22):30-31.
4. Reardon G, Mozaffari E. Use of an automated process to scan and interpret manually coded data forms in pharmacoeconomics research. Value in Health. 2001;4(6):424.
5. Gibson R, Haug P, Horn S. Lessons from evaluating an automated patient severity index. Journal of the American Medical Informatics Association. 1996;3(5):349-35.
6. Loomis G, Ries J, Saywell R, Thakker N. If electronic medical records are so great, why aren't family physicians using them? Journal of Family Practice. 2002;51():636-641.
7. Aaronson J, Murphy-Cullen C, Chop W, Frey R. Electronic medical records: the family practice resident perspective. Family Medicine. 2001;33(2):128-132.
8. Benson T. Why general practitioners use computers and hospital doctors do not-part 2: scalability. BMJ. 2002;325(32):1090-1093.
9. Anonymous. Health Insurance Portability and Accountability Act (HIPAA). Altarum. Available at: http://216.239.53.100/search?q=cache:2ltsw9yfd8ac:www.altarum.org/hsd/factsheets/hipaa1.pdf+define+hipaa&hl=en&ie=utf-8.
10. Waring N. To what extent are practices 'paperless' and what are the constraints to them becoming more so? British Journal of General Practice. 2000;50(450):46-4.
11. Ernst ME, Brown GL, Klepser TB, Kelly MW. Medication discrepancies in an outpatient electronic medical record. American Journal of Health-System Pharmacy. 2001;58(21):202-205.
12. Martens BVdV. IST 501: Research Techniques for Information Management. Available at: http://web.syr.edu/~bvmarten/index.html. Accessed 2002.

ACKNOWLEDGMENTS

The authors would like to acknowledge the support of the Jewish Hospital Center for Advanced Medicine, 200 Abraham Flexner Way, Louisville, KY 40202.

CONTACT INFORMATION

Patricia B. Cerrito 1,2, Antonio Badia 3, John Cerrito 4, James Cox 5
1 Department of Mathematics
2 Jewish Hospital Center for Advanced Medicine
3 Department of Computer Science and Computer Engineering
4 Professional Pharmacy Consulting
5 SAS Institute
University of Louisville
Louisville, KY 40292
Work Phone: 502-560-8534
Fax: 502-852-132
Email: pcerrito@louisville.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.