Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Jan Paralic, Peter Smatana
Technical University of Kosice, Slovakia
Center for Information Technologies

Abstract: One of the major problems in the e-health domain is the electronic processing of patient health records. The core of the problem is transforming original, free-text health records into structured documents using a defined structure model. We designed a system that uses a combination of two approaches to perform this transformation. Regular analysis was used to recognize typical patterns such as dates or various biophysical parameters. Linguistic analysis was used to analyze sentences and blocks in the text. The resulting structured documents can be used not only for efficient concept-based information retrieval but also, for example, for knowledge discovery in collections of structured electronic patient records.

Keywords: patient records, e-health, regular analysis, linguistic analysis, structured patient records

1. INTRODUCTION

Original patient health records are often written by physicians as unstructured text that is not suitable for efficient electronic processing. An electronic version, on the other hand, could significantly enhance the possibilities for information retrieval, as well as for further analysis of patient records for various purposes. Many publications have suggested positive effects of electronic messaging in health care. All groups of stakeholders in the e-health domain could profit from electronic patient records, which make it possible to create various types of useful and efficient electronic services.

Much more efficient information retrieval by all types of stakeholders (patients asking about their health status, physicians, health insurance companies) can be used, e.g., for the following purposes:
- Sharing of all available information about patients between various departments in a hospital, as well as between various types of medical specialists, avoiding e.g. unnecessary repetition of similar investigations.
- Generation of pre-filled forms of various types by patients, physicians and insurance companies.

Efficient decision support is possible when a large number of historical electronic health records is available. They can be used, e.g., as follows:

- Decision support for physicians, who could consult similar cases from the past.
- Discovery of new, interesting patterns, which can result in new knowledge about a particular disease, effects of pharmaceuticals, etc. (for more details see Section 4).
- Efficient analysis of the behavior of particular physicians by health insurance companies.

The rest of this paper is structured as follows. Section 2 presents the proposed system. First, some basic characteristics of the real data and their common features are summarized. Second, the functional architecture of the proposed and implemented system is provided and explained. Section 3 describes experiments performed with the implemented system on real data from two hospitals. Section 4 sketches some future possibilities for exploiting the available structured data using knowledge discovery techniques. Finally, Section 5 summarizes the main contributions of this paper.

2. SYSTEM DESIGN

2.1. DATA ANALYSIS

The core problem in the area of electronic health records is the transformation of original, free-text health records into structured documents using a defined structure model. A detailed data analysis was therefore necessary in order to design a suitable architecture for our system.

We have real data available from the European Center for Medical Informatics, Statistics and Epidemiology of Charles University and Academy of Sciences (EuroMISE Center). The data consist of anonymized health records that are structured into blocks (social anamnesis, family history, ECG, etc.). Within the individual blocks, the data are written as free text. The data were acquired from cardiology clinics in two different hospitals in the Czech Republic. Basic statistical characteristics of the data are provided in Table 1.

                            training data   testing data   testing data
                            (hospital 1)    (hospital 1)   (hospital 2)
No. of records
Text size [B]
regular analysis
  found patterns
  patterns/record           8,90            11,81          10,90
  patterns/kb of text       6,49            6,61           5,18
morphological analysis
  found patterns
  patterns/record           7,85            10,60          17,37
  patterns/kb of text       5,72            5,93           8,25
syntactic analysis
  found patterns
  patterns/record           0,30            0,35           0,93
  patterns/kb of text       0,22            0,20           0,44

Table 1: Statistical characteristics of the given data sets

Having analyzed the available health record data, we identified some common problems, e.g.:

- All physicians who produced the health records have their own, unique writing style. Moreover, there are some small but notable differences in the terminology used in different hospitals, which result from different work habits.
- A significant number of typing errors.
- Heavy usage of different acronyms (which may also differ between particular hospitals).
- Data in individual blocks are mixed.

We also have a structured model (in the form of a taxonomy) available from the EuroMISE Center, which we need to fill in with information extracted from the text of patient health records. The model consists, e.g., of features like blood pressure, pulse, characteristics of the ECG, number of cigarettes smoked per day, allergy, etc.

2.2. SYSTEM ARCHITECTURE

Figure 1 shows the functional scheme of the proposed system, which uses a combination of regular and linguistic analysis. The use of linguistic approaches is especially difficult for languages like Czech or Slovak. The system is being implemented using Java technology. Already implemented blocks are marked with a bold border in Figure 1.

[Figure 1: Functional architecture of the proposed system. Blocks in sequence: 1. Free-text document, 2. Regular analysis, 3. Tokenization, 4. Morphological analysis, 5. Identify blocks, 6. Syntactic analysis, 7. Semantic analysis, 8. Context analysis, 9. Mapping to data model, 10. Structured document]

The functionality of the particular building blocks in the proposed sequential architecture is briefly explained in the following:

1. Free-text document: The given free-text patient health record.
2. Regular analysis: Searching for patterns described by regular expressions (for example, a blood pressure entry such as TK120/30 is described by the regular expression TK\d+/\d+). The matches are used as special words in the next steps of the analysis. This was the only type of analysis used in the previously proposed system.
3. Tokenization: Division of the document into individual tokens (i.e. words or regular expressions identified in the previous step).
4. Morphological analysis: Assigning a word class (noun, verb, etc.) and grammatical categories to individual words.
5. Identify blocks: It would be useful to identify particular text blocks, such as the family history block. This is not easy, however, because sentences are not exactly delimited and physicians often do not use regular sentences.
6. Syntactic analysis: We look for simple sentences (if not a whole sentence, then at least some verb phrases), because they carry high information value. We search mainly for subjects and predicates.
7. Semantic analysis: The next step would be to define relations between words, which for the Slovak or Czech language would require a special dictionary with semantic bindings of particular verbs.
8. Context analysis: If a sentence does not have a direct object, the object has to be derived from the context. This is, however, not a typical problem for patient health records.
9. Mapping to data model: This block recognizes patterns using the given data model and the results of the previous regular and linguistic analysis, and saves the recognized patterns to the data model.
10. Structured document: A structured document in XML format is the output of our system.
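
As an illustration of how blocks 2, 3, 9 and 10 fit together, the following is a minimal sketch in Python, not the authors' Java implementation. Only the regular expression TK\d+/\d+ comes from the text above. The function names, the XML element and attribute names (`record`, `bloodPressure`, `systolic`, `diastolic`) and the reading of the two numbers as systolic and diastolic values are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Regular expression from the text, with capture groups added for the two values.
BLOOD_PRESSURE = re.compile(r"TK(\d+)/(\d+)")


def tokenize(text):
    """Split text into tokens; a blood-pressure match stays one 'special word'."""
    return re.findall(r"TK\d+/\d+|\w+", text)


def extract_blood_pressure(text):
    """Return the two numbers of the first blood-pressure pattern, or None."""
    match = BLOOD_PRESSURE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_xml(text):
    """Map the recognized pattern to a small XML document (hypothetical schema)."""
    record = ET.Element("record")
    pressure = extract_blood_pressure(text)
    if pressure is not None:
        node = ET.SubElement(record, "bloodPressure")
        # Assumed interpretation: first number systolic, second diastolic.
        node.set("systolic", str(pressure[0]))
        node.set("diastolic", str(pressure[1]))
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    sample = "pacient TK120/30 puls 70"  # invented sample text
    print(tokenize(sample))
    print(to_xml(sample))
```

A real data model would cover many more features (pulse, ECG characteristics, allergies), each with its own pattern. The sketch only shows the flow from a matched pattern to a structured XML element.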

3. EXPERIMENTS

We used the same training and test data as the earlier work in which only regular analysis was used to structure patient health records. Our goal was to increase the precision with which data model structures are recognized in free text. To evaluate the quality of the transformation we used the coefficients P (precision), R (recall) and the F-measure, defined by equations 1, 2 and 3 respectively. To evaluate F we used β = 0,5 (harmonic mean of P and R).

Precision:

$$P = \frac{\mathit{marked\_relevant}}{\mathit{marked}} \in \langle 0; 1 \rangle \qquad (1)$$

Recall:

$$R = \frac{\mathit{marked\_relevant}}{\mathit{relevant}} \in \langle 0; 1 \rangle \qquad (2)$$

F-measure:

$$F = \frac{PR}{(1 - \beta)P + \beta R} \in \langle 0; 1 \rangle \qquad (3)$$

Where:

- marked_relevant: number of all expressions correctly recognized (marked) by the system as relevant to the given data model
- marked: number of all expressions recognized (marked) by the system as relevant to the given data model
- relevant: number of all expressions in the text that are relevant to the given data model

First, the system was trained with data from patient health records from hospital 1 only. Next, we evaluated the influence of adding linguistic analysis (blocks 3, 4 and 6 in Figure 1) to the regular one (block 2 in Figure 1). The detailed results of the experiments are presented in Table 2 (regular analysis only) and in Table 3 (regular as well as linguistic analysis).

file      marked  fault  unmarked  P     R     F
txt                                0,80  0,80  0,80
39.txt                             0,81  0,62  0,70
61.txt                             1,00  0,75  0,86
75.txt                             1,00  0,80  0,
txt                                0,82  0,90  0,86
20.txt                             0,75  0,75  0,75
65.txt                             0,60  0,30  0,
txt                                0,93  0,81  0,87
64.txt                             1,00  0,75  0,86
98.txt                             1,00  0,82  0,90
total                              0,88  0,72  0,79

Table 2: Patterns found in data from hospital 1 using regular analysis only
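
The measures defined by equations 1, 2 and 3 can be computed as in the following short Python sketch. The function names and the guards against division by zero are assumptions added for this example; the formulas themselves follow the definitions above, with β = 0.5 as the default.

```python
def precision(marked_relevant, marked):
    """Equation 1: share of marked expressions that are actually relevant."""
    return marked_relevant / marked if marked else 0.0


def recall(marked_relevant, relevant):
    """Equation 2: share of relevant expressions that the system marked."""
    return marked_relevant / relevant if relevant else 0.0


def f_measure(p, r, beta=0.5):
    """Equation 3: F = PR / ((1 - beta) * P + beta * R)."""
    denominator = (1 - beta) * p + beta * r
    return p * r / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    p = precision(marked_relevant=8, marked=10)
    r = recall(marked_relevant=8, relevant=10)
    print(p, r, f_measure(p, r))
```

With β = 0.5 the expression reduces to 2PR / (P + R), the harmonic mean of P and R. For example, P = 1.00 and R = 0.75 give F ≈ 0.86, which agrees with the rows in Table 2 that report these values.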

file      marked  fault  unmarked  P     R     F
txt                                0,80  0,80  0,80
39.txt                             0,81  0,62  0,70
61.txt                             1,00  0,75  0,86
75.txt                             1,00  0,80  0,
txt                                0,82  0,90  0,86
20.txt                             0,88  0,88  0,88
65.txt                             0,80  0,40  0,
txt                                0,93  0,81  0,87
64.txt                             1,00  0,75  0,86
98.txt                             1,00  0,82  0,90
total                              0,90  0,74  0,81

Table 3: Patterns found in data from hospital 1 using regular and linguistic analysis

From the tables above we can see that the results improved in two of the 10 analyzed documents, namely in patient records 20.txt and 65.txt (they are marked with bold font in both Table 2 and Table 3).

In the last experiment we applied our system, trained on data from hospital 1 only, to test data from hospital 2. Detailed results of this experiment are presented in Table 4. The results are somewhat worse than in the previous experiment (total F of 0,71 compared with 0,81 in Table 3), but still acceptable. The reason is that physicians in the second hospital use slightly different terminology, different abbreviations, etc., which could not be acquired during training on patient health records from the other hospital.

file      marked  fault  unmarked  P     R     F
txt                                0,82  0,60  0,69
06.txt                             1,00  0,69  0,81
09.txt                             0,25  0,22  0,24
13.txt                             0,86  1,00  0,92
20.txt                             0,40  0,20  0,27
28.txt                             0,69  0,75  0,72
24.txt                             0,91  0,77  0,83
27.txt                             0,82  1,00  0,90
11.txt                             0,92  0,92  0,92
17.txt                             0,73  0,55  0,63
total                              0,76  0,66  0,71

Table 4: Patterns found in data from hospital 2 using regular and linguistic analysis

4. POSSIBLE USE OF TEXT MINING

There are many possibilities for applying text (data) mining approaches to discover new and potentially useful patterns in a large number of electronic patient health records. In this section some of these possibilities are sketched.

- Classification/prediction models: These can be built on patient medical record data with a known value of the target attribute (e.g. diagnosis). For new patients, these classification models can suggest the value of the target attribute. Another application of such predictive text mining approaches is to annotate patient records and/or populate an existing ontology with instances.
- Clustering and descriptive data mining: Clustering and suitable visualization of discovered clusters of patients. Moreover, descriptive data mining techniques may be employed to describe concisely the main characteristics of the patients in one cluster.
- Dialog system: The produced classification/prediction models, clusters with their descriptions, and discovered association rules may be used within a dialog system. This could be a support tool, provided e.g. in the form of an electronic service, that helps doctors when facing a new patient (asking, for example, for a characterization of similar patients from the same cluster, a predicted diagnosis, or known associations for the given patient data).

5. CONCLUSIONS

In this paper we presented the functional architecture of, and the results achieved in first experiments with, a system designed and implemented for the transformation of free-text patient health records into a structured XML format. By using a combination of regular and linguistic analysis we improved the quality of the transformation of free-text documents when the training and testing sets of documents come from the same hospital. When the documents for training and testing come from different hospitals, an improvement of the efficiency measures could not be observed. The presented system is not perfect, but the transformation of electronic patient health records into a structured form is a major challenge, and this system is a first important step towards this goal.

ACKNOWLEDGEMENTS

The work presented in this paper was supported by the Slovak Grant Agency of the Ministry of Education and Academy of Science of the Slovak Republic within the project "Document classification and annotation for the Semantic web" No. 1/1060/04 and by the German-Slovak research project DAAD No. 8/2004 "Text Mining for Metadata Extraction and Semantic Retrieval".

REFERENCES

Furdik, K. (2003): Information retrieval in natural language making use of hypertext structures. Technical University of Kosice. PhD thesis (in Slovak)

Hanzlicek, P. (2002): Development of Universal Electronic Health Record in Cardiology. Health Data in the Information Society: Surjan G., Engelbrecht R., McNair P. (eds.) Amsterdam, IOS Press, pp

Machova, K. (2002): Machine Learning Principles and Algorithms. Elfa Press (in Slovak)

Pales, E. (1994): SAPFO - Design of a Paraphrasing System for the Slovak Language. Bratislava, VEDA (in Slovak)

Paralic, J.; Bednar, P. (2004): Text Mining for Document Annotation and Ontology Support. Intelligent Systems at the Service of Mankind, Ubooks, Germany, pp

Rauch, J. (2001): Mining for Statistical Association Rules. In: Fong J., Ng M.K. (eds.): The 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, University of Hong-Kong, pp

Schultz, S.; Hahn, U. (2001): Medical Knowledge Reengineering - Converting Major Portions of the UMLS into Terminological Knowledge Base. Int. Journal on Medical Informatics, 64 (2-3), pp

Semecky, J. (2001): Multimedia electronic patient record in cardiology. Charles University in Prague. Diploma thesis (in Czech)

Van der Kam, W.; Moorman, P.W.; Koppejan-Mulder, M.J. (2000): Effects of Electronic Communication in General Practice. Int. Journal on Medical Informatics, 60 (1), pp

Vargas-Vera, M.; Celjuska, D. (2004): Event Recognition on News Stories and Semi-Automatic Population of an Ontology. IEEE/WIC/ACM Int. Conference on Web Intelligence (WI 2004), Beijing, China, pp