Standardizing the Medical Data in China

1 Shuli Yuwen, 2 Xiaoping Yang
1 School of Management, Hebei University
1 School of Information, Renmin University of China
1 E-mail: yeslzxhll@ruc.edu.cn
2 School of Information, Renmin University of China
2 E-mail: yang@ruc.edu.cn
doi: 10.4156/jcit.vol6.issue8.13

Abstract
With the wide application of EMR in the medical and health industry, the standardization of medical data has become a major bottleneck for medical and health informatization. This paper proposes a method of standardizing medical data through medical terminology and CDA, represented in XML. The resulting semantic concept tree can be used for further research on EMR analysis, data mining in EMR, and information integration and sharing within hospital internal systems and between medical institutions.

Keywords: EMR, Standardize, Medical Terminology, CDA, XML

1. Introduction
EMR (Electronic Medical Record) has been a major research area in both biomedical informatics and data mining for decades. EMR aims at achieving information integration and sharing both within hospital internal systems and between medical institutions. In EMR research and application, data standardization is the prerequisite for integrating diverse medical systems, and the quality of the standardization solution determines the level that EMR research and application can reach. In this paper, a method is proposed to standardize EMR data with HL7 CDA, based on the customs and terminology of Chinese clinicians.

2. Background
Regarding the standardization of EMR data, Hripcsak G. proposed a hierarchical, modular approach that provides message handling, code conversion, monitoring and alerting, visit services and so on in order to set up an integrated hospital information system [1]; Anderson D. argued that an integrated hospital system can be established by upgrading the core system through a single supplier [2]; the investigation by Kuhn K.A. shows that a single system does not need integration, but in practice it cannot provide specialized functions across the systems of different organizations [3]; Stead W.W. held that a public information reference framework is needed to share EMR data between different organizations and regions, and that new-generation systems should integrate data through terminologies and codes that follow an interoperability standard [4].

In China, EMR technology is developing rapidly. The first EMR product was Junwei 1, developed by the People's Liberation Army General Logistics Department of Health. Sichuan Yinxing, Beijing Upway, Beijing Zhongkeao and Nanjing Haitai subsequently engaged in the development of hospital information systems and EMR. At present, EMR standardization takes the form of semi-structured data entry based on medical ontologies, and different systems use different terminologies.

Many medical standards development organizations and institutes have committed themselves to standardizing EMR data and have made progress, with standards such as HL7 CDA, DICOM SR and openEHR. However, these standards cannot yet meet all current needs in China, and there is still a long way to go before they are put into practice.
3. Reference Work
3.1 Overview of CDA
CDA (Clinical Document Architecture) is an XML-based document markup standard that specifies the structure and semantics of clinical documents, such as discharge summaries and progress notes, for the purpose of exchange. It is an ANSI-approved HL7 standard, intended to become the de facto standard for EMR. A CDA document is composed of a header and a body. The header determines the document's classification and provides basic information about the patient, the provider and so forth. The body contains the clinical report and is composed of nested sections, each of which is either an unstructured segment or an XML fragment. The CDA framework is shown in Figure 1 [12]. This hybrid nature of CDA-compliant XML documents provides the missing link between healthcare information management systems and EMR, improving the processes of coding and abstracting and the accuracy of compliant claims.

Figure 1. CDA Framework

CDA is organized into three levels, as shown in Table 1. Each level iteratively adds more markup to clinical documents, while the clinical content remains constant at all levels. Level One focuses on the content of narrative documents with high-level context; hence there are no coded semantics in Level One. Level Two makes it possible to constrain both the structure and the content of a document by means of a template, which increases interoperability since the receiver knows what to expect. Level Three is intended to facilitate the migration from current free-text documents to more structured CDA documents and to refine the granularity of the structured data.

Table 1. Hierarchy of CDA levels
Level              Description
CDA Level One      The unconstrained CDA specification
CDA Level Two      The CDA specification with section-level templates applied
CDA Level Three    The CDA specification with entry-level templates applied

The main difference between the three levels of CDA documents lies in how much of the content a computer can process and in the degree of semantic constraint placed on the data. Across the levels the content of the clinical document remains the same, but each level carries a different amount of semantics for machine processing. At present, researchers in China and abroad mainly focus on the third level. Systems in different countries interpret and implement it in different ways, and structured data coexist with unstructured narrative descriptions.
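The header/body split and the difference between a narrative-only section and a section with a coded entry can be sketched in a few lines of Python. This is a simplified illustration, not a schema-valid CDA instance: the function name build_cda_sketch is hypothetical, the header is reduced to a title and a patient name (a real CDA header nests the name more deeply and carries many more required elements), and the finding code passed in is a placeholder rather than a real SNOMED CT identifier.

```python
import xml.etree.ElementTree as ET

HL7_NS = "urn:hl7-org:v3"                 # namespace used by HL7 v3 / CDA documents
SNOMED_CT_OID = "2.16.840.1.113883.6.96"  # OID that identifies SNOMED CT as a code system


def _q(tag):
    """Qualify a tag name with the HL7 v3 namespace."""
    return f"{{{HL7_NS}}}{tag}"


def build_cda_sketch(patient_name, narrative, coded_finding=None):
    """Build a simplified CDA-like tree: a header followed by a one-section body.

    With coded_finding=None the section holds narrative text only (Level One style).
    With a (code, display_name) pair it also holds a coded entry (Level Three style).
    """
    doc = ET.Element(_q("ClinicalDocument"))

    # Header (heavily simplified for illustration)
    ET.SubElement(doc, _q("title")).text = "Progress Note"
    record_target = ET.SubElement(doc, _q("recordTarget"))
    ET.SubElement(record_target, _q("name")).text = patient_name

    # Body: component -> structuredBody -> component -> section
    body = ET.SubElement(ET.SubElement(doc, _q("component")), _q("structuredBody"))
    section = ET.SubElement(ET.SubElement(body, _q("component")), _q("section"))
    ET.SubElement(section, _q("text")).text = narrative

    if coded_finding is not None:
        code, display_name = coded_finding
        entry = ET.SubElement(section, _q("entry"))
        observation = ET.SubElement(entry, _q("observation"))
        ET.SubElement(
            observation,
            _q("code"),
            {"code": code, "displayName": display_name, "codeSystem": SNOMED_CT_OID},
        )
    return doc


if __name__ == "__main__":
    ET.register_namespace("", HL7_NS)  # serialize with a default namespace
    example = build_cda_sketch(
        "Test Patient",
        "Epigastric pain for two weeks.",
        ("PLACEHOLDER-001", "gastric ulcer"),  # placeholder code, not a real identifier
    )
    print(ET.tostring(example, encoding="unicode"))
```

The narrative text is identical in both variants; only the amount of machine-processable markup changes, which is the property the three CDA levels are built around.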
3.2 Overview of SNOMED
The International Systematized Nomenclature of Human and Veterinary Medicine (SNOMED) has developed into a comprehensive set of over 150,000 records in twelve different chapters or axes [10]. SNOMED Clinical Terms (SNOMED CT) is a universal healthcare terminology and infrastructure that enables interoperability between diverse clinical systems and facilitates communication between healthcare professionals by means of clear and unambiguous patient information. SNOMED CT is a structured list of terms that can be used to describe the conditions of patients, including diseases, operations, treatments, drugs and healthcare administration. It makes possible the detailed recording of findings, diagnoses and the subsequent treatment of patients, both within a single episode of healthcare and across the patient's full care record.

The SNOMED CT structure is concept-based. Each concept represents a unit of knowledge and is described by one or more natural-language terms. Every concept has its context, its conditions and its connections with other concepts, including hierarchical is-a relationships as well as other relationships that describe clinical attributes. These make up the core components of SNOMED CT: the concept table, the description table and the relationship table.
Concept table: contains the concept and its ID, the concept status, the fully specified name and so on.
Description table: contains the description and its ID, the description status, the concept ID, the term, the initial capital status, the description type and so on.
Relationship table: contains the relationship ID, the IDs of the two related concepts, the relationship type and so on.
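A minimal in-memory sketch of the three tables may help make their roles concrete. The field names follow the description above, but the record layouts are simplified, the status and type values are illustrative, and every ID below is a made-up placeholder rather than a real SNOMED CT identifier; the helper functions are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: int
    status: str
    fully_specified_name: str


@dataclass(frozen=True)
class Description:
    description_id: int
    status: str
    concept_id: int          # the concept this term describes
    term: str
    initial_capital_status: bool
    description_type: str


@dataclass(frozen=True)
class Relationship:
    relationship_id: int
    source_id: int           # first concept
    relationship_type: str
    target_id: int           # second concept


# Sample rows with placeholder IDs (not real SNOMED CT identifiers)
CONCEPTS = [
    Concept(1, "current", "gastric ulcer"),
    Concept(2, "current", "disease of stomach"),
    Concept(3, "current", "gastrointestinal ulcer"),
]
DESCRIPTIONS = [
    Description(101, "current", 1, "gastric ulcer", False, "preferred"),
    Description(102, "current", 1, "stomach ulcer", False, "synonym"),
    Description(103, "current", 1, "GU", True, "synonym"),
    Description(104, "current", 1, "gastric ulceration", False, "synonym"),
]
RELATIONSHIPS = [
    Relationship(201, 1, "is-a", 2),
    Relationship(202, 1, "is-a", 3),
]


def terms_for(concept_id):
    """All terms in the description table that describe one concept."""
    return [d.term for d in DESCRIPTIONS if d.concept_id == concept_id]


def concepts_for_term(term):
    """Concept IDs whose descriptions match a term (case-insensitive)."""
    wanted = term.lower()
    return [d.concept_id for d in DESCRIPTIONS if d.term.lower() == wanted]


def parents(concept_id):
    """Names of the concepts reached from concept_id through is-a relationships."""
    names = {c.concept_id: c.fully_specified_name for c in CONCEPTS}
    return [
        names[r.target_id]
        for r in RELATIONSHIPS
        if r.source_id == concept_id and r.relationship_type == "is-a"
    ]


if __name__ == "__main__":
    print(terms_for(1))
    print(concepts_for_term("stomach ulcer"))
    print(parents(1))
```

Keeping terms in a separate description table is what lets several clinician phrasings resolve to one concept, while the relationship table carries the hierarchy independently of wording.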
The following example shows a concept, its descriptions and its relationships:
concept: gastric ulcer
concept descriptions: 1 gastric ulcer, 2 stomach ulcer, 3 GU, 4 gastric ulceration
relationships of the concept: 1 is-a disease of stomach, 2 is-a gastrointestinal ulcer, 3 finding site: stomach

4. Research Framework
4.1 System Model
In general, a common EMR structure uses terminology as its infrastructure, which provides a firm basis for the future development of enhanced clinical decision support. The idea of a common language and of communicating standard clinical information is a simple one, but so far it has not been put into practice. CDA data are built on large medical terminology and coding systems such as SNOMED, LOINC and ICD-10, whose terms are referenced in CDA documents by their codes. However, these terms do not fit China's health situation. We therefore need to map and compose these terms according to the customs of Chinese clinicians and form a medical terminology that fits Chinese national conditions. Based on this terminology, an XML schema or DTD consistent with CDA is used to create a data model that accords with Chinese clinicians' customs and medical information. Finally, HL7 CDA constraints are applied to obtain an EMR data tree model that can be shared between systems, as shown in Figure 2:

Figure 2. System Model Framework (layers: HL7 CDA; XML Schema, DTD; Chinese Medical Terminology. Labels: Semantic, Syntax, Representing, Compounding, Concept Localization)

In the system model, the clinical terminology is the foundation. It is drawn mainly from SNOMED CT and from clinicians' practice, and it also uses LOINC for laboratory data, ICD-10 for disease data and the Chinese National Drug Library for drugs, so that data shared between different systems can be described in more detail and with accurate definitions from multiple sources. Large numbers of concepts are extracted with SNOMED CT tools, using a formalized method and the clinical practice data of the cooperating hospital. The extracted concepts fall into two parts: metadata, which are extracted directly, and components, which are combinations of metadata. Together the two parts form our clinical terminology conforming to Chinese clinical custom, which is the core of our model and the foundation of EMR data standardization.

A further task is to combine the operations of the clinical diagnosis process according to the SNOMED CT operation concepts and clinical practice; the result is coarse-grained EMR data represented by an XML schema or DTD. The XML schema or DTD is then manually refined into finer-grained EMR data according to clinical practice, from the viewpoint of EMR data structuring. The XML schema or DTD conforms to the W3C standard, but the EMR data also need the HL7 standard in order to be shared between different systems, so the XML schema or DTD must be revised according to the HL7 CDA constraints. The main techniques are described below.

4.2 SNOMED CT to Chinese clinical terminology
SNOMED CT provides subset software tools that can extract concepts from its concept table. First, a large number of concepts are extracted according to the results of clinical investigation to form the foundation of the medical terminology.
A concept may have different meanings under different conditions, so we also need to record the environment of the concept. In SNOMED CT, the description table and the relationship table describe the context of a concept, and through them we can record the situation in which EMR data arise. These concepts form the basic elements of our Chinese medical terminology. Secondly, our investigation found many EMR data that clinicians use routinely but that are not in SNOMED CT; these can be described with several SNOMED CT concepts. Such combined concepts express complex information according to clinical practice and form the components of our system model. The elements and components together constitute the Chinese medical terminology.

In this system framework, the clinical event and SNOMED CT are the foundation. Large numbers of concepts are extracted from SNOMED CT, using a formalized method and the clinical practice events of the cooperating hospital; an extracted item may be a single SNOMED CT concept or a combination of concepts. The two parts form our clinical terminology conforming to Chinese clinical custom. When a clinician uses the CDA templates, the related concepts can be retrieved from this terminology.

4.2.1 Definition of Requirement
In practice, clinicians prefer to describe EMR data in narrative natural language or descriptive language, so a clinical term may have different meanings for different clinicians in the same situation, each following their own customs. The most important data for a patient are those concerning the treatment process. Therefore, following the clinician's treatment process, the information model comprises five events: chief complaint and history of present illness, general examination, diagnosis and prescription, laboratory examination, and instrument inspection. Each event mainly uses the SNOMED CT top-level concepts listed below:
chief complaint and history of present illness: clinical findings, observable entity, body structure, environments and social context and so on;
general examination: clinical findings, procedure, operation, observable entity, body structure and so on;
diagnosis and prescription: clinical findings, observable entity, pharmaceutical and biologic product and so on;
laboratory examination: clinical findings, observable entity, organism, specimen and so on;
instrument inspection: clinical findings, observable entity, organism, specimen, environments and social context and so on.

4.2.2 Main Idea
From the requirement definition above we can see that each event is related to many SNOMED CT top-level concepts, and that one SNOMED CT top-level concept can apply to many events. Clinical practice shows that clinicians prefer descriptive wording in their treatment process, so we transform the clinical descriptions found in practice into SNOMED CT concepts and descriptions, which compose our clinical terminology. The main procedure is as follows:
1) Perform Chinese word segmentation on the EMR samples from the hospital, and have a clinical domain expert sort the resulting segments by frequency;
2) Extract the key words from the EMR in turn, according to clinicians' custom;
3) Retrieve each key word from the SNOMED CT description table, and store the retrieval result in the clinical terminology;
4) For each retrieval result from the description table in step 3), retrieve its semantic concept from the SNOMED CT concept table, and also store the semantic concept in the clinical terminology;
5) Use the concepts in the clinical terminology to describe the EMR data and templates;
6) If an EMR concept used by a clinician is not in the clinical terminology, repeat steps 1)-4) to revise the clinical terminology.

4.2.3 Formalizable Description
We denote the clinical event description by $D = \{d_1, d_2, \ldots, d_i\}$, which is the set of EMR segments. Let $SD = \{sd_1, sd_2, \ldots, sd_m\}$ be the SNOMED CT description table and $SC = \{sc_1, sc_2, \ldots, sc_n\}$ the SNOMED CT concept table. We denote our clinical terminology by $T = \{t_1, t_2, \ldots, t_j\}$.

In the clinical event description D, the segments are sorted by frequency by the clinical domain expert, based on the sample EMR. Moreover, clinical practice and the context of SNOMED CT show that clinical segments are mostly sets of concepts recorded in descriptive language. The first step is therefore to compute the intersection of D and SD, rather than the intersection of D and SC; this intersection composes the main part of our clinical terminology T. The formulas are as follows:

$D \cap SD \subseteq T$    (1)
$D \cap SC \subseteq T$    (2)

In practice, clinicians often use several SNOMED CT concepts to compose one term, and one concept can be expressed by many description words. The second step is therefore to establish the relationships between these terms on the clinical data set T. We denote this set of relationships by R; the formula is as follows:
$R = \{\, r \mid r \text{ is a relationship between the concepts and the description words} \,\}$    (3)

4.3 Rules of EMR Data Structurization
4.3.1 Structurizing Presentation of EMR Data
In practice, clinicians prefer to describe EMR data in narrative natural language, so a clinical term may have different meanings for different clinicians in different situations, each following their own customs. The granularity at which medical data are structured therefore has to take into account the clinicians' customs, the efficiency of the system and the intended use of the data. Narrative data and structured data each have their own advantages and requirements, and a balance has to be kept between them to make up for the fact that CDA documents alone cannot satisfy the demands of both.

From Table 1 we can see that even at the third level, the granularity of structured medical record data is too coarse to be applied directly in practice. In our system, CDA Level Three is therefore refined according to the clinical diagnosis process that we investigated in many hospitals. This yields the main sections of our EMR: patient information, process of present diagnosing and treating, reference history and the past EMR, as Figure 3 shows. The EMR data can then be divided further, down to the granularity that is needed to support later use of the data.

Figure 3. EMR Concept Tree

The rules for structuring the data granularity are as follows:
1) The element, component, section and template make up the four levels of the EMR data structure.
2) The element is the indivisible unit of medical record information. It is a SNOMED CT concept selected on the basis of our investigation. Each element is represented as an attribute, a sub-element or a field, such as name, gender and so on.
For convenience of data processing, an element may be handled as a child element or a field, for which the data type, frequency of occurrence and so on must be defined.
3) The component is a set of clinical data elements that is frequently used by Chinese clinicians and has a specific clinical meaning, such as a fragment of a diagnosis result or a fragment of an inspection. A component is defined as a complex data type composed of the simple data types in 2); in practice it can reference other simple or complex data types through the keyword ref.
4) The section combines many elements and components and has a specific clinical meaning. It is used to save patient information such as general information, physical examination information and diagnosis information. A section is defined in a way similar to the component described in 3).
5) The template is a medical record style defined beforehand by the clinicians of each department. Each clinician may choose the corresponding template to fill in the medical record information; the template can define semantic information dynamically and can be adjusted by the clinician. The template is a complete schema: taking the clinical document as its root node, it ties together the information defined above.

4.3.2 Structurizing Storage of EMR Data
The storage solution for EMR directly affects the efficiency and complexity of medical data processing. To increase system efficiency and reduce data redundancy, this paper introduces a storage method whose data model combines relational tables and XML. On the one hand, relational tables store the medical data that are less complex and relatively static, so that operations such as queries and table joins can be done more efficiently. On the other hand, semi-structured data, which are more complex and less structured, are stored as XML under an XML schema, which avoids problems that a relational database would bring, such as complex relations between tables, nested queries and sparse data. In addition, this makes it easier to query a patient's information and to share data between different medical systems. The two modes exist simultaneously and independently, and they cooperate according to their different functions.

In order to use CDA XML to describe the EMR, an XML schema must first be defined to constrain the EMR data. The constraints mainly cover the sub-elements, the properties of each element, the sequence of the sub-elements and the data types. These constraints are shown in the graphical design of the XML schema. For example, the data type is displayed in the upper left of the graph, stacked squares indicate an element that may appear multiple times in the XML document, and dashed boxes indicate an element that is optional in the XML document.

The general storage solution of EMR follows the data structuring method described in 4.3.1.
The detailed rules are as follows:
1) Each element, such as name or sex, is processed as a property or a sub-element. Generally it is processed as a sub-element for convenience.
2) Components and sections are defined as complex data types, which have multiple properties or sub-elements. They can also refer to other sub-elements or complex data types.
3) The template is a complete XML schema; it is the root of the EMR and describes the information of the EMR document. One schema represents one treatment process of a patient.

According to medical practice, a patient's basic information is unlikely to change, its data types are the most structured, and, more importantly, it relates to every other part of the EMR. We therefore use relational tables to store this information to increase the system's efficiency. Other data, such as the patient's past history, personal history and family history, do not change very often and are relatively structured, but they vary from person to person, and the information that needs to be stored differs as well. For such data, relational tables would only increase the complexity of medical data exchange and the redundancy of the system, so we store them as XML under an XML schema.

The most important data for a patient are those concerning the treatment process, but these data are also subject to major changes. The system is therefore designed with an independent XML storage pattern for every treatment process. For example, the treatment process for an outpatient includes registration, triage, treatment, general examination, laboratory tests, feedback from the results of examination, diagnosis, prescription and so on. However, according to the medical record standard handbook issued by the medical department, some of these steps are optional, so clinicians need to be able to make certain modifications to the XML schema [9][10][11].

5. Experiments
A system prototype has been developed in cooperation with a hospital, following the XML schema and the CDA-based method above. For part of the hospital's departments it achieves a balance between document standardization and structuring, and it provides one method of integrating the hospital's systems. HL7 CDA describes the clinical documents on the basis of the XML schema, and the main XML document is shown in Figure 4:
Figure 4. XML Document

The system prototype implements an EMR data storage solution that mixes a relational database with a native XML database, using DB2 V9 and J2EE. We designed two methods of storing EMR data: the first shreds the medical data into relational database tables, and the second combines relational database tables with XML storage under an XML schema. The experiment uses the medical data of 1000 outpatients from our cooperating hospital and tests three typical database operations:
1) Insert a patient's medical data;
2) Query a single patient with a simple query by name or patient ID;
3) Multi-condition search, with conditions such as name + time, name + clinician's name, and so on.

For insertion, we tested inserting the data of 1, 5, 10 and 20 patients. The results show that the combined storage solution is significantly better than the solution that uses only relational database tables (see Figure 5). This is because the purely relational solution has to create relations between all the tables, which increases data redundancy, whereas the combined solution only needs to insert data according to the XML pattern, which avoids unnecessary operations.

Figure 5. Result of Insert
Figure 6. Result of Query

EMR data are mainly used for comprehensive analysis at a later stage, so we designed five queries: Query1 is the simplest query and concerns only the patient's information; Query2 retrieves all of the patient's medical record information; Query3 searches by time; Query4 uses the patient's and the department's information together; Query5 uses the patient's and the clinician's information together. The results are shown in Figure 6. For queries that use only the patient's information, the two methods show similar performance. However, when the queries concern the patient's medical information, the combined solution performs significantly better. This is because a relational query has to use multiple join operations to complete the job, and too many joins between
database tables would inevitably affect the performance of the system. Query4 and Query5 are both combined queries, but their results differ greatly. This is because the department data are listed directly in the patient's medical record, while the doctor's data are kept in an independent table.

From the design perspective, this solution reduces the number of relational tables and is quite succinct thanks to the use of XML schema. From the application perspective, the test on 1000 outpatient EMR records indicates that this standardized storage plan clearly outperforms the traditional storage method that uses only relational database tables. In particular, since EMR data are used heavily for statistical analysis at a later stage, the mixed storage plan has the advantage of improving the system's query speed.

6. Conclusion and Future Work
EMR data standardization has long been held back by the pace of the standards organizations, and it can be put into practice only when the systems in use allow it. In addition, existing medical terminologies are suited mainly to English-speaking countries, which hinders the rapid development of EMR in China. This paper proposes a method of standardizing medical data by localizing medical terminology and HL7 CDA, and develops a prototype that partially realizes information integration across the internal systems of hospital departments. Further research will aim at optimizing the prototype and improving the Chinese medical terminology with ICD-10, LOINC and other medical terminologies.

7. Acknowledgement
This research was supported by the National Natural Science Foundation of China under Grant 70871115, and by the Society Science Union of Hebei under Grant 201004020.

8. References
[1] Jun Liang, MeiFang Xu, LiZhong Zhang, LanJuan Li, XiaoLin Zheng, DeRen Chen, ShengLi Yang, BaoLuo Li, Ou Jin, Zhou Ji, JunXiang Sun, "Developing Interoperable Electronic Health Record Service in China", JDCTA, Vol. 5, No. 4, pp. 280 ~ 295, 2011.
[2] Lilac A. E. Al-Safadi, "Electronic Medical Record Ontology Mapper", IJACT, Vol. 1, No. 1, pp. 85 ~ 97, 2009.
[3] Anderson D., Where did HISS go? A view from Burton upon Trent, Healthcare Computer Information Manage, Vol. 3, No. 1, pp. 85 ~ 97, 2000.
[4] Rossi Mori A., Consorti F., Galeazzi E., Standards to support development of terminological systems for healthcare telematics, Methods of Information in Medicine, Vol. 11, No. 1, pp. 44 ~ 56, 1998.
[5] Tange H.J., Schouten H.C., Kester A.D., The granularity of medical narratives and its effect on the speed and completeness of information retrieval, The Journal of the American Medical Informatics Association, Vol. 7, No. 3, pp. 23 ~ 31, 1998.
[6] Duclos C., Venot A., Structured representation of drug indications: lexical and semantic analysis and object-oriented modeling, Methods of Information in Medicine, Vol. 3, No. 1, pp. 85 ~ 97, 2000.
[7] Berg M., Goorman E., The contextual nature of medical information, International Journal of Medical Informatics, Vol. 11, No. 1, pp. 34 ~ 45, 1999.
[8] Dolin R.H., Alschuler L., Boyer S., Beebe C., Behlen F.M., Biron P.V., Shabo A., HL7 Clinical Document Architecture, Release 2.0, ANSI-approved HL7 Standard, 2005.
[9] SNOMED, SNOMED Clinical Terms (SNOMED CT), College of American Pathologists, http://www.snomed.org/, 20101009
[10] Los R.K., van Ginneken A.M., OpenSDE: A strategy for expressive and flexible structured data entry, International Journal of Medical Informatics, Vol. 9, No. 1, pp. 11 ~ 15, 2005.
[11] HL7, Health Level 7, http://www.hl7.org, 200707
[12] HL7 organization, HL7 Version 3, http://www.hl7.org/v3ballot/html/welcome/environment/index.htm, 20080313
[13] HL7, HL7 TermInfo, http://www.hl7.org/special/committees/terminfo/index.cfm, 20080518
[14] Huff S.M., Proposal for Ontology for Exchange of Clinical Documents, http://www.hl7.org/special/docs/documentontologyproposaljuly00.doc, 20101223
[15] W3C, XML Schema, http://www.w3.org/xml/schema, 20090108
[16] Peng Zhaoli, Meng Pu, Cheng Yujia, Medical Constitutes Medical Record Writing Standard, Scientific Publishing, China, 2005.
[17] CHIMA, CHIMA, http://www.chima.org.cn, 20080325
[18] China Medical Information Standardization, Medical Information Standardization, http://www.chiss.org.cn, 20091215
[19] Ministry of Health of the People's Republic of China, Electronic Medical Record Standard, www.moh.gov.cn, 20100222
[20] CHISS, Accenture, Reports of Chinese Hospital Informationization Development (white paper), http://www.moh.gov.cn, 20080605