Automated Tool for Anonymization of Patient Records


Automated Tool for Anonymization of Patient Records

Nikita Raaj
MSc Computing and Management
2011/2012

The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student)

Acronyms

ANNIE - A Nearly New Information Extraction system
CRISP-DM - Cross Industry Standard Process for Data Mining
CR-UK - Cancer Research UK
EHR - Electronic Health Records
GATE - General Architecture for Text Engineering
GP - General Practitioner
HIPAA - Health Insurance Portability and Accountability Act
i2b2 - Informatics for Integrating Biology and the Bedside
JAPE - Java Annotation Patterns Engine
JISC - Joint Information Systems Committee
MedLEE - Medical Language Extraction and Encoding System
MeDS - Medical De-identification System
MeSH - Medical Subject Headings
NER - Named Entity Recognition
NHS - National Health Service
NLP - Natural Language Processing
PHI - Protected Health Information
POS - Part of Speech
PPM - Patient Pathway Manager
RAD - Rapid Application Development
UMLS - Unified Medical Language System

Technical Terms

Anonymization - Information which does not identify an individual directly, and which cannot reasonably be used to determine identity [1]. In this thesis, anonymization refers to the removal of names, locations and any other details that might support identification. Anonymization and de-identification are used interchangeably in the project report.

Corpus - A collection of text of a specific kind.

Free text - The body or content of the notes in patient records; also referred to as the narratives related to a patient.

GeoNames - A geographical database that covers all countries and contains over eight million place names, available for download free of charge.

HIPAA - HIPAA defines policies, procedures and guidelines for maintaining the privacy and security of individually identifiable health information [2].

Module - A sub-section or function of the code; each such function is referred to as a module.

PHI - Any information about health status, provision of health care, or payment for health care that can be linked to a specific individual [3, 4]. In this thesis, PHI is sought out in datasets for de-identification before researchers can share the dataset publicly.

PPM - The Patient Pathway Manager (PPM) is a clinical information system developed and used at Leeds Teaching Hospitals Trust (LTHT).

Summary

In this thesis we detail an approach to identifying and removing information that compromises privacy in free text patient records. To accomplish this, various text mining approaches and natural language processing methods and techniques have been applied, among them the use of dictionaries and of regular expressions to identify patterns. These have been experimented with and evaluated in order to anonymize patient data.

Access to real patient data is restricted to clinicians, and the challenges faced while acquiring such a dataset meant that we could not obtain large datasets for development and testing. Instead, a simple, easy-to-use tool was developed and provided to some of our clinical contacts, who were asked to download and test the tool on their PCs inside the fence and report back on their results. This led to the development of a lightweight system that is simple to install and run.

Before designing and implementing the project, background research was undertaken to analyse previous work on natural language processing approaches used to anonymize data. This provided sufficient information and ideas on how to structure and design the project. Moreover, various tools and libraries were employed to aid the implementation phase, saving valuable time. Using the knowledge obtained in the background research, a prototype has been designed and developed for use. Overall, the work is aimed at creating a method of approaching textual data for anonymization using natural language processing.

Acknowledgements

Most of all, I would like to thank my supervisor, Dr. Eric Atwell, for his guidance throughout my project. He was able to provide enough rope for me to be creative, while providing sufficient guidance to prevent entanglement. Without his expertise and advice during the various stages, this project would not have been where it is today. I would also like to thank Dr. Vania Dimitrova for her valuable constructive feedback, pointers and new insights on my interim and progress reports.

Saman Hina and Sarah Cook helped me through their knowledge and expertise of the subject area, and I will always be grateful to them. Saman not only shared her knowledge of the subject area but grew to be a really good friend with whom I shared an amazing time.

My sincere appreciation goes to the students and evaluators who, despite their busy schedules, provided the support and valuable feedback required for my project: Dr. Geoff Hall from the Leeds Cancer Research Centre, Samuel Danso from the University of Leeds, and Lukasz and Nikita Desai from St. Michael's Hospital.

Were it not for the support of my family, I would never have been able to come and work amidst such exceptional people. They have been a strong shoulder to lean on during every step. Finally, I would like to thank my sister Sanjna for her endless love and support. She was there to cheer me during the good and the challenging times in my pursuit of my hopes and dreams.

List of Equations

Equation 1: IAA (Inter-Annotator Agreement)
Equation 2: Precision
Equation 3: Recall
Equation 4: F-measure

List of Figures

Figure 1: CRISP-DM Cycle
Figure 2: RAD Process Model
Figure 3: Legal Guidelines for PHI data, HIPAA 2003
Figure 4: De-identification process illustrated as a flowchart
Figure 5: Tagging of person name example
Figure 6: Ambiguous person name tagging
Figure 7: Tagging of relative names (using context clues)
Figure 8: Anonymization of tagged data
Figure 9: Med_Anonymizer toolkit (zipped file)
Figure 10: Files within the Med_Anonymizer toolkit
Figure 11: Output files after running the Med_Anonymizer toolkit
Figure 12: Leeds Cancer Centre input file
Figure 13: Project plan Gantt chart
Figure 14: Presentation slides
Figure 15: Med_AnonymizerCode class and manifest file contents
Figure 16: Command line arguments to create jar file

List of Tables

Table 1: PHI calculation for the 2 annotators
Table 2: PHI category breakdown in gold standard corpus
Table 3: Calculations based on the 2 sessions held for development of the gold standard corpus
Table 4: Performance on gold standard corpus
Table 5: Ghana test corpus PHI count
Table 6: Ghana test corpus performance
Table 7: India test corpus performance
Table 8: ANNIE performance results

Table of Contents

Acronyms
Technical Terms
Summary
Acknowledgements
List of Equations
List of Figures
List of Tables

Chapter 1 Introduction
    1.1 Overview
    1.2 Problem Statement
    1.3 Aim and Objectives (Business Understanding)
    1.4 Minimum Requirements and Deliverables
    1.5 Further Enhancements
    1.6 Project Schedule and Progress Report
        Working to the Project Plan
    1.7 Project Methodology
    1.8 Thesis Roadmap

Chapter 2 Background Research
    2.1 Literature Review
    2.2 Background
    2.3 Related Work
        Text de-identification based on Machine Learning Methods
        Text de-identification based on Pattern based Systems
            MeDS
            De-Id
            MedLEE
            Perl-based de-identification software package
            Medtag using Semantic Lexicon
            Scrub Systems
        Critical Analysis of Tools
    2.4 Text Mining
        GATE
    2.5 Additional Support for Prototype: Use of Java

Chapter 3 Design of Solution
    3.1 Data Understanding
        3.1.1 Acquisition of the Data Set
        3.1.2 JISC Corpus
    3.2 Data Preparation: Corpus Creation
    3.3 Data Quality
    3.4 Modelling
        3.4.1 Requirements
        3.4.2 Design Considerations
    3.5 Chapter Summary

Chapter 4 Implementation
    4.1 Implementation of Requirements
        PHI Categories
        Dictionaries
        Gold Standard Corpus
            Development of the Gold Standard Corpus
            Development Phase
    Tagger Algorithm
        PHI Modules
            Person Names
            Location Names
            Dates
            Telephone/Fax Numbers
            NHS Number/National Insurance Number (NINo)
            Other PHI Categories
            Un-identified PHI Categories
            Discussion
    Anonymizer Algorithm
    Deployment
        User Task 1: Downloading and Installing the Toolkit
        User Task 2: Running the Med_Anonymizer jar file
    Chapter Summary

Chapter 5 Evaluation
    5.1 Evaluation of Tools and Techniques
    5.2 Evaluation Metrics
        Performance Measures: Precision, Recall and F-measure
    5.3 Evaluation using Gold Standard Corpus
        Algorithm performance and results
        Discussion
    5.4 Evaluation with Ghana Test Corpus
        Results and Feedback
        Discussion
    5.5 Evaluation with India Test Corpus
        Results and Feedback
        Discussion
    5.6 Evaluation with the Leeds Cancer Centre dataset
        Overview
        Results and Feedback
        Enhancements made to the algorithm
    5.7 Evaluation with another system: GATE (ANNIE)
    5.8 Chapter Summary

Chapter 6 Conclusion
    Thesis Summary (Project Evaluation)
        Aim and Minimum Requirements
        Exceeding Requirements
    Conclusion
    Challenges
    Contributions
    Future Work
        Enhancements
        Future Projects

Bibliography

Appendix A
    A.1 Project Reflection
Appendix B
    B.1 Interim Report Comments: Supervisor
    B.2 Interim Report Comments: Assessor
Appendix C
    C.1 Project Plan - Gantt Chart
Appendix D
    D.1 Regular Expressions
        D.1.1 NINo (National Insurance Number)
        D.1.3 IP Address
        D.1.4 Postcode
        D.1.5 Date
Appendix E
    E.1 Presentation delivered at Progress Meeting, July
Appendix F
    F.1 Comments on the software prototype
        F.1.1 Samuel Danso, Kintampo Health Research Centre, Ghana
        F.1.2 Nikita Desai, Li Ka Shing Knowledge Institute, St. Michael's Hospital, Canada
        F.1.3 Dr. Geoff Hall, Senior Oncologist, Leeds Cancer Centre
Appendix G
    G.1 Implementation of Jar file
Appendix H
    H.1 Tag Names
Appendix I
    I.1 Code
    I.2 Toolkit
    I.3 User Manual

Chapter 1 Introduction

This project is related to an ongoing Leeds cancer research funded programme undertaken by Geoff Hall, Senior Lecturer and Oncologist at the School of Medicine, and other applicants. Dr. Eric Atwell, Senior Lecturer at the University of Leeds, is a co-applicant of the research project and posted it as an MSc project for students. I then took up the challenge to work on this project for my dissertation.

1.1 Overview

New discoveries and practices are constantly evolving in the medical field, and patient records have been a goldmine of information for members of the medical community. These records account for a large archive of data and can be used to generate statistical models that, for example, predict patient outcome given clinical findings, provide diagnoses given symptoms, and identify particular trends[5]. One of the challenges has been to make the data stored within clinical databases available for research purposes or for transfer to health care providers (e.g. clinicians, students and teachers). The stored data contains knowledge that is valuable not just for enhancing medical knowledge but also for researchers interested in studying clinical trends and understanding the factors in disease recurrence and transfer to palliative care. Often, however, this knowledge is buried in the free text clinical notes of the patient records, and there has been a constant need for this data to be made available without disclosing Protected Health Information (PHI), in order to adhere to patient confidentiality.

Patient records themselves contain a mix of structured and unstructured data. Structured data contains field names, so it is easy to identify any PHI associated with a field and eliminate it. But for unstructured free text data, which represents the notes recording the patient history, illness, medication and so on in a descriptive format, it is hard to determine where and what PHI appears in the record. One method is to manually browse through the free text and extract only de-identified information, which is extremely time-consuming. However, through automated de-identification approaches, the information can be extracted and integrated much more efficiently from disparate text sources.

A discussion of such approaches is presented in the background research in chapter 2. The Leeds Cancer Centre is one such medical centre with a database of large volumes of free text patient records requiring anonymization. This was the main drive to try to build a prototype that can de-identify free text patient data and so reduce the manual effort of anonymization.

1.2 Problem Statement

Researchers cannot be provided with access to free text patient records, due to the need to maintain patient privacy as mentioned earlier. According to the HIPAA specifications for maintaining patient confidentiality, patient records must be free of 18 PHI categories, such as patient names and dates (section 2.2 addresses the HIPAA guidelines in more detail). Following these guidelines for protecting the privacy of health care information, the goal of the de-identification process is to recognize instances of PHI in free text patient records and remove them or replace them with anonymous tag names. For example, in the sentence "Susan has a terrible flu and has been advised bed rest", the name Susan should be identified as the person name within the free text and either removed or replaced with an anonymous tag.

In principle, for a small corpus of text annotations and letters, the data can be scrutinized by clinicians and trained medical staff to remove any identifiable references by hand. This is not just expensive but becomes extremely difficult and time-consuming when the corpus is significantly large. Designing an automated prototype that reduces manual anonymization is thus a suitable approach. Several tools have been built for this purpose, some designed from scratch and some incorporating previously designed tools in the de-identification process. But most tools for de-identification of free text patient records have focussed on the text stored in one particular dataset and hence are not directly applicable to a different dataset.

While designing an automated approach, one of the challenges before building the prototype was to understand the PHI categories stored within the dataset: the prototype cannot be developed effectively without access to patient records from which to understand the data. Obtaining a dataset of free text patient records was therefore of great importance for the thesis.
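To make the tag-replacement idea concrete, the following minimal Java sketch replaces dictionary names with an anonymous tag, using the << >> tag format adopted later in the design chapter. The class name and dictionary contents are illustrative assumptions, not the prototype's actual code.

```java
import java.util.List;
import java.util.regex.Pattern;

// Minimal sketch of dictionary-based name replacement. The dictionary and the
// class name are illustrative; the << >> tag format follows the design chapter.
public class NameScrubSketch {

    public static String scrubNames(String text, List<String> nameDictionary) {
        for (String name : nameDictionary) {
            // \b ensures whole-word matches, so "Susan" does not match "Susanna".
            text = text.replaceAll("\\b" + Pattern.quote(name) + "\\b", "<<NAME>>");
        }
        return text;
    }

    public static void main(String[] args) {
        String note = "Susan has a terrible flu and has been advised bed rest";
        System.out.println(scrubNames(note, List.of("Susan")));
        // -> <<NAME>> has a terrible flu and has been advised bed rest
    }
}
```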

1.3 Aim and Objectives (Business Understanding)

The overall aim of the project is to use text analytics tools and techniques to develop a de-identification tool to anonymize free text patient records. In order to accomplish this aim, the following objectives must be achieved:

1) Obtain a corpus of free text patient records that has not yet been de-identified, to analyse possible PHI categories within the text that need to be removed
2) Identify patterns within the free text to design rules and expressions for the PHI categories, to automate the de-identification process
3) Design and implement a prototype that anonymizes the patient data
4) Develop a gold standard corpus for evaluation
5) Test and evaluate the prototype results to iteratively improve the de-identification algorithm

1.4 Minimum Requirements and Deliverables

1) Identify potential de-identification processes or techniques suitable for the design of the prototype
2) Build a bespoke prototype that anonymizes the PHI categories within one dataset
3) Produce the resultant data in an appropriate format suitable for research purposes

1.5 Further Enhancements

1) Run the prototype on more than one dataset to evaluate its consistency of performance
2) Enhance the de-identification algorithm to include other feasible types of PHI categories that need to be anonymized
3) Enhance the algorithm to incorporate more advanced techniques to improve de-identification

1.6 Project Schedule and Progress Report

To ensure successful progress of the project analysis and delivery, weekly project meetings were scheduled with the supervisor from May. Prior to this, meetings were held to help with understanding the project requirements.

A project plan and a blog site have also been set up to track the progress of key milestone actions and to facilitate discussions. A copy of the project plan can be viewed in appendix C. A presentation was also prepared and delivered at the progress meeting in July (see appendix E).

Working to the Project Plan

In terms of keeping to the project plan detailed in appendix C, all milestones were on schedule at the point of interim report production except one: the acquisition of the dataset needed to begin the design of the de-identification algorithm. Up until then, only a review of the literature had been carried out. To begin designing the de-identification algorithm, a dataset was supposed to have been acquired by the end of May for data review. To ensure that milestones were still achieved as predicted, extra hours were put in to adhere to the schedule. Sufficient time had also been allocated at each stage of the project plan for such contingencies, and this flexibility allowed milestones to be achieved on time.

1.7 Project Methodology

Two research methodology approaches have been chosen: Rapid Application Development (RAD) and the CRISP-DM (Cross Industry Standard Process for Data Mining) model[6]. The rationale for choosing these approaches is that they seem the best fit for this project compared with other models. At the commencement of the project, although the minimum requirements and aims were understood, the process flow was still flexible and fluid; it was understood that to achieve the best outcome it would be necessary to improvise and refine as knowledge and experience grew. As a result, several other models were rejected because they did not provide the outlook required for this project, for the reasons discussed below.

Since there were no firm and exacting requirements, the waterfall methodology was rejected. The spiral methodology[7] was another consideration at project commencement, but since the project learning curve fits better along the lines of data mining, and the spiral method did not allow the problem to be tackled fully, this approach was also disregarded as probably not the best fit. After spending a considerable amount of time addressing the fundamental question of the project, it was identified that bridging the gap between the problem space and building the initial prototype would require a number of iterations to understand the issues and draw conclusions.

Of all the methods and models considered, the CRISP-DM framework[8] seemed to be the best fit in this case, as it flows through the entire process. The CRISP-DM model itself is cyclic in nature and has several stages[9] (see Figure 1). These can be broken down into six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment[10]. The outcome of each phase determines the next task or phase to be reviewed, and each phase fits well with the project tasks to be undertaken.

Figure 1: CRISP-DM Cycle

In Figure 1, the outer cycle represents the cyclic nature of text mining, and the arrows indicate the direction of flow between phases. In relation to this project, the phases correspond to how the data is first acquired and then passed through the de-identification prototype; the anonymized free text obtained is evaluated and the prototype progressively improved. The process itself continues after a solution has been deployed. For this project all stages were conducted, with the exception that the modelling stage, in which a prototype to anonymize the data is built, incorporated the Rapid Application Development (RAD) approach (see Figure 2). This method uses structured techniques and prototyping, verified at every stage, to refine the data and prototype models. These stages are repeated iteratively for further development[11].

Figure 2: RAD Process Model

The RAD approach has four phases: the requirements phase, used to understand the text that needs to be de-identified; the design phase, to incorporate appropriate programming concepts and tools for de-identification of the text; the development phase, where the actual program code is written; and finally the cutover/testing phase, which involves testing the de-identification algorithm to ensure that the program functions appropriately.

1.8 Thesis Roadmap

In Chapter 2 of this report, background research on the methods suitable for the implementation of the system is presented. The review covers several relevant areas, namely natural language processing, de-identification tools, PHI guidelines and the programming language chosen to build the system. In Chapter 3 we discuss the dataset acquisition and the pre-processing required for the dataset, followed by a discussion of the PHI categories that needed to be identified, the actual modelling of the program and the design considerations. Chapter 4 covers the implementation phase, describing the rules applied to identify patterns, followed by a discussion of the actual deployment of the tool. In Chapter 5, various methods are tried and tested to evaluate the system; the evaluations are divided into three main sections, namely evaluation with a gold standard corpus, evaluation with test corpora, and finally evaluation against a different de-identification system, GATE (ANNIE). Finally, an overall summary and conclusion, along with future work, are discussed in Chapter 6.

Chapter 2 Background Research

Before developing the prototype it is important to investigate the tools and methods that have been employed previously. This chapter therefore focuses on the different tools and methods used and how they can be incorporated into the development of the prototype. It was also necessary to identify the difficulties and challenges faced by earlier de-identification methods.

2.1 Literature Review

The initial concern was the possibly large volume of academic research papers and sources of information. An initial search for publications was carried out based on keywords such as "de-identification of patient records", "de-identification", "anonymization of patient data" and "automated anonymization tools for free text". The literature search returned more than 220 journal and conference publications, but many were discarded because they were either narrowly focussed on medical content, too similar to one another, or concerned mostly with structured data de-identification rather than free text. Through backward and forward reading, relevant publications found through queries in PubMed, the ACM Digital Library, conference proceedings, Google Scholar and BMC Medical Research Methodology were used to analyse the techniques and methods employed. This comprehensive view of previous research on the topic provided a clearer scope from which the project took shape.

2.2 Background

Over the years, Electronic Patient Record (EPR) systems have primarily been used to provide and store information about patients. The EPR system behaves as both a tool and a legal document for use by clinicians and health personnel in the provision of health care[12]. These medical records account for a large archive of data; by applying machine learning, researchers can use this information to provide diagnoses given symptoms, predict patient outcome given clinical findings and identify particular trends[5], thus helping doctors to better match treatments to patient profiles.

The Leeds Cancer Centre database has recently been reviewed by Cancer Research UK Informatics and identified as one of the UK's leading cancer informatics resources. It has a large collection of electronic patient records maintained within a clinical information system known as the Patient Pathway Manager (PPM). The PPM database is used primarily to support the delivery of care for patients diagnosed with suspected or confirmed cancer, and also has widespread benefits for improving clinical care, research and audit. This knowledge base consists of annotations and letters written in free text format by clinicians, reflecting on the patient's health from diagnosis to death or discharge. For researchers interested in studying clinical trends, these medical records provide a large archive of data. In addition, separate studies have shown the importance of medical discharge summaries in determining trends in adverse events[13] and disease incidence[14]. This provides a strong incentive for the use of Natural Language Processing (NLP) [also sometimes referred to as Medical Language Processing in this domain][15] and event profiling to identify clinical data, which acts as a rich source of information to enhance medical knowledge.

With the increase in the use and adoption of such EPR systems, greater amounts of readily accessible patient data are available for use by researchers and clinicians for several operational purposes. Within the PPM database, the notes containing patient history and discharge summaries act as a valuable source of information for identifying key stages of disease progression, such as cancer recurrence and referral to palliative care (i.e. cessation of anti-cancer treatment and referral to end-of-life care). This data may contain patient-identifying information, and before such data is made available, protecting patient confidentiality becomes a requirement that should not be overlooked or understated[16]. To allow researchers to access this data, a protocol that provides a secure mechanism guaranteeing the protection of privacy needs to be adopted.

The UK government has set up regulations for disclosing patient information: the data can be made available if the NHS code of practice for protecting patient confidentiality has been applied before sharing. According to the NHS regulations, patient data can be made available if the patient has provided consent for the use of their data or if the data has been anonymized. Here anonymization refers to information which does not identify an individual directly; it also refers to information about a patient that cannot be identified by the recipient, or for which the probability of the patient's identity being discovered is extremely small[17]. In compliance with these regulations, access to the data within the PPM database is currently restricted to Leeds Teaching Hospitals Trust (LTHT) NHS clinicians and staff, owing to concerns regarding patient confidentiality and information governance.

To make this data accessible for research purposes, so that researchers can apply data mining techniques to identify patterns in it, the records first need to be anonymized. This process of anonymizing medical records is of high importance in the human life sciences, mainly because de-identified text can be made available for reuse and access by many non-hospital researchers[18]. Such sharing of clinical information is a potential benefit that facilitates analysis of the data obtained. Additionally, the information can be used to track physician performance, monitor the efficiency of alternative courses of treatment, and provide feedback alerts relative to the course of care for a particular patient[19].

Although the UK NHS has regulations on maintaining patient privacy, guidelines or standards indicating the confidential information that must be removed from the notes are also required. To implement procedures for maintaining the security and privacy of identifiable health information, the US Health Insurance Portability and Accountability Act (HIPAA) has regulated standards to address them. These standards have been used widely for de-identification purposes in earlier US research and elsewhere. UK researchers have also adopted the HIPAA list of identifiers, as there is no formal UK legislative equivalent to the US HIPAA act. These standards are also the basis for this project.

De-identified information, as defined under the HIPAA Privacy Rule, is "information that does not allow an individual to be identified because specified identifiers have been removed"[20]. These specified identifiers refer to the PHI data contained within a patient record text or file. Thus, data to be made accessible to others needs to be de-identified[21] and should contain only anonymized data, and this de-identified text has to conform to the guidelines set out by HIPAA[22].

In the US, the guidelines for protecting the confidentiality of health care information were established by HIPAA in 1996[22]. To conform to these guidelines, records have to be de-identified, and they are said to be de-identified when the risk of identifying the patient is very small. This risk can be calculated and documented for all records, or the safe harbour approach can be used to show that every record is free of the 18 types of PHI categories listed in the law. These PHI categories are listed in Figure 3. Medical researchers must obey these rules, which state that all research involving human subjects requires informed consent. However, a study can be exempt from the rule if it is:

"Research involving the collection or study of existing data, documents, records, pathological specimens or diagnostic specimens and if these sources are publicly available or if the information is recorded by the investigator in such a manner that subjects cannot be identified directly or through identifiers linked to the subjects"[22].

Figure 3: Legal Guidelines for PHI data, HIPAA 2003

To conform to these guidelines there are several approaches to the de-identification of patient records, and they generally involve two steps: (1) the identification of personal health information within the medical text, and (2) the coding and/or replacing of such references with values that cannot be reversed by unauthorized personnel[23].

One possible way to anonymize this data involves scanning the dataset of patient records line by line to identify all occurrences of PHI. This is often done manually and requires significant resources.

Here, medical practitioners or doctors who have access to the data anonymize it, and the anonymized data can then be used for research purposes by other clinicians or researchers. Dorr et al.[24] have evaluated the time and cost of manually de-identifying free text notes and concluded that it is extremely difficult and time-consuming to exclude all PHI categories required by HIPAA[25], especially when a substantial number of documents needs to be reviewed. It has also been estimated that manual de-identification performance tends to be error-prone and highly variable[26]. Therefore, for large-scale de-identification, automated software fine-tuned to the structure of the text, the content of the patient records and the specific requirements of a particular project is required[25]. Aware of these issues, authors have investigated several automated de-identification methods. An overview of research in this domain, covering the systems, tools and techniques used, is given next.

2.3 Related Work

Several automatic de-identification systems for free text documents have been built and evaluated to remove patient identifiers from pathology reports[21, 27-29] and from databases[30-32]. Although the de-identification tools mentioned here are aimed at the identification and removal of person names (such as patient names or healthcare provider names), most systems try to identify all PHI categories. Some have also been extended to identify clinical data and disambiguate it from PHI, to prevent misclassifying clinical data as PHI and removing clinically relevant information during the de-identification process. One system even removed all content except clinical data[28]. Most approaches to free text de-identification are either machine-learning based systems[18, 33] or pattern-based and lexicon systems[23, 25, 28, 29, 34]. A few relevant systems and methods used by different researchers are discussed below.

Text de-identification based on Machine Learning Methods

Machine-learning systems generally use labelled examples to automatically search for patterns. For example, a human annotator or a system such as GATE[35] would tag all words in a training set of free text patient records with a PHI tag or a non-PHI tag, similar to the system designed by Aramaki et al.[36]. Here the extraction of PHI is treated as a Named Entity Recognition (NER) task, a common approach in several systems[18, 37, 38]. Features of the free text, such as digits, the term itself, the part of speech and dependencies, are used to find rules that can distinguish PHI content from other text. Taira et al.[23] have also developed an algorithm using similar concepts of both lexicons and semantic constraints to assign a probability to a given word.

The semantics are determined based on the hypothesis that strong associations exist between concepts of such classes. Some systems de-identify medical discharge summaries using a support vector machine (SVM) as the machine-learning algorithm[33, 37, 39]; in this approach, the SVM attempts to find a separating hyperplane between the positive (tagged) and negative examples[40]. Machine learning based approaches have the advantage that they can automatically learn to recognize complex PHI patterns. But applications based on supervised machine learning methods require a large corpus of annotated text to train the algorithms, sometimes demanding significant work by domain experts[16].

Text de-identification based on Pattern based Systems

Another approach to de-identification is based on pattern matching or rule-based methods. Almost all systems based on machine learning also add some pattern matching to extract features for classification or to detect specific types of PHI[16], such as telephone numbers. A more detailed study of this area follows.

MeDS

The Medical De-identification System (MeDS) is a software tool developed at the Regenstrief Institute, used to identify patient information in various kinds of clinical documents and free text reports and to remove such identifiable content[23]. A clinical document in plain text format is provided as input to the MeDS tool, which produces a text file containing the same clinical document in de-identified form. The MeDS software is especially used to process free text in the Health Level Seven (HL7) message format. MeDS is written in the Java programming language[41] and uses close to 50 regular expressions, patient-specific data and dictionaries to identify names of persons and locations and remove them. First, it identifies numeric patterns such as medical record numbers, zip codes and addresses. It also uses different methods to detect tags that likely indicate patient or provider names, with the help of word lists and algorithms similar to those described by Thomas et al.[21]. The word lists are obtained from two sources, namely the UMLS index of words and the Ispell dictionary, which are further compared with the MeSH vocabulary. Finally, MeDS detects typographical errors and name variants in a message[41]. This is useful, as misspellings of words and names can cause problems for de-identification and natural language processing systems[42, 43].
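To illustrate the kind of numeric patterns such tools match, two hedged examples in Java follow. These are illustrative patterns of the sort described, not MeDS's actual regular expressions; identifier formats vary between institutions.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative numeric-identifier patterns of the kind pattern-based systems
// use. These are not MeDS's actual expressions; formats vary by institution.
public class NumericPatternExample {
    // A US ZIP code: five digits, optionally followed by a four-digit extension.
    static final Pattern ZIP = Pattern.compile("\\b\\d{5}(-\\d{4})?\\b");
    // A hypothetical medical record number: an "MRN" label followed by 6-8 digits.
    static final Pattern MRN = Pattern.compile("\\bMRN[:# ]?\\d{6,8}\\b");

    public static void main(String[] args) {
        String text = "Patient MRN:1234567 resides in ZIP 90210.";
        for (Pattern p : List.of(MRN, ZIP)) {
            Matcher m = p.matcher(text);
            while (m.find()) {
                System.out.println("PHI candidate: " + m.group());
            }
        }
    }
}
```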

The software succeeded in de-identifying a wide range of medical documents from numerous sources, thereby maintaining their usefulness for clinical research, but it still comes with some limitations. The system is still in development and not yet publicly available, and it cannot be expected to work efficiently with a new dataset, as the input file formats currently supported might differ significantly. However, for the design of our prototype, some of the techniques it adopts, such as the use of dictionaries, were useful.

De-Id

De-Id was originally developed at the University of Pittsburgh and is a commercial tool that uses rule sets, heuristics and supplemental dictionaries to identify the presence of any of the HIPAA-specified identifiers within the patient record text and remove them[27]. De-Id uses a set of rules, pattern matching algorithms and user-customizable dictionaries, replacing patient identifiers with specific tags to preserve the usability and readability of the de-identified report[27]. The pattern matching algorithms detect PHI such as telephone numbers and zip codes. For person names, it looks for a straight match to any member of its name lists within the text of a report. It also makes use of the UMLS Metathesaurus to help identify valid medical terms that should be retained within the document.

The system's de-identification and pattern matching are based on US dictionaries of names and places, and the format of address and zip code identification differs from the postcode and address format used in the UK; this would require changes to the rule settings within the software program code. The system was downloaded and checked for adaptability, but one consideration was that the code was written in Perl script, which was a new concept to learn; also, the input text file format would need to be changed to the system's acceptable form. However, the use of user-customizable dictionaries and some of the algorithmic techniques employed closely relate to how our system can be designed.

MedLEE

The Medical Language Extraction and Encoding System (MedLEE)[44] is a general-purpose natural language processor that has been used to de-identify outpatient clinic notes[45]. MedLEE uses lexical and semantic rules to regularize terms and extract only the valid medical concepts in a report, discarding all other terms that are non-medical. Morrison[45] evaluated this by running text extracted from the database through MedLEE to obtain output containing only parsed concepts tagged with XML[45].

This existing NLP system can thus de-identify clinical notes to some degree, producing the same tagged, structured output that has demonstrated utility in other contexts[45]. However, such a system would prove less useful for our de-identification purpose, since we wanted to retain all content except the PHI categories within the free text. It could, however, be useful for extracting only medical concepts after de-identification of the records.

Perl-based de-identification software package

The Perl-based de-identification software package[25] is based on lexical look-up tables, regular expressions and simple rules to identify PHI in free text documents. The system uses four types of look-up table: dictionaries obtained from the MIMIC II database, words that are in the UMLS Metathesaurus or a spell-checking dictionary, and a table of keywords and phrases that act as indicators for PHI ("Mr.", "Dr.", "Hospital", "Street" etc.). Separate processes are used to replace PHI in medical reports, with regular expressions and a combination of table look-ups used to identify the PHI. This system is closely related to the development of our algorithm; a short sketch of the indicator-keyword idea follows.
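As a hedged illustration of the indicator-keyword idea, the sketch below treats a title such as "Mr." or "Dr." as evidence that the following capitalized token is a person name. The pattern, keyword list and tag format are assumptions for illustration, not the Perl package's actual rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of keyword-indicator matching: a title such as "Dr." or
// "Mr." suggests that the following capitalized token is a person name. The
// keyword list and tag format are assumptions for illustration only.
public class IndicatorExample {
    static final Pattern TITLED_NAME =
        Pattern.compile("\\b(Mr|Mrs|Ms|Dr|Prof)\\.?\\s+([A-Z][a-z]+)");

    public static String tagNames(String text) {
        Matcher m = TITLED_NAME.matcher(text);
        // Keep the title for readability but replace the name itself with a tag.
        return m.replaceAll("$1 <<NAME>>");
    }

    public static void main(String[] args) {
        System.out.println(tagNames("Seen by Dr. Smith at the Hospital."));
        // -> Seen by Dr <<NAME>> at the Hospital.
    }
}
```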

Medtag using Semantic Lexicon

The MEDTAG framework is a de-identification system consisting of a semantic lexicon specialized in medicine and an original rule-based word-sense and syntactic tagger[32]. The scrubber first tags the words in the text with part-of-speech tags. A word-sense tagger then identifies a specific word in the phrase as an identity marker belonging to a particular class. Additionally, rules applied to short strings of tokens (up to five words) helped resolve ambiguities that remained after the two sets of tags were created[32]. Another use of lexicons is to identify proper names using a dictionary of common and clinical terms obtained from the UMLS Metathesaurus, as described by Thomas et al.[21]. But that system is limited: it removes only proper names from medical reports and does not attempt to remove other categories of PHI from medical documents.

Scrub Systems

Several scrub systems have been developed for de-identification that use multiple PHI detection algorithms, such as rule-based and dictionary-based algorithms, to categorize and label PHI in text reports[28-31]. Such a system employs several parallel PHI detectors, each tasked with identifying a specific category of PHI; for example, there are distinct detectors for first names, last names, full names, addresses, cities, states and countries. Detector precedence is based on the number of entities a detector is assigned to detect: the 'Location' detector, for example, is tasked with detecting a city, state and country pattern, and therefore has higher precedence than the city, state or country detectors. The system also employs strategies to prevent reverse scrubbing. But scrubbing systems are very specific to a particular database and would require the algorithms to be modified appropriately. Sometimes these systems lead to over-scrubbing of records, losing the readability of the document, which is one of the concerns to be kept in mind while designing the de-identification algorithm for our project.

Critical Analysis of Tools

Although some systems combine both approaches to identify different types of PHI, the majority rely on pattern matching, rules and dictionaries, and a few rely on machine learning techniques. It is possible to adapt either type of system, but the style of adaptation differs. As mentioned earlier, adapting a machine-learning based system means adding training examples and modifying the set of text-based features; this requires someone to first label the examples, and then a large number of iterations to evaluate the effect of different features. Given that we are focussed on receiving information from different institutions and specialty areas, the adaptation of a lexicon and pattern system[25], which emphasizes extending word lists, adding new word lists, and adding and removing regular expressions, appeared more appropriate for our needs.

For the most part, new words and patterns can be added independently of each other, so that the effects of a change are predictable to the expert. This type of adaptation can require more time from the expert, but it presents the possibility of quickly introducing additional domain knowledge without having to re-train the system each time a new clinic is introduced[40]. It is also sometimes difficult to judge precisely why a machine learning algorithm committed an error; for example, if a PHI pattern goes undetected by the application, adding more training data does not ensure that it will be corrected. The lexicon and pattern approach, on the other hand, uses a built collection of word lists, regular expressions and heuristics. This approach might need experts to spend time creating and organizing the word lists and patterns, but this characteristic can be an advantage, because the expert can include knowledge of the field that goes beyond the available training examples or beyond a fixed set of local features[40].

There were also some systems that extracted and retained only concepts matched in UMLS, MedLEE or other clinical vocabularies and ignored the rest of the data. This is sometimes inefficient, as the complete data content is lost. This thesis focuses on maintaining the readability of the text even after de-identification, so direct use of such systems would be inefficient. Ultimately, whatever system is used, its perceived merits and potential shortcomings depend on the purpose of the system, how it is used, and whether it meets and satisfies the needs of the user, whether that is an individual, an organization or a health provider.

2.4 Text Mining

Text mining is a method used to discover patterns in large collections of unstructured text[46]. It does not concern unknown information; rather, it deals with finding known information within a larger collection of data. The process is used to create new information or to find patterns by going through large collections of text, and it is one of the main methods used in the biomedical domain. With the size of textual data increasing rapidly, it is not feasible for humans to process such large collections of text, and text mining is therefore used to automate the process of knowledge discovery and pattern finding[47].

Named entity recognition is a text mining method whose goal is to identify, within a collection of text, all instances of a name of a specific type (for example, all gene names and symbols within a collection of MEDLINE abstracts)[48]. This project similarly aims to identify PHI content within free text medical records with the use of dictionaries. One of the difficulties in identifying information in text is lexical ambiguity: a word may have multiple meanings, making it hard to determine the correct one. Such ambiguity can be resolved by considering context or by determining probabilities[49]. One method of addressing it is part-of-speech (POS) tagging, the process of marking up each word in a text. POS taggers provide the grammatical role of a word based on its contextual information[50]. Two accurate POS taggers are the Brill and Stanford taggers. In this project, the Stanford POS tagger has been utilized to see how well it helps resolve issues of lexical ambiguity (a more detailed discussion is provided in section 5.3.3).
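As a hedged illustration of how POS tags can separate two senses of the same token, the sketch below uses the Stanford tagger's Java API; the model file path is an assumption that depends on the tagger distribution used, and the output shown is only indicative.

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

// Minimal sketch of POS tagging with the Stanford tagger. The model path is an
// assumption; model files ship with the tagger distribution under various names.
public class PosExample {
    public static void main(String[] args) {
        MaxentTagger tagger =
            new MaxentTagger("models/english-left3words-distsim.tagger");
        // "May" as a modal verb versus "May" as a proper noun (a month or a name):
        System.out.println(tagger.tagString("May I see the doctor in May ?"));
        // Expected output along the lines of:
        // May_MD I_PRP see_VB the_DT doctor_NN in_IN May_NNP ?_.
    }
}
```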

GATE

GATE has been developed by the University of Sheffield and is an open source text analytics software tool able to process a wide range of text data[51-53]. It is an architecture, a framework and a development environment for language engineering[52]. GATE follows a component-based model, the components being of three types: (1) Language Resources, which represent lexicons, corpora or ontologies; (2) Processing Resources, which contain common NLP tasks, e.g. a tokeniser, part-of-speech tagger and dictionary, and which, grouped together, are known as ANNIE (A Nearly-New Information Extraction system), an information extraction (IE) system distributed freely with GATE[51-53]; and (3) Visual Resources, which enable visualization and editing of components.

ANNIE can be used to tokenize the input texts, split sentences, look up tokens in its dictionary, produce part-of-speech tags and, finally, run the NE transducer to identify entities based on JAPE grammar rules. ANNIE is only capable of annotating the text file, so a program would need to be built to read the annotated text from GATE and then remove all annotated data. While exploring the possibility of developing such a program, Java libraries and features were examined, and at this stage the author realised that using programming concepts to formulate regular expressions would be much simpler than using GATE. The ANNIE system has, however, been used for a comparative study alongside our developed algorithm (ANNIE's results are discussed in section 5.7).

2.5 Additional Support for Prototype: Use of Java

To allow the project to be developed iteratively, the Java programming language has been used to build an algorithm incorporating rules and dictionaries to identify PHI data.
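To give a flavour of the Java facilities relied on, here is a minimal hedged sketch that loads a name dictionary from a text file and compiles it into a single regular expression; the file path and class name are illustrative assumptions, not the prototype's actual resources.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch: compile a dictionary (one entry per line; the path is assumed) into a
// single alternation pattern, so the whole list is matched in one pass.
public class DictionaryPatternSketch {
    public static void main(String[] args) throws IOException {
        List<String> names =
            Files.readAllLines(Paths.get("dictionaries/firstnames.txt"));
        String alternation = names.stream()
                                  .map(Pattern::quote)
                                  .collect(Collectors.joining("|"));
        Pattern namePattern = Pattern.compile("\\b(" + alternation + ")\\b");
        System.out.println(namePattern.matcher("Seen by Susan today.").find());
    }
}
```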

Chapter 3 Design of Solution

In this chapter we discuss the acquisition of the dataset and the preparation of the corpus required for the de-identification algorithm. An overview of the de-identification requirements and design considerations is also given.

3.1 Data Understanding

To build the prototype, we needed to develop a de-identification algorithm, and for this it was necessary to analyse possible PHI indicators within the text to be de-identified. A corpus of free text patient records was therefore required.

3.1.1 Acquisition of the Data Set

At the onset of the project in March, it was determined that the Leeds Cancer Centre would provide the support required for the project. Discussions with them revealed that several levels of privacy clearance would be required before they could provide us with access to the data. Although this was anticipated, they agreed to follow up and provide access as soon as possible. Almost a month passed and still no data was available, so we proposed a workaround: for development purposes they would release a small set of the data files, with the actual PHI, such as patient names (determined to be the main PHI, based on the background reading in chapter 2), replaced with randomly generated names. However, as time passed we still received no updates or data for almost two months.

By the end of May, with no patient records at hand, it was established that a different approach would have to be sought: a dataset would be obtained from a different source and used as the development dataset for designing the de-identification algorithm. If and when the Leeds cancer dataset was provided, it would be used for evaluation and exploration to further improve the algorithm design.

An initial challenge was to identify a source that could provide a quantifiable amount of patient data for the project. This led us to the NLP research datasets provided by i2b2 (Informatics for Integrating Biology and the Bedside)[54], but they could only release already de-identified discharge summaries. Since the aim of the project is to understand and remove the PHI categories within free text, an already de-identified dataset would not by itself be useful. While working on establishing a source, we came across the JISC (Joint Information Systems Committee) project[55] called "e-health GATEway to the clouds"[56], a research project carried out by researchers at the University of Leeds. For their project they were using a dataset of 1984 free text patient records consisting of fictional but realistic e-health records created by 1200 Leeds medical students who had used the TPP (The Phoenix Partnership) GP record system for training from 2008 onwards. This seemed a good resource for the project, as the dataset contained free text patient records that had not previously been anonymized. After presenting our need for access to such data and agreeing to maintain the records securely and only for the stated research purpose, the JISC dataset was finally released to us in mid-June.

3.1.2 JISC Corpus

The JISC dataset provided to us was saved as a tab-delimited file and contained information stored under 9 fields, namely: RowId, Title, FirstName, MiddleName, Surname, Nickname, FormerSurname, Sex and Notes. The dataset held 1984 free text patient records comprising approximately 149,991 words. Exploring the dataset revealed that the RowId field, containing the row number, and the Notes field, containing the free text medical information about the patients, held the crucial information required for the project. On reviewing the dataset we realised it was rich in PHI content; on closer examination the corpus was found to contain three types of PHI, namely person names, locations and dates. These three categories are the most important PHI to be addressed by our de-identification algorithm.
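Given this field layout, the two relevant columns can also be pulled out programmatically; the following Java sketch is a hedged, illustrative equivalent of the Excel-based extraction described in section 3.2, with file names assumed and one record per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of extracting the RowId and Notes columns from the tab-delimited JISC
// file. RowId is the first and Notes the ninth of the nine fields; the file
// names are illustrative assumptions, and one record per line is assumed.
public class ExtractNotes {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("jisc.txt"));
             PrintWriter out = new PrintWriter("input_corpus.txt")) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t", -1);    // keep empty fields
                if (fields.length == 9) {
                    out.println(fields[0] + "\t" + fields[8]); // RowId, Notes
                }
            }
        }
    }
}
```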

3.2 Data Preparation: Corpus Creation

In general, the beginning of a patient record contains information in a structured pattern (for example, Name, Date of Birth etc.). We assume that it is simple to remove content from structured data with field names: since the field names are clearly demarcated, the data associated with each field is easy to remove. The difficulty arises with the free text of the record (referred to within the dataset as the Notes field). The notes containing the PHI are generally written by clinicians, doctors and other staff, and our project is aimed at identifying and removing PHI content from these notes. For the purpose of the project, therefore, only the RowId and Notes fields have been considered relevant. Since the JISC corpus contained several other fields, these needed to be excluded: the corpus was imported into Excel and, with the use of filters, the RowId and Notes fields were extracted and saved into a separate text file. This file has been used as the input to the de-identification algorithm.

3.3 Data Quality

For designing the algorithm it is necessary that the free text notes within the dataset have some consistency, allowing patterns to be identified from which to develop expressions. Since the records had been entered by medical students rather than actual clinicians, a few discrepancies were expected. While scanning through the file, some records appeared not to have been written very seriously, but such cases were very few, and the overall content of the file was well written, so these few discrepancies were ignored. The data in the file was challenging enough, in terms of PHI content, for developing the system. However, since we were aiming for a more generalized de-identification algorithm usable on other, similar datasets, some records were chosen at random and other PHI content was manually inserted into them to enrich the corpus.

3.4 Modelling

To develop the de-identification algorithm, the developer is required to craft different modules to account for the different PHI categories. To better fit the project requirements, the RAD approach has therefore been adopted for the modelling stage of the CRISP-DM cycle (refer to section 1.7), as it allows for iterative development of the de-identification algorithm. The next section discusses the requirements and design considerations for developing the initial version of the prototype.

3.4.1 Requirements

The first step was to identify the PHI categories to be addressed by the de-identification algorithm. As discussed previously, the HIPAA guidelines list 18 PHI categories that are required to be removed. In compliance with these guidelines for protecting the confidentiality of health care information, the following three main PHI categories were identified in the JISC dataset as needing removal:

- Person names
- All geographic subdivisions smaller than a state
- All elements of dates (except year) related to an individual (including dates of admission, discharge, birth and death; for individuals over 89 years old, the year of birth must also not be used)

The second requirement was to find appropriate dictionaries for the de-identification algorithm, since one of the most commonly used features in the reviewed literature is the use of dictionaries in the de-identification process (see section 2.3).

Finally, a gold standard corpus was required in order to evaluate the de-identification algorithm. Initially, evaluation was done manually by scanning and checking the output file, but as the algorithm was developed iteratively it became tedious to check it manually several times over. A gold standard corpus makes it possible to see the results of modifications during iterative development: it can be used as a benchmark against which results are tested and compared to measure the performance of the de-identification algorithm.

3.4.2 Design Considerations

Before developing the de-identification algorithm, certain design considerations had to be addressed; these are discussed in this section. One of the most important was to ensure the readability of the anonymized output. The output file is of utmost importance for our de-identification project, since the anonymized records will be used directly in other medical research. Thus the extraction method[25] of removing only PHI and keeping all other information intact is required. Simply identifying the PHI categories and deleting them from the input file was judged inadequate, because it would affect the readability of the output content. A further step was therefore required: replacing each identified PHI instance with an anonymous tag.

For the anonymous tag, a suitable tag name and representation format were required. A generalized tag name such as REMOVED [5] can be used, but to improve the readability of the text, tag names appropriate to each PHI category needed to be considered. Including a bare tag name alone would still confuse the reader, so each identified PHI required a tag name with a suitable representation format. One option is the use of XML-style tags; for our purpose, however, full XML was not required and simple tags have been used instead. After careful analysis, regular operators such as /, \, (), [] and < > were rejected as delimiters because they can also appear in the notes; for example, "the medication was advised to be taken 3 times/week" contains the / operator. Instead, << followed by the tag name and >> has been used to replace each identified PHI category.

Another design consideration was that each PHI category should be written as a separate module or routine, so that the algorithm could be improved effectively at each iterative stage. Separate modules also allow each module to be incorporated and tested independently, minimizing errors and debugging time.

3.5 Chapter Summary

This chapter discusses the dataset obtained for the development of the de-identification algorithm. It also sets out the requirements and design considerations we need to bear in mind while developing the prototype. One of the important lessons from this chapter is that it is advisable to ensure the data is available at project commencement, especially when the entire project relies on the dataset. In practice, however, it is not uncommon for data analytics research projects to encounter problems with access to data. Although we eventually received the JISC dataset to work with, the industrial partner had originally promised to make it accessible to us earlier; the data was not actually made accessible until halfway through the project term.

Chapter 4
Implementation

In this chapter, we present the development techniques used to build the de-identification algorithm. Based on the requirements, the implementation of the dictionaries and the development of the gold standard corpus are addressed in the earlier parts of this chapter. Taking into account the design considerations, the de-identification algorithm has been developed. The algorithm uses regular expressions and dictionaries to first tag all identifiable PHI categories and then replace all tagged instances with an appropriate tag name. A description of the development is provided later in this chapter.

4.1 Implementation of Requirements

Taking into consideration the three requirements set out in section 3.4.1, the following have been addressed to fulfil them.

4.1.1 PHI Categories

The following three main PHI categories were identified in the JISC dataset as needing removal:

- Person names
- All geographic subdivisions smaller than a state
- All elements of dates (except year) related to an individual, including dates of admission, discharge, birth and death; for individuals over 89 years old, the year of birth must not be used

However, other PHI categories could also commonly occur within the patient records of other datasets. Before considering these categories, slight modifications to the PHI category names with respect to the HIPAA guidelines had to be made, because a category such as "medical record number" in the US-oriented HIPAA guidelines corresponds to the NHS number in the UK.

Along with the category name modifications, the de-identification algorithm has been developed to also consider the following PHI categories:

- Telephone number
- Fax number
- NHS number
- National Insurance number
- Electronic mail address
- Web URLs
- Internet protocol address
- Postcode

During the iterative evaluations, other specific information was found, beyond the PHI categories mentioned by HIPAA, that could narrow down the subset of patients and support identification. A few additions to the PHI categories are listed below:

- Hospital names
- Ethnicities/nationalities
- Event dates

All of the PHI categories listed above have been considered in the development of the de-identification algorithm.

4.1.2 Dictionaries

The following look-up dictionaries were created for use by the de-identification algorithm:

- Known PHI dictionaries based on the known names of patients stored in the dataset. The JISC corpus provided to us included a list of full patient names and nicknames (where given) for each patient record. To obtain the list of person names, we used Excel to filter, extract and save, in alphabetical order, only unique single occurrences of the FirstName, MiddleName, Surname, Nickname and FormerSurname fields from the original dataset file. This permitted direct matching of (correctly spelled) full or partial patient names.
- Potential PHI dictionaries consisting of popular generic female and male names [57].
- Potential PHI dictionaries consisting of location names within the UK obtained from GeoNames [58], plus possible hospital names.
- A list of names classified as ambiguous if they were found on a list of standard English words obtained from Atkinson's spell-checking oriented word lists [59].

- Other potential PHI look-up tables containing keywords and phrases that act as context clues for names. Keywords such as "Mr.", "Dr.", "Miss" and "Mrs." and phrases such as "partner", "son" and "married to" served to identify adjacent names.

Each dictionary is stored as a plain text list with one word per line. All dictionaries were kept separate from the algorithm itself, which allows the lists to be changed and supplemented for different datasets: patient names and local location names for free-text patient records obtained from a different medical centre can easily be substituted into the dictionaries.

4.1.3 Gold Standard Corpus

A gold standard corpus of annotated data is useful for assessing the performance of systems designed to automatically identify information in texts. It also serves as a resource for system development while creating rules for text data [60]. Having a gold standard corpus against which the results of every iterative stage can be compared reduces the effort of manually testing the algorithm rules. The gold standard corpus used for the study can also considerably help in identifying patterns that can be incorporated as expressions in the algorithm, or rules that might need to be relaxed if they are overfitted.

A gold standard corpus is generally developed by manually labelling and classifying the PHI categories within a text file. Development by a single individual has been shown to give low precision; however, having two or more members independently annotate the same text, followed by the use of a de-identification algorithm to locate any further entities that may have been missed, has been shown to improve results [25, 61]. A similar approach has been used for developing the gold standard corpus in this thesis. Most of the development and improvement of our de-identification algorithm rested on the results obtained by comparing its output with the fully identified gold standard corpus. The process is explained in detail next.

Development of the Gold Standard Corpus

The gold standard corpus was developed from the JISC corpus, which consists of 1,984 free-text patient records comprising approximately 149,991 words. Of these records, 60 were chosen at random for manual enrichment to include more instances of PHI as well as text content that was more challenging to identify.

The JISC corpus contained only fictional but realistic patient data, and the PHI content to be identified consisted of non-medical terms, so review by a clinician was not a requirement. Although medical students would have been appropriate for the task, the limited time and the data privacy restriction of having them review the data under close supervision⁵ only allowed us to recruit graduates to review and annotate the corpus. For the task itself, the two graduate students chosen were briefed on the types of PHI categories to be identified based on the HIPAA guidelines. They were also asked to annotate any ambiguous terms, identifiable events and other details that could potentially be used to infer the identity of a person.

Consistency is critical for the quality of the gold standard corpus [61]. For training purposes, a sample corpus of 15 randomly chosen records was therefore provided as a text file to see how well the annotators performed at classifying PHI categories. They were asked to read the text, then identify and tag PHI categories (for example, tagging an identified person name with the <<PersonName>> tag). All relevant tag titles (see appendix H.1) were provided to both annotators. Both fared well in the training task.

For the actual gold standard corpus annotation, the full corpus was split into two almost equally sized files, each containing randomly chosen records. The review was held over two sessions. In the first session, the annotators were given the first set of records to independently review manually, labelling and classifying data into the PHI categories. For the second session, the other set of records was first pre-processed and annotated by the first prototype of the de-identification algorithm, which was designed to find all person and location names from the known dictionaries included with it. The pre-processed, tagged file was then provided to the annotators, who were asked to read it, remove misclassified PHI and label any PHI that the algorithm had missed.

The two sessions were modelled on the tasks carried out by Ogren et al. for the construction of a gold standard corpus of clinical notes [62]. They used this split comparison across two sessions to determine whether previously annotated text would improve the speed and consistency of the annotation task without introducing a bias favouring the developed system.

⁵ Supervision: monitoring the annotators while they reviewed the notes in the lab, as the files could not be provided to them outside the lab. No personal help with the review or annotations was provided during the process itself.

We have used a similar approach to evaluate whether prior annotation improved the task in session 2; this would also show whether manually re-checking a previously tagged corpus reduced the time and effort required.

Following the tasks completed in both sessions, the results obtained were concatenated and are detailed in Table 1. The Inter-Annotator Agreement (IAA) metric shown in Equation 1 was applied to the results to summarize the agreement between the annotators [61]. A relaxed approach has been used in which even partial matches are counted as a match. For example, one document contained the location name "Headingley Group Practice": Annotator 1 marked only "Headingley" as PHI, whereas Annotator 2 marked the entire name. Such cases, although very few, have been counted as matches.

                                  Annotator 2 (PHI match)   Annotator 2 (PHI non-match)
    Annotator 1 (PHI match)       -                         -
    Annotator 1 (PHI non-match)   48                        149,220

    Table 1: PHI calculation for the two annotators

    IAA = matches / (matches + non-matches)

    Equation 1: IAA (Inter-Annotator Agreement)

The IAA for the entire corpus, with all PHI categories combined, was calculated and was reasonably good. However, we observed a slight drawback in how the PHI non-matches were counted for the two sessions: an annotator who failed to tag a PHI instance simply because they had not found it while reviewing would be penalized. To overcome this, the results from both annotators needed to be compiled so as to remove the maximum number of PHI. A resolution process was therefore carried out by a third annotator, who was asked to adjudicate the differences and to look over the suggested results of both annotators [61]. A clinician⁶ with experience of medical records then reviewed the results and arbitrated all disagreements. The results were compiled into one gold standard corpus, against which the performance of the de-identification algorithm developed for this thesis has also been evaluated.

The final gold standard corpus contains a total of 741 PHI elements. The distribution of each PHI category in the gold standard corpus (after enrichment) is provided in Table 2.

⁶ Clinician: Nurse (Band 5) working at University College London Hospitals.

    PHI Category                            Total PHI Count (after enrichment)
    Person Names                            611
    Location Names (excluding Postcode)     41
    Dates                                   31
    Other dates (e.g. 14th of Feb etc.)     9
    Other PHI⁷                              40
    Year                                    9
    Total PHI count                         741

    Table 2: PHI category breakdown in gold standard corpus

In addition, based on the finalized gold standard corpus, statistics were calculated to estimate annotator performance at de-identification. These values provide a standard of performance for the de-identification algorithm, and also indicate how well the annotators performed in each session. Table 3 details the performance across session 1 and session 2 for both annotators.

                                                 Annotator 1          Annotator 2
    Session 1 (Manual Annotation)
      Total PHI Identified                       -                    -
      Total Missed                               -                    -
      Total PHI                                  -                    -
      Time                                       2 hours 10 minutes   1 hour 57 minutes
    Session 2 (Reviewing Automatic Annotation)
      Total PHI Identified                       -                    -
      Total Missed                               9                    2
      Total PHI                                  -                    -
      Time                                       59 minutes           1 hour 11 minutes

    Table 3: Calculations based on the two sessions held for development of the gold standard corpus

⁷ Other PHI: telephone number, email, IP, URL, postcode etc.

The files used in both sessions contained almost equal numbers of words. However, the file provided for session 1 contained comparatively fewer PHI instances than the file used in session 2. This is likely because records were assigned to each file at random, which happened to group more PHI-rich records into the session 2 file.

The statistical results support two observations. Firstly, manually annotating the file in session 1 took considerably longer than reading and editing the already tagged file in session 2. This was mainly because in session 1 the annotators had to pause at, and tag, every PHI instance they identified; in session 2 the number of pauses was reduced, as the annotator only had to scan the already tagged text and stop to correct identified errors, reducing the time and effort needed to proofread and correct the files. Secondly, fewer PHI instances were missed in session 2 than in session 1: both annotators missed more PHI while tagging the text entirely manually, even though that file contained fewer PHI instances overall. This showed that even a partially automated de-identification approach improved the speed and consistency of PHI identification.

The gold standard corpus has been used to evaluate the de-identification algorithm throughout its iterative development stages, which reduced the manual effort of checking the algorithm's performance. Using a gold standard corpus annotated by two people and reviewed by a third also added weight to the algorithm performance results.

4.2 Development Phase

Taking into account the design considerations, the de-identification algorithm has been developed; a discussion is provided in this section. This stage involved the actual programming and implementation of the algorithm applied in the de-identification process.

After careful evaluation of the tools and techniques, the de-identification algorithm was implemented in Java [63]. Java is a high-level programming language that is easily supported on different platforms, and several NLP libraries such as parsers and tokenizers are available on the Java platform [64], which made it a good starting point for this project. The development phase incorporates several natural language processing techniques, such as regular expressions, dictionaries and simple heuristics, to identify each PHI category.

For the development of the de-identification algorithm, two separate stages have been used. In the first stage, a tagger algorithm is defined: it identifies all the PHI categories and tags them appropriately with a PHI tag name, without removing the PHI content. In the second stage, an anonymizer algorithm is defined: it replaces the tagged data identified by the tagger algorithm with the appropriate tag name.

The two stages were used to evaluate how well the PHI content had been tagged. The tagger algorithm tags all identified PHI categories, and its output file can be used to review misclassified and missed PHI content; the file is also useful for comparison with the tagged gold standard corpus (explained in section 4.1.3). The anonymizer algorithm is then run on the tagged file to remove the identified PHI content and replace it with the appropriate PHI tag name. The resulting output file contains only de-identified free-text patient records, with all PHI content removed and replaced by suitable tags.

Researchers who had access to actual patient records were interested in both the tagged and the anonymized output. This allowed them not only to view the anonymized text but also to inspect all the tagged results and provide feedback on any PHI content they felt had been misclassified or incorrectly handled when running the de-identification algorithm on their dataset. The two algorithms designed for the two stages are discussed in detail in sections 4.2.1 and 4.2.2.

4.2.1 Tagger Algorithm

The de-identification process involves scanning the free text patient records stored in the input text file line by line, dividing each line further into individual words separated by whitespace. The algorithm uses the Java programming language, Java APIs and Java libraries [65] to perform lexical matching against dictionaries and to apply regular expressions and heuristics that identify the PHI found within the text file.

For non-numeric tokens such as names and locations, both dictionary look-ups and context checks are used to locate PHI. First, the algorithm performs a lexical match of each word in the input text against the dictionaries to identify PHI, which is then labelled with an associated tag. Second, the algorithm performs pattern matching using regular expressions that look for PHI indicators, applying context checks, patterns and simple heuristics to determine the tagging of the text. PHI instances that involve numeric patterns, such as postcodes, dates and telephone/fax numbers, are identified using regular expressions based on the format in which the data is stored in the file.
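To make the scanning and dictionary matching concrete, here is a minimal, self-contained Java sketch of this kind of first-stage pass. It is illustrative only (the thesis's actual code and expressions are in its appendices); the dictionary file layout and method names are assumptions.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    public class TaggerSketch {

        // Load a one-word-per-line dictionary into a set for fast,
        // case-insensitive lexical matching.
        static Set<String> loadDictionary(String path) throws IOException {
            Set<String> words = new HashSet<String>();
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String w = line.trim().toLowerCase();
                    if (!w.isEmpty()) words.add(w);
                }
            } finally {
                in.close();
            }
            return words;
        }

        // Scan one line word by word; append a <<PersonName>> tag after any
        // token found in the name dictionary (the thesis convention places
        // the tag directly after the identified content).
        static String tagLine(String line, Set<String> nameDict) {
            StringBuilder out = new StringBuilder();
            for (String token : line.split("\\s+")) {
                out.append(token);
                // Strip trailing punctuation before the dictionary look-up.
                String word = token.replaceAll("\\W+$", "").toLowerCase();
                if (nameDict.contains(word)) out.append(" <<PersonName>>");
                out.append(' ');
            }
            return out.toString().trim();
        }
    }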

Appendix D provides examples of the regular expressions, written in Java, implemented in the algorithm. Simple heuristics, with the help of dictionaries and regular expressions, are then used to determine the tagging of certain ambiguous text. Finally, the anonymizer algorithm is run on the tagged file to remove all the identified PHI content and replace it with only the PHI tag names. The de-identification process is illustrated in the flowchart shown in Figure 4. For the tagger algorithm, each PHI category has been addressed independently as a separate module; the PHI modules are discussed in the following sections.

Figure 4: De-identification process illustrated as a flowchart (free text notes → tagger → tagged data → anonymizer → anonymized data)

PHI Modules

As previously described, the algorithm parses the text and attempts to de-identify the relevant PHI categories listed in section 4.1.1. The following sections describe the modules used to identify and replace each type of PHI.

Person Names

Person names are the riskiest and trickiest category to identify within the text. The de-identification of person names involves five basic mechanisms, described below; an illustrative sketch of two of them follows at the end of this subsection.

In the first mechanism, the algorithm uses lexical matching against the known-names dictionary, which contains all the names extracted from the name fields provided in the JISC dataset. Since these known names appear within the free text patient records themselves, they can be extracted by direct matching: the lexical matching technique scans each word within the input text file and checks whether it is identified as PHI. If so, the name is tagged with the <<PersonName>> tag placed after the identified PHI content; an example is provided in figure 5.

Figure 5: Tagging of person name example

However, names in medical records can be misspelt, and the names of other people mentioned in the notes, generally the patient's relatives and/or clinicians, must also be identified and removed. Additional mechanisms are therefore necessary to de-identify such names.

In the second mechanism, the algorithm identifies names within the text by lexically matching words from the free text notes against lists of popular UK male and female names [57]. Over 200 such names have been chosen to populate this list.

In the third mechanism, the algorithm applies simple heuristics and performs a context check to identify additional name PHI, by examining words that immediately follow indicator keywords (such as "Mr.", "Dr.", "Mrs.", "daughter", "son" and "partner"). Through this mechanism the algorithm identifies additional names that are either misspelled or absent from the supplied name dictionaries. A further observation was that the word immediately preceding an identified person name could itself be part of a name, so simple heuristics have also been applied to determine whether such terms are PHI, and they are tagged accordingly.

A common issue with PHI identified as a name is that it may coincide with an ordinary word (such as May, Brown or Hall). In the fourth mechanism, therefore, the algorithm compares names against a list of standard English words obtained from Atkinson's spell-checking oriented word lists [59]. Simple heuristics are then applied to classify potential names as ambiguous or unambiguous, and regular expressions are used to identify patterns where text tagged as a name is also determined to be a common word. An example is provided in figure 6.

In the example, "Aimee Read" is the person name; "Read" is determined to be unambiguous and is tagged as <<PersonName>> in this instance, based on the context clue that a word following a person name is probably a person name as well. Certain instances are nevertheless hard to determine (especially common words without any context clues); these are tagged with the <<Ambig>> tag.

Figure 6: Ambiguous person name tagging

Finally, in the fifth mechanism, the algorithm maintains a list of the person name instances found in the free text notes as potential names. The entire note is then re-scanned (1) to find names that are un-capitalized, and (2) to find words that match the names in this saved list. In the first case, only names excluded from the common-word list are considered, to avoid terms such as "bill" or "green" being misclassified as potential PHI. The second case is motivated by the observation that the same names often reappear in the notes for a single patient: the partner of the patient may visit often, or the same clinicians may see the patient during his or her stay. The first occurrence of such a name in the notes is generally preceded by an indicator and can therefore be de-identified by the third mechanism's context references described above, but subsequent occurrences of the same name may not be preceded by any apparent context keyword. Saving the list of names and re-scanning the text thus helps identify any remaining PHI name instances in the free text patient notes. An example is shown in figure 7: the first instance of "Bob" can be determined using context clues, but subsequent occurrences may lack them and are otherwise hard to capture as names. This mechanism captures both instances of the name.

Figure 7: Tagging of relative names (using context clues)
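As promised above, here is a minimal Java sketch of the third and fifth mechanisms. It is illustrative only: the patterns and helper names are assumptions, and the thesis's actual expressions are in its appendix D.

    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NameMechanismsSketch {

        // Third mechanism: a capitalized word following a title or
        // relationship keyword is treated as a probable person name.
        static final Pattern NAME_CONTEXT = Pattern.compile(
            "\\b(Mr\\.|Mrs\\.|Dr\\.|Miss|daughter|son|partner)\\s+([A-Z][a-z]+)");

        static String tagContextNames(String text) {
            Matcher m = NAME_CONTEXT.matcher(text);
            return m.replaceAll("$1 $2 <<PersonName>>");
        }

        // Fifth mechanism (simplified): re-scan the note so later,
        // context-free occurrences of an already-found name are tagged too.
        // Un-capitalized matches are skipped when the name is also an
        // ordinary English word (e.g. "bill").
        static String rescan(String note, Set<String> foundNames,
                             Set<String> commonWords) {
            String tagged = note;
            for (String name : foundNames) {
                if (commonWords.contains(name.toLowerCase())) continue;
                tagged = tagged.replaceAll(
                    "(?i)\\b" + Pattern.quote(name) + "\\b(?!\\s*<<)",
                    "$0 <<PersonName>>");
            }
            return tagged;
        }
    }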

Location Names

According to the HIPAA guidelines, location PHI is any geographically precise location name (see figure 3). Since the JISC dataset was observed to contain only local and neighbouring location names as PHI, a list of UK cities was compiled and saved in a dictionary, and each scanned word in the input text file is matched against it. All identified PHI content is tagged with the <<LocationName>> tag. During the iterative development of the de-identification algorithm it was observed that some location names appeared un-capitalized, so the algorithm has been defined to identify location names even in un-capitalized form.

A further enhancement was the use of a regular expression to identify postcodes (see appendix D.1.4). UK postcodes are generally represented as a mixture of numbers and letters, for example AB3 6AB, ab3 16ab or Ab12 9ab. The algorithm runs the regular expression against the text file, and all matched patterns are identified as possible PHI and tagged with the <<Postcode>> tag.

Dates

Medical notes are rich in dates mentioned within the unstructured free text. The HIPAA guidelines stipulate that dates pertaining to patients (e.g. birth dates, admission dates, discharge dates) must be de-identified. Dates generally follow specific formats (such as dd/mm/yyyy, dd/mm/yy or dd.mm.yy, where d = day, m = month, y = year). The algorithm uses regular expressions to match such patterns, identifying partial and full dates, which are then tagged with the <<Date>> tag.

In some instances a problem is posed by textual references to calendar events that could disclose the dates of certain occurrences. For example, information that a patient was admitted to hospital on New Year, or references to common holidays (such as Christmas, Ramadan, Thanksgiving or Hanukkah), can be used to infer the dates of events, or the ethnic and cultural backgrounds of patients, significantly reducing the subset of matching patients in the database. Although HIPAA does not specify such textual references as PHI, we deem them important; as part of this thesis, the algorithm therefore identifies such events and tags them as <<Other>>.
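Illustrative Java patterns for the postcode and date formats just described (approximations only; the thesis's actual expressions are given in its appendix D):

    import java.util.regex.Pattern;

    public class LocationDatePatterns {

        // UK postcode: 1-2 letters, a digit, an optional digit or letter,
        // then a digit and two letters; matched case-insensitively so that
        // forms like "ab3 6ab" are also caught.
        static final Pattern POSTCODE = Pattern.compile(
            "\\b[A-Z]{1,2}[0-9][0-9A-Z]?\\s*[0-9][A-Z]{2}\\b",
            Pattern.CASE_INSENSITIVE);

        // Numeric dates in dd/mm/yyyy, dd/mm/yy or dd.mm.yy style formats.
        static final Pattern DATE = Pattern.compile(
            "\\b\\d{1,2}[/.]\\d{1,2}[/.]\\d{2,4}\\b");
    }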

Telephone/Fax Numbers

Patient and medical provider identities can easily be tracked down through telephone and fax numbers included in free text records. Such numbers can also easily be mistaken for other numbers related to medical dosages or measurements mentioned within the text. The software prototype therefore evaluates only complete phone or fax number occurrences within the text file and de-identifies these potential PHI instances.

NHS Number/National Insurance Number (NINo)

The HIPAA guidelines mention the need to remove medical record numbers and social security numbers. In the UK, the medical record number corresponds to the NHS number, and the social security number can be equated with the National Insurance Number. NHS numbers generally follow the pattern nnn-nnn-nnnn or nnnnnnnnnn (where n is a digit). Regular expressions are written for the algorithm to match occurrences of these patterns within the free text and tag them as <<NHSNumber>>. Similarly, the algorithm checks for patterns written to identify the National Insurance Number and tags them as <<NINo>> (see appendix D).

Other PHI Categories

The HIPAA guidelines specify several other PHI categories that must be addressed, so the algorithm also searches for a range of PHI categories that were not included in the JISC dataset. Subroutines of regular expressions have been incorporated to identify email and IP/URL addresses within the free text records; these expressions search for patterns using contextual clues (such as "http", "web." and ".org"). See appendix D for the list of regular expressions used in the de-identification algorithm.

Un-identified PHI Categories

Two PHI categories that have not been handled by our algorithm are biometric identifiers (such as fingerprints and voice recognition data) and photographic images. Both uniquely determine the identity of an individual, and releasing such information risks revealing the identity of the patient. Our dataset of free text patient records included no links or files containing these two PHI categories, so efforts to develop the algorithm focused only on the textual PHI presented earlier in this section. For other datasets, however, it is important to ensure that any files incorporating images or biometric identifiers are thoroughly scanned for such PHI.
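Before moving on, here are hedged Java sketches of the identifier patterns described in the last few subsections (approximations only; the thesis's real expressions are in its appendix D):

    import java.util.regex.Pattern;

    public class UkIdentifierPatterns {

        // NHS number: ten digits, optionally grouped 3-3-4 with hyphens
        // or spaces (e.g. 123-456-7890 or 1234567890).
        static final Pattern NHS_NUMBER = Pattern.compile(
            "\\b\\d{3}[- ]?\\d{3}[- ]?\\d{4}\\b");

        // National Insurance number: two letters, six digits and a final
        // letter, e.g. QQ123456C (spacing variants occur in practice).
        static final Pattern NINO = Pattern.compile(
            "\\b[A-Z]{2}\\s?\\d{6}\\s?[A-D]\\b",
            Pattern.CASE_INSENSITIVE);

        // Simple email address pattern.
        static final Pattern EMAIL = Pattern.compile(
            "\\b[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+\\b");
    }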

Discussion

The de-identification algorithm approach explained in this thesis is generally applicable to free text patient records. Two features of the algorithm have proved essential to its development and to the consistency of its performance. First, the algorithm consists of separate modules, each identifying the patterns of a different PHI category, so its performance at recognizing one category does not affect another: identifying dates, for example, does not affect name recognition. This design allowed rapid iterative development of the algorithm, with repeated testing at each stage. Second, each section of the text is processed by all the modules independently, so the order in which the modules parse the text does not matter: laid out in any linear order, the algorithm's de-identification performance remains consistent. This made the algorithm easy to design, modify and improve.

4.2.2 Anonymizer Algorithm

The anonymizer algorithm is the second stage of the de-identification process and is run after the tagged document has been generated. In this stage, the algorithm scans each line word by word, identifies all the tagged features in the tagged document, and replaces each piece of tagged data with just the identifying tag name, removing the actual content. As a further improvement in this second phase, names followed or preceded by another tagged name or context are replaced with a single instance of the <<PersonName>> tag to improve readability. An example is shown in figure 8.

Figure 8: Anonymization of tagged data

At a later stage it was realised that keeping separate tagged and anonymized output files also helped the other researchers who had access to patient records: they could use the tagged file to refer back to the content of the data, seeing the actual data that had been tagged and confirming that it really was PHI.
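A minimal Java sketch of this second stage (illustrative only; it assumes tagger output in which each tag directly follows the tagged content, e.g. "Bob <<PersonName>>"):

    public class AnonymizerSketch {

        // Replace each "content <<Tag>>" pair with just the tag, then
        // collapse runs of identical adjacent tags (e.g. a tagged first
        // name followed by a tagged surname) into a single tag.
        static String anonymize(String tagged) {
            String s = tagged.replaceAll("\\S+\\s*(<<[A-Za-z]+>>)", "$1");
            return s.replaceAll("(<<[A-Za-z]+>>)(\\s*\\1)+", "$1");
        }
    }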

4.3 Deployment

We have already set out the functionality of the algorithm and its execution; this section addresses the actual deployment of the software.

One of the requirements of the thesis is that the system should run on a different machine, so that users can run the prototype on their own systems against their own sets of patient records. Before the system can be deployed on other platforms, however, its contents must be packaged. The de-identification algorithm was initially developed as the two stages discussed in section 4.2, with the tagger and anonymizer algorithms in separate class files executed separately; for the final prototype the two stages have been merged into a single algorithm. The algorithm also makes use of a list of dictionary files to identify PHI within the documents.

To aggregate the Java class files and associated resources (such as the dictionaries saved as text files) into a single file, the JAR (Java ARchive) format has been used. The JAR file [66] packages the main de-identification algorithm class file and the related dictionaries as a single executable file with the .jar extension, allowing the algorithm and its resources to be deployed efficiently. JAR files are built on the ZIP format: the elements are compressed together and the entire application can be downloaded in a single request, which is much faster than downloading each file individually (see appendix G for the creation and execution of jar files).

The de-identification algorithm class file and dictionaries have thus been packaged as a single downloadable JAR file, termed Med_Anonymizer. The toolkit can be downloaded and placed anywhere on the system. The deployment process is described below (see also the user manual in appendix I for a detailed walkthrough if necessary).

User Task 1: Downloading and Installing the Toolkit

The Med_Anonymizer toolkit is a single zipped folder named Med_Anonymizer.zip that can be downloaded as one software package. The following steps need to be followed:

- Download the Med_Anonymizer.zip folder anywhere on the system and extract its contents.

Figure 9: Med_Anonymizer toolkit (zipped file)

Once the toolkit has been downloaded and all contents extracted, the Med_Anonymizer folder will contain the following files: Med_Anonymizer.jar, PersonList.txt, LocationList.txt, PPersonList.txt, CommonWord.txt and Input.txt.

Figure 10: Files within the Med_Anonymizer toolkit

To check that the tool has been installed correctly, run the Med_Anonymizer.jar file from the terminal or double-click the jar file. If the tool is installed correctly, three output files are produced: Tagged.txt, NewList.txt and Anonymized.txt. An error message will pop up if the toolkit has not been installed correctly.

Figure 11: Output files after running the Med_Anonymizer toolkit

User Task 2: Running the Med_Anonymizer jar file

The user first needs to provide the input file to be anonymized (steps are given in the user manual in appendix I). Once the input file has been provided, the user runs the Med_Anonymizer.jar file. The de-identification algorithm reads the input file; all tagged PHI instances, along with their data content, are saved in the Tagged.txt file, and the anonymized data is stored in the separate Anonymized.txt file.

4.4 Chapter Summary

Overall, this chapter first discusses the implementation of the dictionaries and the development of the gold standard corpus. It then discusses the implementation of, and the challenges faced during, the development of each of the PHI modules. The results obtained and the performance of the toolkit are addressed in the next chapter.

Chapter 5
Evaluation

This chapter discusses the evaluations carried out to determine the performance of the de-identification algorithm. Based on the feedback and results from the different evaluations, the algorithm has been improved iteratively. The final results of the de-identification algorithm are discussed in the following sections.

5.1 Evaluation of Tools and Techniques

To measure the performance of the system, the following evaluation mechanisms have been carried out:

1) Evaluation against the gold standard corpus to determine the performance of the de-identification algorithm; this also measures performance against manual de-identification results
2) Evaluation of the consistency of its performance on different datasets:
   a. Ghana test corpus
   b. India test corpus
   c. Leeds cancer centre test corpus
3) Comparative study with an existing system (GATE-ANNIE)

5.2 Evaluation Metrics

To determine how well the de-identification algorithm can identify the PHI categories, the data is evaluated on how well content has been classified into a PHI category and removed. For this purpose, the following performance measures have been considered.

5.2.1 Performance Measures

To determine the correctness of PHI classification, we count the correctly recognized PHI instances (true positives), the incorrectly assigned instances (false positives), the instances not recognized by the system (false negatives) [67], and finally the text correctly considered non-PHI (true negatives).

Precision, Recall and F-measure

To evaluate the effectiveness of the anonymized data, we use precision, recall and F-measure, the standard evaluation metrics in NLP and Information Retrieval research.

Precision is defined as the proportion of true positives among all terms identified as PHI by the software, that is, true positives plus false positives [67]:

    Precision = TP / (TP + FP)

    Equation 2: Precision

Recall is defined as the proportion of PHI identified by the software (true positives) out of all instances of PHI in the text, that is, true positives plus false negatives [67]:

    Recall = TP / (TP + FN)

    Equation 3: Recall

F-measure is defined as the harmonic mean of precision and recall [67]:

    F-measure = (2 × Precision × Recall) / (Precision + Recall)

    Equation 4: F-measure
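As a quick worked example with hypothetical counts: if a run correctly identifies 90 PHI instances (TP = 90), wrongly tags 10 non-PHI terms (FP = 10) and misses 10 PHI instances (FN = 10), then Precision = 90/100 = 0.90, Recall = 90/100 = 0.90 and F-measure = (2 × 0.90 × 0.90)/(0.90 + 0.90) = 0.90.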

5.3 Evaluation using Gold Standard Corpus

The gold standard corpus, tagged by two annotators and reviewed by a third as discussed in section 4.1.3, has been used to measure the performance of the de-identification algorithm.

Algorithm performance and results

The algorithm's performance was tested at each stage of the prototype's iterative development by comparing its output with the gold standard corpus: the PHI identified in the algorithm's output was compared against the PHI listed in the gold standard corpus. Examining the errors made by earlier versions of the system led to adaptations to deal with those errors, and hence to iterative improvements. Certain discrepancies could not be resolved by the algorithm, such as a name that could refer to either a person or a location; such instances are tagged with just the <<Name>> tag. Since such an instance has been determined to be PHI by the algorithm and removed, it has been counted as a true identification (true positive) for calculation purposes when comparing with the gold standard corpus.

The final de-identification algorithm output was evaluated against the PHI in the gold standard corpus, and the results are given in Table 4. The input file for this run included all free text patient records, including the manually enriched ones. However, dates with only the year mentioned have been excluded from Table 4: the PHI category listed as "year" in Table 2 refers to the year in which a particular surgery or treatment occurred, which is currently not addressed by the de-identification algorithm, as the HIPAA guidelines exclude year-only occurrences (see figure 3). The overall precision and recall across all categories combined were then calculated for the algorithm.

    PHI Type                                  tp    Gold    fp    fn    Precision    Recall
    Person Names                              -     -       -     -     -            -
    Location Names (excluding postcode)       -     -       -     -     -            -
    Dates                                     -     -       -     -     -            -
    Other dates (e.g. 14th Feb 2011 etc.)     -     -       -     -     -            -
    Other PHI⁸                                -     -       -     -     -            -
    Total PHI count                           -     -       -     -     -            -

    Table 4: Performance on gold standard corpus
    (tp = PHI count correctly identified; Gold = PHI count in the gold corpus;
    fp/fn = false positive/negative counts)

Discussion

The de-identification algorithm performed considerably well at identifying PHI correctly. Among the three main categories in the JISC corpus (person names, location names and dates), location names and dates scored high: only two instances of location names and one date were missed from the entire corpus. This was mainly because the dictionary was customized to contain only UK cities and local places; the record containing two location instances referring to a person from Egypt was missed. Although a wider range of locations could be included from GeoNames [58], the algorithm was observed to perform better with customized location dictionaries. This customization was important to ensure that locations smaller than a state, and hospital names, are removed by the de-identification algorithm. It could be improved further by keeping two lists: a standard list of all major cities and countries, and a smaller modifiable list of specific local location names.

⁸ Other PHI: telephone number, email, IP, URL, postcode etc.

In terms of recall, false negatives were observed for person names, the most challenging PHI category. Most missed names were relatives' names and misspellings, which do not match the entries in the supplied dictionaries. Context information is currently used to improve identification: with simple heuristics, expressions identify words preceded or followed by "Mr.", "Doctor", "father" etc. as PHI. However, these context rules can raise the false positive count, and in our algorithm some instances were incorrectly tagged as names. The rules designed for the de-identification algorithm were therefore based on carefully examining the trade-offs between false positives and false negatives. Even with the extensive dictionary and context references, misspellings remained a major source of difficulty when de-identifying free text patient records. In addition, names that are also common terms (such as May or Brown) were identified as false positives; although context checks help, such ambiguous terms are very hard to determine, and some common terms identified as person names increased the false positive count.

To give the algorithm some word-sense ability for classifying ambiguous terms as PHI or non-PHI, incorporating the Stanford POS tagger was considered. The POS tagger did not prove efficient enough to incorporate, however, because the patient notes were partly ungrammatical and the tagger was unable to reliably classify words as proper nouns. Several medical terms were also identified as proper nouns, which would have required a further dictionary to first exclude all medical terms before applying the POS tagger; and with the sentences often poorly written, many common terms were identified as proper nouns as well. We concluded that unless the POS tagger was retrained it would add little, so at this stage it has not been included. Spell-checking libraries such as Ispell or Aspell [68] have also not been used.

Sometimes, beyond the PHI categories specified by HIPAA, free-text context information may reveal a patient's identity: for example, "the patient had an appendectomy in 1998". Further investigation, with probable knowledge of the source of the medical data and access to other records, could reveal details about the event and the identity of the unique patient. Scrubbing such contextual information is an extremely difficult task; it has not been addressed here but could be considered in the future.

Overall, de-identification systems have the potential for widespread use in research and information sharing.

The de-identification algorithm described here is sufficiently generalized to handle the anonymization of different datasets; its performance on other medical datasets is discussed later in this chapter.

Finally, the de-identification algorithm also performed better than manual de-identification by a single annotator, and generating the output files was far faster than manual work. Manual de-identification showed that although the precision of manually tagged PHI may be higher, the number of missed instances was considerably greater than the number missed by the algorithm. Since finding PHI is what matters most for medical datasets, the de-identification algorithm fared much better than manual anonymization.

5.4 Evaluation with Ghana Test Corpus

To test the algorithm on data not used during its development, a different set of free text patient records was required, but gaining access to another source of patient records was not easy. The system had, however, been developed with this in mind: although a dataset could not be obtained directly for evaluating results, the tool could easily be deployed on different operating systems and could therefore be given to people who already had access to patient records.

While looking for a source with which to evaluate and test the tool, Samuel Danso, a PhD research student at the University of Leeds working on medical narratives of verbal autopsies for determining cause of death using machine learning, was identified as a valuable contact. Samuel had access to a corpus of verbal autopsies containing free text patient records describing the circumstances of patients' deaths. The free text in this corpus contained PHI instances that needed to be anonymized, so it proved a valuable source for the project. To evaluate the toolkit, the tool, together with a user manual (see appendix I) describing how to run it and the dictionaries provided, was given to him for trial. Feedback and results on the process, and on the tool's ability to de-identify PHI in the verbal autopsy corpus, were then collected and analysed.

Results and Feedback

This section evaluates not just the performance of the de-identification algorithm but also the use of the tool; the entire process for the tool provided is discussed here.

The evaluation of the PHI instances present in the corpus, and of the missed classifications, was done manually, since no gold standard corpus had been developed for the Ghana test corpus.

Following the instructions in the user manual, Samuel reported that installing and running the Med_Anonymizer toolkit was quite easy. The tool was evaluated against a small set of Ghana verbal autopsies, and the results are reported here. The input file chosen for the evaluation consisted of about 15 patient records, averaging over 200 words per record, and contained 66 PHI identifiers relating to location names and person names.

On the first run, only two PHI categories were identified, mainly because the verbal autopsy corpus was rich in PHI relating to locations in Ghana, which were not recognized. The location dictionary was therefore modified to include local regions in Ghana; feedback on updating the dictionary indicated that it was simple to access and modify. An evaluation was then carried out, and a summary is provided in Tables 5 and 6. Since the corpus contained 66 PHI identifiers in total, of which 52 were found (Table 6), the location figures follow from the person-name counts.

    PHI Categories   PHI count   False negative count
    Person names     5           3
    Location         61          11
    Overall          66          14

    Table 5: Ghana test corpus PHI count

    PHI Data             Classified as positive   Classified as negative
    Actual PHI           52 (tp)                  14 (fn)
    Actual non-PHI       2 (fp)                   - (tn)

    Table 6: Ghana test corpus performance

From Table 6, the overall precision and recall of the system work out to 52/54 ≈ 0.96 and 52/66 ≈ 0.79 respectively.

5.4.2 Discussion

Most of the unidentified PHI instances were missed because the dictionaries were not fully populated. For person names, most were identified from the list of English names, but local names were not, and this can only be improved by extending the person-name list; context references to names were, however, well identified. The location names in the dictionary had also been chosen from a small list, yet the PHI identified from it was almost entirely accurate, and the provision of dictionary files that can be populated to suit the corpus proved significantly useful. Several local names that appeared in both the person and the location dictionary had been tagged with the <<Name>> tag where the algorithm could not determine the context appropriately.

The overall false positive count was quite low. In one instance, the phrase "father Hails" led the context checks to tag "Hails" as a name, although this was a local verbal form of expression and "Hails" did not actually refer to a person. Such instances, arising from local colloquial usage within the text, would be very difficult to classify accurately; however, these misidentifications accounted for only a small number of errors and did not actually harm readability. This also showed that the expressions written in the de-identification algorithm were quite general and worked well for this dataset.

Although few in number, spelling errors in names and locations still caused some PHI instances to go unidentified. The overall system performance was nevertheless high, although a larger dataset would probably have revealed more errors.

In conclusion, the prototype also proved quite useful for Samuel's own project. One issue he faced was making the corpus available to other research students working in the same area.

Since Samuel had signed ethical permissions to work on the data, using it for his own research was straightforward; to provide it to others, however, it needed to be anonymized first, and the tool proved an effective solution for this task (see appendix F.1.1 for Samuel's comments).

5.5 Evaluation with India Test Corpus

A further source was sought to test the de-identification algorithm and to receive feedback on the performance of the Med_Anonymizer toolkit. Nikita Desai and Lukasz Aleksandrowicz of the Centre for Global Health Research (Li Ka Shing Knowledge Institute, St. Michael's Hospital, Toronto, Canada) were identified as potential sources. They were working on a project with access to free text patient records obtained from India; these records had not yet been de-identified and were useful for our evaluation. They were given the same toolkit and instruction manual and asked to test it and report their observed results.

They tested the de-identification algorithm on 22 medical narratives, each containing close to 250 words, all taken from verbal autopsy cases in the Million Death Study in India. The algorithm was run on each case twice: once before including the names of the deceased, the respondent and the mother in the known-name dictionary, and once after including this information. The results obtained from them, along with a general summary of the prototype's performance, including any issues observed with the anonymization of the records, are detailed in the next sub-section.

Results and Feedback

As with the Ghana test corpus, the algorithm was run in two sessions; feedback from the first session helped revise some of the algorithm rules, and the results obtained from the second session are calculated in Table 7, with a discussion of system performance later in this section.

During the first trial, the de-identification algorithm was quite consistent at removing doctors' names based on context checks of titles. But because many of the names in the narratives were not listed in the dictionary (typical Indian names being quite different from typical English names), the algorithm could not remove most of the names in the free text before the dictionary was updated. Once the dictionary was updated with the list of known names, all names spelt correctly in both the database and the free text were removed successfully; any misspelling of a name in the free text, however, resulted in that name not being tagged.

In addition, short forms of names (Arvind vs. Arvinder) were not tagged; this was expected, as the algorithm was not designed to capture such instances.

Most dates were successfully removed from the free text, but some date formats differing from the ones the toolkit was initially designed for were not. A list of the dates that were not captured was provided so the algorithm could be improved to include these formats. In the first run, location names were not anonymized, which made sense, since they were all Indian cities or districts; after the dictionary was populated, most location names were anonymized. However, locations that were in the dictionary list but appeared in lowercase within the free text records were missed, because the de-identification algorithm had not yet been written to check for un-capitalized location names. This feedback allowed us to improve the algorithm rules to consider un-capitalized location names and to extend the expressions to identify more date patterns. The updated de-identification algorithm was then provided to them.

Since there was no gold standard corpus against which these results could be compared, we asked them to provide quantitative feedback from which to calculate the results. The record of all missed and incorrect instances that they provided has been used to determine recall and precision; the recorded values are shown in Table 7 and the calculations are based on them.

    PHI Data             Classified as positive   Classified as negative
    Actual PHI           90 (tp)                  2 (fn)
    Actual non-PHI       1 (fp)                   - (tn)

    Table 7: India test corpus performance

From Table 7, the overall precision and recall of the system work out to 90/91 ≈ 0.99 and 90/92 ≈ 0.98 respectively.

5.5.2 Discussion

Observation of the resulting file revealed that only two names had been missed, due to abbreviation and spelling errors; otherwise the de-identification algorithm proved very effective at anonymizing the data content, especially once the dictionaries were well populated. Another observation was that, among the small set of records chosen, only a few misspellings occurred within the corpus, possibly because the staff in charge of updating the records in the system had entered the text carefully.

In conclusion, the de-identification algorithm produced considerably good results for this dataset. Among the data identified as PHI, very few instances were wrongly identified; such errors exist because the notes themselves are not always well written and do not follow a consistent format. Overall, both researchers were impressed with the performance of the prototype and the results of the de-identification algorithm. Their feedback regarding the tool is listed in appendix F.

5.6 Evaluation with the Leeds Cancer Centre dataset

As mentioned in Section 3.1, the Leeds cancer centre dataset was a crucial requirement for the design of the de-identification algorithm; however, given the limited access to the dataset, workaround methods were used instead. In this section we were finally able to gain some insight into the data stored in the PPM database, and a discussion of this is provided.

5.6.1 Overview

The Leeds cancer centre stores all its patient records in a single PPM database, an integrated information and data management system [69]. It holds over 1.2 million patient records, each with several free text patient notes recorded over the entire period of treatment to date. For every patient record a set of free text notes is recorded, containing several years of patient history, diagnoses and the treatment provided; each patient has roughly 30 to 190 such medical notes, with close to 200 words per note.

All patient record data is stored in the PPM database under several headers: each patient entry is recorded along with the date and the associated medical notes under different headers.

has also been recorded under separate headers. These were the most important and relevant headers for our de-identification algorithm: the patient and referral names were needed to populate the dictionaries, and the free-text medical notes were required to test the anonymization itself.

Once the tool prototype had been designed, it was provided to the centre, together with the toolkit and user manual, so that they could run it on their patient records and give us constructive feedback. At the beginning of August, Dr. Geoff Hall (Senior Oncologist, Leeds cancer centre), who was working towards anonymization of the patient records, was able to meet us and give us his input on the results. During the meeting he showed us how the data was actually stored within the database. He also ran the de-identification algorithm and provided us with qualitative feedback on the results. A discussion of this is provided in the next sub-section.

5.6.2 Results and Feedback

For the purpose of the evaluation, an input file was chosen containing over 191 concatenated free-text patient notes relating to a single patient record, with close to 36,000 words. The PHI content included names relating to the patient, relatives and clinicians, as well as misspelt instances of the patient's name. In addition, a few location names and dates were also present. The first run of the de-identification algorithm performed reasonably well: it tagged most instances of location names and dates, but did only a fair job with the person names. The names were, however, considered to be of high importance for this dataset. Since the dictionary had not been updated to contain the list of patient and referral names, only those instances found through context checks, or names already present in the dictionary, were identified. They were nevertheless confident that once the dictionary was updated with the known list of patient and referral names, the de-identification would improve considerably. Misspelt names noticed in the file would, however, still pose a difficulty during de-identification.

Unlike in the previous evaluations, one consideration they thought would be useful was if we could separate out the PHI identified as a person name into the patient's name, and also incorporate possible tags for referrals. Another useful option for them would be if we could, in the future, provide a solution whereby all the annotated text could be re-saved into the database. Taking these needs into consideration, further enhancements to the algorithm have been considered; a discussion follows next in this section.

5.6.3 Enhancements made to the algorithm

Before we began the enhancements to the algorithm, design considerations were made based on a few observations. Firstly, the three datasets we had looked at so far were the JISC dataset, which consisted of fictional data, and the Ghana and India verbal autopsy datasets. All three datasets had roughly 200 words per patient record. The two verbal autopsy datasets only recorded the cause of death of a particular patient, hence only one note per patient had been recorded. For the JISC dataset, medical students had recorded the history of a single patient during separate training sessions, but only a RowId had been used; since no shared patient id had been provided, it was hard to determine which RowIds referred to which patient. So instead of separating out individual patient records, we had treated all the records as a single corpus.

For the Leeds cancer centre dataset, however, each patient had roughly 30 to 200 notes recording the history of the patient's illness, treatments and medications given. Instead of considering the entire dataset as the input file, we decided to treat each patient's set of records as a single file. This was mainly because one of the challenges we had faced while designing the de-identification algorithm earlier was that it was not able to apply a spell check on the file. We decided that treating each patient's records separately would allow us to have a header text that could be of use in (1) separating out the patient's name from other names, and (2) applying a spell checker to the patient's name. Based on this, each input file would have a header containing the patient's name at the beginning of the file, followed by the body of the text containing the concatenated list of medical notes to be anonymized. An example is shown in figure 12.

Figure 12: Leeds cancer centre input file

For the implementation itself, although we were able to re-modify the algorithm to an extent, we were not able to get the centre to test it and provide feedback on the results. However, the few modifications made to the de-identification algorithm, and a review of how some of the other features were going to be utilized in it, are described here.
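As a rough illustration of the header-based design just described, the following minimal sketch (a reconstruction under the stated design, not the project's actual code; the file names and class name are assumptions) reads the patient's name from the header line and replaces every occurrence in the notes, capitalised or not, with the <<PatientName>> tag listed in appendix H:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.regex.Pattern;

public class HeaderAnonymizer {
    public static void main(String[] args) throws IOException {
        // Hypothetical input: the first line holds the patient's name,
        // the rest is the concatenated free-text medical notes.
        List<String> lines = Files.readAllLines(Paths.get("Input.txt"));
        String patientName = lines.get(0).trim();

        // Case-insensitive matching so un-capitalised occurrences are caught too;
        // Pattern.quote treats the name literally rather than as a regex.
        Pattern name = Pattern.compile(Pattern.quote(patientName),
                                       Pattern.CASE_INSENSITIVE);

        StringBuilder body = new StringBuilder();
        for (int i = 1; i < lines.size(); i++) {
            body.append(name.matcher(lines.get(i)).replaceAll("<<PatientName>>"))
                .append(System.lineSeparator());
        }
        Files.write(Paths.get("Anonymized.txt"), body.toString().getBytes());
    }
}
```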

Firstly, the de-identification algorithm reads the header and saves the name. Every time the name is found in the text file, it is replaced with <<PatientName>>. This was tested on a test file and worked efficiently: the algorithm replaced all found instances in the file, taking into consideration even un-capitalised names. The patient records also had around 200 referrals to consultants, clinicians, names of kin etc. These have been used to populate the name dictionary, and lexical matching has been carried out to identify all names from this list. Further, all context matching and heuristics are applied to the text to identify any names that are not within the dictionary list (for example, a patient's relative). All these names are tagged as <<PersonName>>.

The most useful advantage of having the name field as the header is that a spell checker can be used to identify misspelt instances. One of the implementation difficulties earlier was that there were no headers containing the patient names; the cancer dataset, however, had a header with the patient's name followed by all the annotations to date for that patient. A spell checker identifying misspelt instances using the patient name as the reference could therefore be much more efficient. One reason for not applying the spell checker to the entire list was that several common words were also identified as names (for example, if Hall appeared in the dictionary list, then instances of shall were also identified as spelling errors). Restricting the spell checker to the name mentioned in the header could make this more efficient. If Hall were the name mentioned in the header, instances of shall would still be identified, but this would be limited to that one file rather than the entire data corpus, as the algorithm is run on each file separately. Applying further heuristics could possibly resolve the remaining issue. These were some of the design considerations and modifications that have been, and can be, made to the de-identification algorithm for the Leeds cancer centre dataset.

5.7 Evaluation with another system: GATE (ANNIE)

To further analyse how well the de-identification algorithm performed in comparison to an existing system capable of annotating text files, we have used the open-source GATE system. This natural language processing framework was available here at the university and has been considered because it is one of the most popular NER systems currently in use. ANNIE is distributed freely with GATE. It has been used to tokenize the input texts, split sentences, look up tokens in its dictionary, produce part-of-speech tags and, finally, run the NE transducer

to identify entities based on JAPE grammar rules. The predefined entity types ANNIE recognizes, including person name, date, address, location, job title etc., overlap with the entity types defined as PHI, and thus the system is useful for evaluating how precisely it would annotate such content in free-text medical records. The JISC corpus was provided as the input file to ANNIE. From the resultant annotated text, only the annotations considered as PHI were analysed (namely person name, date and location); these were also the main PHI categories within the JISC corpus.

It was noticed that applying ANNIE directly did not yield good results. The entity types identified were not defined in precisely the same way as those classified as PHI; for example, date entities found by ANNIE could include years and times that were not strictly PHI. A further observation was that the ANNIE person-name and location-name recognizer uses dictionary lookup (the dictionary was later modified to also include the known names) and JAPE rules to identify names in the texts. This recognizer did not perform well on the data, probably because it was developed mainly for the newswire domain, which is somewhat different in style to free-text patient records [37]; also, the medical text was not as well written as newswire text. Our algorithm uses similar features, dictionary lookup and regular expressions comparable to JAPE grammar rules, and was able to perform better. Secondly, there was other PHI content which our algorithm was capable of identifying. JAPE grammar rules could be written to incorporate such features, but our evaluation was based on how well an existing system could identify PHI without any modifications other than populating its dictionary with the set of known PHI names. Thirdly, it was hard to determine why a particular date was not tagged, or why only one instance of a place (Headingley) was tagged and no other instances. Also, some of the expressions our algorithm uses are capable of saving data content matched by one rule and using it with other expressions to further determine whether it is PHI. In addition, ANNIE took a considerably longer time to annotate the same size of corpus than our de-identification algorithm. Table 8 shows the results obtained from ANNIE.

PHI Type                                  PHI count       PHI count   Correct      False positive   False negative   Precision   Recall
                                          (Gold Corpus)   (ANNIE)     count (tp)   count (fp)       count (fn)
Person Names
Location Names (excluding postcode)
Dates
Other dates (e.g. 14th Feb 2011 etc.)
Other PHI (9)
Total PHI count

Table 8: ANNIE performance results

The overall precision and recall for all PHI categories were calculated to be and respectively, which was very low compared to our de-identification algorithm's performance results. Guo et al. also treated the task of identifying PHI in medical reports as a NER task. The team likewise used the open-source GATE system [26] and ANNIE to pre-process and annotate a training set. They too noticed that ANNIE on its own performed very poorly, and thus multiple features were added empirically by the developers to achieve higher accuracy of PHI detection. The Guo system's performance was, however, recorded to be below average [16]. Training the GATE system would considerably improve its performance, but this would be effective only if a large corpus drawn from different medical datasets were available, because a system trained on a single training file would prove less effective on other datasets. These are issues which our de-identification algorithm has been able to address through the use of expressions to identify patterns.

(9) Other PHI: telephone number, email, IP, URL, postcode etc.

A further limitation is that GATE only tags the text files; actual anonymization would subsequently require another system to read in the annotated text file and remove the PHI-related tags. Our de-identification algorithm addresses both the tagging and the anonymization. Simply incorporating ANNIE to identify PHI in free-text patient records is therefore not efficient: considerable changes would need to be made to the JAPE grammar to identify the patterns within the text.

5.8 Chapter Summary

The de-identification algorithm has been used to anonymize free-text patient records. Our algorithm has been tested on four different datasets and we have obtained good results. The algorithm has also shown sufficiently good performance in comparison with manual de-identification, both in speed and in the identification of the PHI categories. For the initial development dataset, we evaluated the results against the gold standard corpus, which had been tagged by two annotators and reviewed by a third. The findings revealed a few missed instances, mostly involving spelling errors and names not in the dictionary; these were observed in the other datasets as well. Misspelt words are a major source of difficulty in de-identifying free text. Although extensive dictionaries of known PHI and context clues have been used to identify instances of names, locations, etc., spell-checking libraries such as Ispell or Aspell [68] have not been incorporated. Since names are the most dangerous type of PHI to miss, an obvious extension would be to look for likely misspellings of the patient's name if it is known a priori. This has thus been one of the considerations for the Leeds cancer centre dataset, because the name field was available as a header at the beginning of the input file.

In addition to the PHI categories specified by HIPAA, our de-identification algorithm has also addressed other contextual information that may reveal a patient's identity. However, some PHI instances can still be missed. Despite the risk of such unintended PHI disclosure, it is not considered feasible to have a clinician review every de-identified record to ensure removal of all PHI content; in fact, even with two annotators reviewing the text, we were unable to achieve 100% accuracy in the de-identification of all PHI instances. The de-identification algorithm we have developed is reliable, and with further improvements it can come close to achieving a very good score. In conclusion, the prototype designed, despite its few shortcomings, is quite useful; with further improvements to the algorithm it would be even more effective at de-identifying patient records.

Chapter 6
Conclusion

In this chapter the overall project is concluded by evaluating whether the minimum requirements set out have been met. It also discusses the contributions made. Moreover, the chapter describes the exceeding requirements that were achieved and the possible areas for extending the project in the future.

6.1 Thesis Summary (Project Evaluation)

This section addresses how the aim and minimum requirements have been met and concludes the summary of the thesis.

6.1.1 Aim and Minimum Requirements

In this section we detail the aim and the minimum requirements that were set out, together with a description of how each has been achieved:

1) To identify potential de-identification processes or techniques that will be suitable for the design of the prototype: In chapter 2 we discuss the possible approaches to de-identification. These techniques were then revisited during the analysis and design phase in chapter 3 while building the de-identification algorithm. We have also carried out a comparative evaluation of our de-identification algorithm against an existing system, which is discussed in section 5.7.

2) Build a bespoke prototype tool that anonymizes the PHI categories within one dataset: Chapter 4 discusses the implementation of the de-identification algorithm, the tools and libraries used, the techniques and methods applied to the data file and, finally, the PHI categories present in the dataset and the use of expressions and dictionaries to identify them. The actual anonymization algorithm, which removes data identified as PHI, has been discussed in section

3) Produce the resultant data in an appropriate format that is suitable for researchers: In section 3.4 we discuss the use of a tagged data file which holds the data content along with the PHI tags. Section 4.4 further discusses how each PHI category has been tagged. The final anonymized data that results can then be used by researchers.

6.1.2 Exceeding Requirements

1) Run the prototype version on more than one dataset to evaluate its consistency in performance: Sections 5.4, 5.5 and 5.6 discuss how the de-identification algorithm performed on different datasets. Although initially it was to be tested only on the cancer dataset, other datasets were also used to evaluate the performance of the algorithm. This in fact led to considerable changes that improved the algorithm.

2) Enhance the prototype version to include other feasible types of PHI categories that need to be anonymized: Section 4.1 discusses all the PHI categories addressed by the algorithm. Originally it was sufficient to anonymize only the main PHI present in the corpus, but to improve the system the algorithm was extended to include other PHI.

3) Enhance the algorithm to incorporate more advanced techniques to enhance the de-identification process: Sections 5.6 and 5.8 discuss the use of a POS tagger to improve the system. The use of Ispell has also been addressed in section 5.8. However, learning new advanced techniques and incorporating them still posed quite a challenge.

6.1.3 Conclusion

Overall, the de-identification algorithm proved to be quite efficient. Names, however, remained the most difficult PHI to anonymize. Other PHI instances, even if missed for a record, offered very little probability of an individual recognizing the patient, since most of the data for that record had already been anonymized. For example, if a date was missed for a record, all the other PHI had been removed, leaving very little probability of re-identification.

6.2 Challenges

1) One of the major challenges was that, although research on the topic was quite extensive, the published work gave very little detail on the functioning of the algorithms used, making it difficult to incorporate them into our system.

2) The review found that other systems were internal developments by medical research teams who had access to their own data, whereas we had only limited access to the cancer patient data which was the original goal of the project.

3) The use of various advanced methods and techniques was not possible due to the limited time. Although some methods were tried, incorporating the features into our algorithm still posed quite a challenge. This led us to use simple methods and principles that the author already had knowledge of.

4) Designing a system involves a lot of time spent on testing and improving features, which was a considerable strain as all the work was being done by a single individual.

5) Several new modifications were only possible after trials on different datasets. It was also quite difficult to incorporate all the finer details.

6) As mentioned earlier, one of the challenges of designing a system based on rules and expressions is that it requires a comprehensive study of the dataset. A considerable amount of time therefore had to be spent reading through the files and identifying commonly occurring patterns.

6.3 Contributions

In this thesis, we present solutions for the de-identification of patient records. Our solution uses dictionaries and pattern-based rules to tag and anonymize PHI in free-text patient records. The algorithm also takes into account several context references to further determine the probability of whether a piece of information is PHI or not. Although not completely accurate, this has improved the overall performance of the system considerably.

Secondly, we have provided an interim tagged file that allows clinicians to further review PHI content that has been misclassified or not tagged. This tagged data would allow a more detailed study of missed instances, enabling developers not only to improve the prototype but also to provide a more accurate and efficient system.

Thirdly, the tool can be easily downloaded and run on different machines without the need to learn any technical NLP tools. This has allowed us to provide it to different researchers for trial and feedback. The tool has also proved useful for their own research work, as anonymization of patient records had been one of the primary concerns in their research. The gold standard corpus developed would also be useful for other researchers who would like to test a different de-identification approach using the JISC dataset.

6.4 Future Work

This section discusses the enhancements and possible areas of future projects that can be taken up by other MSc students as extensions of this project.

6.4.1 Enhancements

The following future enhancements could be applied to increase the algorithm's performance:

1) One area of development is to train the POS tagger on medical datasets to improve its tagging capability.

2) Use word sense to remove misspelt names. An obvious extension is to enhance the current algorithm to look for all possible misspellings of the patient's name, as names are the most dangerous type of PHI to miss.

3) For research purposes it is sometimes important to retain dates within the medical records [25], mainly because it may be necessary to monitor the patient's stay and the progress of his/her medical condition [25]. This, however, went beyond the initial scope of this project, which was only to identify and remove PHI categories within the free text; we have therefore anonymized all date content. Future work could offer the option of whether or not the date feature should be anonymized.

4) For the PPM dataset, one further enhancement is the integration of the Med_Anonymizer toolkit with their database. A further version of the algorithm would access the data rows directly from the database, which would act as the input before the algorithm is run; the output would then be saved as a separate file in the database. Thus the next step would be to build and integrate the two systems. A possible future work might be for the cancer research group to take up the project internally for further development with someone who has access to the cancer data, possibly by employing the author as a cancer researcher once ethical approval has been obtained.

6.4.2 Future Projects

The following are proposed project areas based on this project:

1) Use machine learning on the anonymized text to identify diseases and treatments.
2) Develop a more refined algorithm incorporating the features mentioned under further development.
3) Build an interim feature to allow selection of the PHI content to be anonymized.

Bibliography

1. NHS. NHS Confidentiality Code of Practice. [Accessed 19 August 2012]; Available from: asset/dh_ pdf.
2. HIPAA. HIPAA Website. [Accessed 25 August 2012]; Available from:
3. PHI. Protected Health Information Website. [Accessed 25 August 2012]; Available from:
4. Code of Federal Regulations Website. [Accessed 25 August 2012]; Available from:
5. Uzuner, O., P. Szolovits, and T.C. Sibanda, Was the patient cured?: understanding semantic categories and their relationship in patient records, 2006, Massachusetts Institute of Technology.
6. Website. CRISP-DM process. [Accessed 13 June 2012]; Available from:
7. Boehm, B.W., A spiral model of software development and enhancement. Computer, (5).
8. Larose, D.T., Data mining methods and models, 2006, John Wiley & Sons, Inc.: Hoboken, NJ, USA.
9. Wirth, R. and J. Hipp, CRISP-DM: Towards a standard process model for data mining. Citeseer.
10. Shearer, C., The CRISP-DM model: the new blueprint for data mining. Journal of Data Warehousing, (4).
11. McConnell, S., Rapid development: taming wild software schedules. Microsoft Press.
12. Tveit, A., et al., Anonymization of general practitioner medical records.
13. Forster, A., et al., The incidence and severity of adverse events affecting patients after discharge from the hospital. Annals of Internal Medicine, (3).

14. Mudiayi, T., A. Onyanga-Omara, and M. Gelman, Trends of morbidity in general medicine at United Bulawayo Hospitals, Bulawayo, Zimbabwe. The Central African Journal of Medicine, (8).
15. Meystre, S.M., et al., Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform.
16. Meystre, S.M., et al., Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology, (1).
17. E-Health. NHS Code of Practice on Protecting Patient Confidentiality. 2012; Available from:
18. Szarvas, G., R. Farkas, and R. Busa-Fekete, State-of-the-art anonymization of medical records using an iterative machine learning framework. J Am Med Inform Assoc, (5).
19. Heinze, D.T., M.L. Morsch, and J. Holbrook, Mining free-text medical records. American Medical Informatics Association.
20. HIPAA. HIPAA Privacy Rules. [Accessed 13 June 2012]; Available from:
21. Thomas, S.M., et al., A successful technique for removing names in pathology reports using an augmented search and replace method. American Medical Informatics Association.
22. Thacker, S.B., HIPAA Privacy Rule and Public Health. [Accessed June 2012]; Available from:
23. Taira, R.K., A.A.T. Bui, and H. Kangarloo, Identification of patient name references within medical documents using semantic selectional restrictions. American Medical Informatics Association.
24. Dorr, D., et al., Assessing the difficulty and time cost of de-identification in clinical narratives. Methods of Information in Medicine, (3).
25. Neamatullah, I., et al., Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making.
26. Douglass, M., et al., Computer-assisted de-identification of free text in the MIMIC II database.
27. Gupta, D., M. Saul, and J. Gilbertson, Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. American Journal of Clinical Pathology, (2).
28. Berman, J.J., Concept-match medical data scrubbing. Archives of Pathology & Laboratory Medicine, (6).

29. Beckwith, B.A., et al., Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Medical Informatics and Decision Making, (1).
30. Sweeney, L., Replacing personally-identifying information in medical records, the Scrub system. American Medical Informatics Association.
31. Sweeney, L., Guaranteeing anonymity when sharing medical data, the Datafly System. American Medical Informatics Association.
32. Ruch, P., et al., Medical document anonymization with a semantic lexicon. American Medical Informatics Association.
33. Uzuner, Ö., et al., A de-identifier for medical discharge summaries. Artificial Intelligence in Medicine, (1).
34. Wellner, B., et al., Rapidly retargetable approaches to de-identification in medical records. Journal of the American Medical Informatics Association, (5).
35. Cunningham, H., et al., Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science.
36. Aramaki, E., et al., Automatic deidentification by using sentence features and label consistency.
37. Guo, Y., et al., Identifying Personal Health Information Using Support Vector Machines.
38. Gardner, J. and L. Xiong, HIDE: An integrated system for health information de-identification. IEEE.
39. Hara, K., Applying a SVM Based Chunker and a Text Classifier to the Deid Challenge. Am Med Inform Assoc.
40. Karen, T., et al., De-identification of primary care electronic medical records free-text data in Ontario, Canada. BMC Medical Informatics and Decision Making.
41. Friedlin, F.J. and C.J. McDonald, A software tool for removing patient identifying information from clinical documents. Journal of the American Medical Informatics Association, (5).
42. Friedman, C., et al., A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association, (2).
43. Friedman, C., et al., Automated encoding of clinical documents based on natural language processing. Journal of the American Medical Informatics Association, (5).
44. Friedman, C., et al., A WEB-based version of MedLEE: A medical language extraction and encoding system. American Medical Informatics Association.

45. Morrison, F.P., et al., Repurposing the clinical record: can an existing natural language processing system de-identify clinical notes? Journal of the American Medical Informatics Association, (1).
46. Uramoto, N., et al., A text-mining system for knowledge discovery from biomedical documents. IBM Systems Journal, (3).
47. Abulaish, M. and L. Dey, Biological relation extraction and query answering from medline abstracts using ontology-based text mining. Data & Knowledge Engineering, (2).
48. Cohen, A.M. and W.R. Hersh, A survey of current work in biomedical text mining. Briefings in Bioinformatics, (1).
49. Jackson, P. and I. Moulinier, Natural language processing for online applications: Text retrieval, extraction and categorization. John Benjamins Pub Co.
50. Dybkjaer, L., H. Hemsen, and W. Minker, Evaluation of text and speech systems. Springer Verlag.
51. Cunningham, H., Software Architecture for Language Engineering, 2000, Ph.D thesis, University of Sheffield.
52. Cunningham, H., et al., GATE: an Architecture for Development of Robust HLT Applications. 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, PA, 2002.
53. GATE. GATE Website. [Accessed 17 August 2012]; Available from:
54. I2b2. I2B2 Website. [Accessed 19 August 2012]; Available from:
55. JISC. JISC Website. [Accessed 18 August 2012]; Available from:
56. Atwell, E., et al., e-Health GATEway to the Clouds. 2012; Available from:
57. NameLab. Name Lists. [Accessed 19 August 2012]; Available from:
58. GeoNames. GeoNames Website. [Accessed 19 August 2012]; Available from:
59. Atkinson, K., Spell checking oriented word lists, Revision. [Accessed May 2008]; Available from:
60. Roberts, A., et al., Building a semantically annotated corpus of clinical texts. Journal of Biomedical Informatics, (5).

61. Roberts, A., et al., Semantic annotation of clinical text: The CLEF corpus.
62. Ogren, P., G. Savova, and C. Chute, Constructing evaluation corpora for automated clinical named entity recognition.
63. Schildt, H., Java: The Complete Reference, 2006, TATA McGraw-Hill Publishing Company Ltd.
64. JAVA. Java 2 Platform Standard Edition 5.0 API Specification. [Accessed 19 August 2012]; Available from:
65. Java. Java Libraries. [Accessed 22 August 2012]; Available from:
66. JAR. JAR Website. [Accessed 22 August 2012]; Available from:
67. Sokolova, M. and G. Lapalme, A systematic analysis of performance measures for classification tasks. Information Processing & Management, (4).
68. Ispell. Ispell Website. [Accessed 22 August 2012]; Available from:
69. Newsham, A.C., et al., Development of an advanced database for clinical trials integrated with an electronic patient record system. Computers in Biology and Medicine.
70. NIN. National Insurance Number Wiki Website. [Accessed 27 August 2012]; Available from:

Appendix A

A.1 Project Reflection

During the course of the project, several challenges were faced in the development of the prototype, and a number of lessons were learnt from the time spent working on it. In this section a discussion of my personal challenges and experiences is provided, to enable future MSc students to gain insight from some of the lessons learnt.

One of the challenges of the project was that it was based mainly on designing expressions suited to the patterns within a dataset. This required both a comprehensive study of each PHI category that needed to be identified and a recursive design of the patterns. As patterns were learnt from different datasets, refining and testing had to proceed simultaneously. A modular design approach was used for the de-identification algorithm so that one section could easily be modified without affecting the others. It was, however, important that the implementation of new rules was not over-fitted to one dataset, as this would introduce errors not only for that dataset but also for others.

Studying on the Computing and Management course, and having only a basic understanding of programming concepts, it was more or less unknown how fruitful the methods implemented would be. It also required consistent learning of different relevant methods to see how well they could be incorporated into this project. However, by exploring different areas, sufficient knowledge was gained to reach this stage of the project.

I would also like to offer some advice to the future MSc students reading this. Once you have been allocated a supervisor in February, discuss and choose an interesting project that you will enjoy working on. For me, although the topic had been suggested by my supervisor, I was extremely excited while developing the tool. It was this sheer enjoyment that motivated me every single day to work endlessly towards improving and designing the system. Also, having worked in a similar area during my undergraduate course, I knew that choosing a project benefiting medical research was something I would be able to look back upon with a sense of gratification. This was further emphasized when we got to meet Dr. Geoff Hall and work with the cancer dataset that this project had initially been aimed at. This was a very interesting and amazing experience for me.

Some of the learning points for me started at the very beginning, with the understanding that data was a critical requirement for starting and proceeding with the project. It is very important to bear in mind that, irrespective of being well organised, it is necessary to anticipate issues when relying on another party to provide data. Since the project runs over only a short duration, the researcher should be well aware of this before starting work. For my project, the advice and help I received from the PhD students allowed me to find a quick workaround, but this will not always be the case and students should be aware of it.

Another thought is to make sure that you allow yourself enough time. It is important to have a well-structured project plan and to follow it through. It is also very important to meet with your supervisor weekly and to allow yourself to ask, and be questioned about, your work. Supervisors will always be glad to guide you, and you should take this opportunity to challenge yourself. During my project, I brought my questions and the work achieved to the weekly meetings; this helped me stay on track and pace my work. Although my supervisor guided me at every step, he also introduced me to several other research students working in the same area. This was really helpful, as they were aware of possible issues I could run into, and their guidance helped me make some wise decisions. My suggestion is that you should always try to meet other people in your area of research, as they provide insights that will help you with your work.

I would also add that the project period does not include any lectures, unlike the other two semesters, which means you tend to stop seeing most of your classmates. So although it is sometimes tempting to work from home, I found it very beneficial to come and work at the University. Working tirelessly through the night with some of my colleagues, especially during the last few weeks of term, really pushed me to work. Your friends are your best stress relievers when work seems endless. I had an enjoyable time with them during these days, and I would encourage other MSc students to keep close contact with their classmates, as they are your best support, especially for those of you who are away from home.

Finally, working on this project has been an outstanding experience. This was the first dissertation I had carried out, and the learning was a sheer joy. The feedback I received, and the time I spent working on this project, made some of my really low days seem bright and joyful. Completing this project has really consolidated my learning from the whole year, and I feel better placed for the new challenges and opportunities that might come knocking my way.

Appendix B

B.1 Interim Report Comments: Supervisor

B.2 Interim Report Comments: Assessor


Appendix C

C.1 Project Plan: Gantt Chart

Figure 13: Project plan Gantt chart

Appendix D

D.1 Regular Expressions

This appendix gives examples of the regular expressions used in the Med-Anonymizer software written in Java. Expressions have been written to identify PHI content within the free text provided as input, and they follow the format required to represent them in Java. Expressions are generally written within brackets and quotes, such as ("expression goes here"). The expression [0-9] indicates a digit; [A-Z] indicates any capital letter; [a-z] indicates any lower-case letter; \d matches a numeric character. A number in a pair of curly braces following an expression indicates the number of digits or letters to match: for example, \d{4} matches a 4-digit number. The expression \s matches white space. The question mark indicates an optional expression; + matches the preceding pattern element one or more times, whereas * matches it zero or more times. The vertical bar separates alternative expressions. The expression \w matches an alphanumeric character, and \b matches word boundaries. Within the algorithm these expressions are used to identify patterns within the free text, and a few examples are provided and explained here.

D.1.1 NINo (National Insurance Number)

To develop the expression, the format in which the number is represented needed to be considered: two prefix letters, six digits and one suffix letter. Following the guidelines for representing the National Insurance number, the first two letters cannot be D, F, I, Q, U or V; the second letter also cannot be O. The prefixes BG, GB, NK, KN, TN, NT and ZZ are not allocated. The following rules have therefore been encoded in the regular expression: the first letter may not be D, F, I, Q, U or V; the second letter may not be D, F, I, O, Q, U or V; the final letter is optional.

Matches: JG103759A, AP019283D, ZX047829C
Non-matches: DC135798A, FQ987654C, KL192845T
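As a sketch of how such an expression might look in Java (a reconstruction following the rules just listed; the exact pattern and class name used in Med-Anonymizer may differ), consider:

```java
import java.util.regex.Pattern;

public class NinoExample {
    // Plausible NINo pattern per the rules above (a reconstruction, not
    // necessarily the project's exact expression):
    // - disallowed prefixes BG, GB, NK, KN, TN, NT, ZZ are rejected up front
    // - first letter not in {D, F, I, Q, U, V}
    // - second letter additionally not O
    // - six digits, then an optional suffix letter A-D
    static final Pattern NINO = Pattern.compile(
            "(?!BG|GB|NK|KN|TN|NT|ZZ)"
            + "[A-CEGHJ-PRSTW-Z][A-CEGHJ-NPRSTW-Z]\\d{6}[A-D]?");

    public static void main(String[] args) {
        System.out.println(NINO.matcher("JG103759A").matches()); // true
        System.out.println(NINO.matcher("DC135798A").matches()); // false: D prefix
    }
}
```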

D.1.2 Email

An email address generally appears within the text in a standard form. To understand the regular expression designed to identify an email id, we can divide it into smaller components. Firstly, the address must begin with alphanumeric characters (both lower-case and upper-case characters are allowed) and may contain periods, underscores and hyphens. There must be an @ symbol after the initial characters. After the @ sign there must be further alphanumeric characters, which may also contain periods ('.') and hyphens ('-'). After this second group of characters there must be a period ('.'), to separate the domain and sub-domain names. Finally, the address must end with two to four letters: having a-z and A-Z means that both lower-case and upper-case letters are allowed, and this permits domain endings with 2, 3 and 4 characters (e.g. us, tx, org, com, net, wxyz).

D.1.3 IP Address

The expression is used to identify any valid IP address format represented within the range ( ).

D.1.4 Postcode

The first part of the expression identifies words in a string whose first two letters (in any case) are followed by one or two digits and a space; the second part identifies the rest of the string, which has one or more digits followed by letters in any case.
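Since the patterns in D.1.2-D.1.4 are given only descriptively, plausible Java reconstructions (assumptions based on the descriptions, not necessarily the exact patterns used in Med-Anonymizer) might look like:

```java
import java.util.regex.Pattern;

public class PatternExamples {
    // Email: alphanumeric local part (periods, underscores, hyphens allowed),
    // an @ sign, a domain with optional sub-domains, ending in 2-4 letters.
    static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9][A-Za-z0-9._-]*@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,4}");

    // IP address: four dotted groups, each restricted to 0-255.
    static final Pattern IP = Pattern.compile(
            "\\b(?:(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.){3}"
            + "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\b");

    // Postcode: two letters (any case), one or two digits, a space,
    // then one or more digits followed by letters in any case.
    static final Pattern POSTCODE = Pattern.compile(
            "\\b[A-Za-z]{2}\\d{1,2}\\s\\d+[A-Za-z]+\\b");

    public static void main(String[] args) {
        System.out.println(EMAIL.matcher("j.smith@leeds.ac.uk").matches()); // true
        System.out.println(IP.matcher("192.168.0.1").matches());           // true
        System.out.println(POSTCODE.matcher("LS2 9JT").matches());         // true
    }
}
```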

D.1.5 Date

There are two expressions for identifying dates. The first is used to identify dates represented in the numeric style, separated by '/', '.' or '-' symbols; most of the dates within the notes are represented in this format. Date patterns such as 3rd June 2003 or 25th Dec 2007 are evaluated using a second pattern, in which month contains a string representing a month of the year (such as January, Jan, February, Feb, etc.).
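Plausible Java reconstructions of these two date expressions (again assumptions based on the descriptions, rather than the project's exact patterns) are:

```java
import java.util.regex.Pattern;

public class DateExamples {
    // Numeric-style dates separated by '/', '.' or '-', e.g. 12/05/2007 or 3-6-03.
    static final Pattern NUMERIC_DATE = Pattern.compile(
            "\\b\\d{1,2}[/.\\-]\\d{1,2}[/.\\-]\\d{2,4}\\b");

    // Worded dates such as "3rd June 2003" or "25th Dec 2007": an optional
    // ordinal suffix, a month name or abbreviation, then a 4-digit year.
    static final String MONTH =
            "(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?"
            + "|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?"
            + "|Nov(?:ember)?|Dec(?:ember)?)";
    static final Pattern WORDED_DATE = Pattern.compile(
            "\\b\\d{1,2}(?:st|nd|rd|th)?\\s+" + MONTH + "\\s+\\d{4}\\b");

    public static void main(String[] args) {
        System.out.println(NUMERIC_DATE.matcher("12/05/2007").matches());   // true
        System.out.println(WORDED_DATE.matcher("3rd June 2003").matches()); // true
    }
}
```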

Appendix E

E.1 Presentation delivered at Progress Meeting, July 2012

Figure 14: Presentation slides

Appendix F

F.1 Comments on the software prototype

F.1.1 Samuel Danso, Kintampo Health Research Centre, Ghana

Niki, just a quick feedback on your tool. I went through the process and I must confirm that it works pretty well. I've just run it over some VA samples and it was able to identify all location names in the text and anonymized correctly. Thanks for this fantastic tool. This tool should be useful to further anonymize confidential information found in VA documents. Best, Sammy

F.1.2 Nikita Desai, Li Ka Shing Knowledge Institute, St. Michael's Hospital, Canada

Installing the tool was fairly easy, but it was important to first read the manual to understand the steps required. Overall the software itself is quite intuitive, so it was not hard to figure out how to work it. The software is useful; in the case of our verbal autopsies it would save us a lot of time manually going through our narratives and anonymizing them (which we have to do whenever we are sharing data).

F.1.3 Dr. Geoff Hall, Senior Oncologist, Leeds Cancer Centre

Geoff Hall met with us personally and was quite impressed with the tool. This provided much-needed encouragement and motivation while writing the thesis, as we were able to see at first hand the tool's performance on the actual cancer patient records.

Appendix G

G.1 Implementation of the Jar file

To create a Java executable, we need to package all the contents of the Java class file and the dictionaries into a single jar file. Before creating the jar file, a manifest file has to be created; the manifest file is used to define the extension- and package-related data. Our entire de-identification algorithm is a single class file named Med_AnonymizerCode, with the .class extension. The manifest file thus contains the name of the class file (as shown in figure 15).

Figure 15: Med_AnonymizerCode class and manifest file contents

The terminal is used to package the contents of the Med_AnonymizerCode class into the Med_Anonymizer.jar file. The command lines used to package the contents are shown in figure 16.

Figure 16: Command line arguments to create the jar file

*cvfm are the arguments for creating a .jar file.
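Figure 16 shows the exact invocation used; as an indication of its general form (the manifest file name here is an assumption), the standard jar tool is called with the flags c (create), v (verbose), f (output file name) and m (manifest):

```
jar cvfm Med_Anonymizer.jar manifest.txt Med_AnonymizerCode.class
```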

Appendix H

H.1 Tag Names

1) Person - <<PersonName>>
2) Location/Hospital Names - <<LocationName>>
3) Date - <<Date>>
4) Telephone/Fax Number - <<PhoneNumber>>
5) Email - <<Email>>
6) URL - <<URL>>
7) IP - <<IP>>
8) NHS Number - <<NHS>>
9) NINo Number - <<NIN>>
10) Other - <<Other>> (refers to ethnicities or ambiguous data)

Appendix I

I.1 Code

The program code has been written in Java, and both the class file and the Java file have been made available on the CD attached to this report, under the folder name Program Code.

I.2 Toolkit

The toolkit, named Med_Anonymizer, has also been made available on the CD under the folder name Med_Anonymizer Toolkit and can be downloaded and viewed.

I.3 User Manual

A separately attached copy of the user manual has also been provided along with the project report.

User Manual, Version 1.3
Nikita Raaj
University of Leeds, School of Computing
30 August
nikki.raaj@gmail.com / sc11nr@leeds.ac.uk

Med_Anonymizer Toolkit

The Med_Anonymizer toolkit is a software product designed to automate the process of identifying and removing Protected Health Information (PHI) from free-text patient records.

BACKGROUND

In the UK, the patient confidentiality privacy rule restricts the exchange of medical data containing PHI, defined as any information that might be used to identify the individual(s) from whom the data were collected. Data known to contain PHI cannot be shared for research purposes unless they have been made un-identifiable, that is, anonymized. To make such data available for research, the text files need to be anonymized first, and this tool has been designed for that purpose.

SOFTWARE PACKAGE

The Med_Anonymizer toolkit is a free software package and can be downloaded for patient records that require the PHI indicators to be anonymized. The author of the software package is Nikita Raaj, MSc student at the University of Leeds.

PREREQUISITES

The Med_Anonymizer package has been downloaded and tested on Mac OS X (Version ), Windows XP and Linux systems. The code is written in Java using Eclipse (Version 3.7.1), and the tool is a downloadable zip file. It can be used by downloading it on any system capable of running jar files.

DOWNLOADING AND INSTALLING THE MED-ANONYMIZER PACKAGE

The Med_Anonymizer toolkit is a single zipped folder named Med-Anonymizer.zip that can be downloaded as one software package. Steps to download and install the package: download the Med_Anonymizer.zip folder anywhere on the system and extract its contents.

Once the toolkit has been downloaded and all the contents have been extracted, the Med_Anonymizer folder will contain the following files: Med_Anonymizer.jar, PersonList.txt, LocationList.txt, PPersonList.txt, CommonWord.txt and Input.txt. To check that the tool has been installed correctly, run the Med_Anonymizer.jar file from the terminal or double-click on the jar file. Once the tool has been installed correctly, you can view the three output files, namely Tagged.txt, Anonymized.txt and NewList.txt. An error message will pop up in case the toolkit has not been installed correctly.

SPECIFICATIONS OF EACH FILE

1. Input.txt - The input text file holds the data that needs to be anonymized. The file stores only text input, so the dataset contents should be converted into a text-readable form and saved in the Input.txt file.
2. PersonList.txt - This file contains a list of extracted person names. The list can be modified to include names that might occur in the input dataset, so that any of their occurrences within the dataset are tagged and anonymized.
3. LocationList.txt - This file contains a list of cities within the UK. The list can be modified according to the user's requirements to include more cities.
4. CommonWord.txt - This file contains a small list of common words.
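For running the tool from a terminal, the standard invocation for an executable jar can be used (assuming a Java runtime is installed and on the PATH):

```
java -jar Med_Anonymizer.jar
```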


More information

ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH

ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH Journal of Computer Science 9 (7): 922-927, 2013 ISSN: 1549-3636 2013 doi:10.3844/jcssp.2013.922.927 Published Online 9 (7) 2013 (http://www.thescipub.com/jcs.toc) ARABIC PERSON NAMES RECOGNITION BY USING

More information

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

More information

Master of Science in Computer Science. Option Health Information Systems

Master of Science in Computer Science. Option Health Information Systems Master of Science in Computer Science Option Health Information Systems 1. The program Currently, in the Lebanese and most of Middle East s hospitals, the management of health information systems is handled

More information

PPInterFinder A Web Server for Mining Human Protein Protein Interaction

PPInterFinder A Web Server for Mining Human Protein Protein Interaction PPInterFinder A Web Server for Mining Human Protein Protein Interaction Kalpana Raja, Suresh Subramani, Jeyakumar Natarajan Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar

More information

Big Data Analytics in Health Care

Big Data Analytics in Health Care Big Data Analytics in Health Care S. G. Nandhini 1, V. Lavanya 2, K.Vasantha Kokilam 3 1 13mss032, 2 13mss025, III. M.Sc (software systems), SRI KRISHNA ARTS AND SCIENCE COLLEGE, 3 Assistant Professor,

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

TMUNSW: Identification of disorders and normalization to SNOMED-CT terminology in unstructured clinical notes

TMUNSW: Identification of disorders and normalization to SNOMED-CT terminology in unstructured clinical notes TMUNSW: Identification of disorders and normalization to SNOMED-CT terminology in unstructured clinical notes Jitendra Jonnagaddala a,b,c Siaw-Teng Liaw *,a Pradeep Ray b Manish Kumar c School of Public

More information

UTILIZING COMPOUND TERM PROCESSING TO ADDRESS RECORDS MANAGEMENT CHALLENGES

UTILIZING COMPOUND TERM PROCESSING TO ADDRESS RECORDS MANAGEMENT CHALLENGES UTILIZING COMPOUND TERM PROCESSING TO ADDRESS RECORDS MANAGEMENT CHALLENGES CONCEPT SEARCHING This document discusses some of the inherent challenges in implementing and maintaining a sound records management

More information

Integrity 10. Curriculum Guide

Integrity 10. Curriculum Guide Integrity 10 Curriculum Guide Live Classroom Curriculum Guide Integrity 10 Workflows and Documents Administration Training Integrity 10 SCM Administration Training Integrity 10 SCM Basic User Training

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier

More information

TRANSFoRm: Vision of a learning healthcare system

TRANSFoRm: Vision of a learning healthcare system TRANSFoRm: Vision of a learning healthcare system Vasa Curcin, Imperial College London Theo Arvanitis, University of Birmingham Derek Corrigan, Royal College of Surgeons Ireland TRANSFoRm is partially

More information

A De-identifier For Electronic Medical Records Based On A Heterogeneous Feature Set. Arya Tafvizi

A De-identifier For Electronic Medical Records Based On A Heterogeneous Feature Set. Arya Tafvizi A De-identifier For Electronic Medical Records Based On A Heterogeneous Feature Set by Arya Tafvizi S.B., Physics, MIT, 2010 S.B., Computer Science and Engineering, MIT, 2011 Submitted to the Department

More information

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Jan Paralic, Peter Smatana Technical University of Kosice, Slovakia Center for

More information

Degrees of De-identification of Clinical Research Data

Degrees of De-identification of Clinical Research Data Vol. 7, No. 11, November 2011 Can You Handle the Truth? Degrees of De-identification of Clinical Research Data By Jeanne M. Mattern Two sets of U.S. government regulations govern the protection of personal

More information

Automated Problem List Generation from Electronic Medical Records in IBM Watson

Automated Problem List Generation from Electronic Medical Records in IBM Watson Proceedings of the Twenty-Seventh Conference on Innovative Applications of Artificial Intelligence Automated Problem List Generation from Electronic Medical Records in IBM Watson Murthy Devarakonda, Ching-Huei

More information

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) 305 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference

More information

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,

More information

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Survey Results: Requirements and Use Cases for Linguistic Linked Data Survey Results: Requirements and Use Cases for Linguistic Linked Data 1 Introduction This survey was conducted by the FP7 Project LIDER (http://www.lider-project.eu/) as input into the W3C Community Group

More information

Healthcare Measurement Analysis Using Data mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

More information

Build Vs. Buy For Text Mining

Build Vs. Buy For Text Mining Build Vs. Buy For Text Mining Why use hand tools when you can get some rockin power tools? Whitepaper April 2015 INTRODUCTION We, at Lexalytics, see a significant number of people who have the same question

More information

The Software Process. The Unified Process (Cont.) The Unified Process (Cont.)

The Software Process. The Unified Process (Cont.) The Unified Process (Cont.) The Software Process Xiaojun Qi 1 The Unified Process Until recently, three of the most successful object-oriented methodologies were Booch smethod Jacobson s Objectory Rumbaugh s OMT (Object Modeling

More information

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS Divyanshu Chandola 1, Aditya Garg 2, Ankit Maurya 3, Amit Kushwaha 4 1 Student, Department of Information Technology, ABES Engineering College, Uttar Pradesh,

More information

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering

More information

Accelerating Clinical Trials Through Shared Access to Patient Records

Accelerating Clinical Trials Through Shared Access to Patient Records INTERSYSTEMS WHITE PAPER Accelerating Clinical Trials Through Shared Access to Patient Records Improved Access to Clinical Data Across Hospitals and Systems Helps Pharmaceutical Companies Reduce Delays

More information

Text Mining for Health Care and Medicine. Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk

Text Mining for Health Care and Medicine. Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk Text Mining for Health Care and Medicine Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk The Need for Text Mining MEDLINE 2005: ~14M 2009: ~18M Overwhelming information in textual,

More information

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture A prototype infrastructure for D Spin Services based on a flexible multilayer architecture Volker Boehlke 1,, 1 NLP Group, Department of Computer Science, University of Leipzig, Johanisgasse 26, 04103

More information

The What, When, Where and How of Natural Language Processing

The What, When, Where and How of Natural Language Processing The What, When, Where and How of Natural Language Processing There s a mystique that surrounds natural language processing (NLP) technology, regarding how it works, and what it can and cannot do. Although

More information

SemWeB Semantic Web Browser Improving Browsing Experience with Semantic and Personalized Information and Hyperlinks

SemWeB Semantic Web Browser Improving Browsing Experience with Semantic and Personalized Information and Hyperlinks SemWeB Semantic Web Browser Improving Browsing Experience with Semantic and Personalized Information and Hyperlinks Melike Şah, Wendy Hall and David C De Roure Intelligence, Agents and Multimedia Group,

More information

Travis Goodwin & Sanda Harabagiu

Travis Goodwin & Sanda Harabagiu Automatic Generation of a Qualified Medical Knowledge Graph and its Usage for Retrieving Patient Cohorts from Electronic Medical Records Travis Goodwin & Sanda Harabagiu Human Language Technology Research

More information

Master Degree Project Ideas (Fall 2014) Proposed By Faculty Department of Information Systems College of Computer Sciences and Information Technology

Master Degree Project Ideas (Fall 2014) Proposed By Faculty Department of Information Systems College of Computer Sciences and Information Technology Master Degree Project Ideas (Fall 2014) Proposed By Faculty Department of Information Systems College of Computer Sciences and Information Technology 1 P age Dr. Maruf Hasan MS CIS Program Potential Project

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche To cite this version: Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet,

More information

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) 299 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference

More information

CENG 734 Advanced Topics in Bioinformatics

CENG 734 Advanced Topics in Bioinformatics CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS

AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS Alok Ranjan Pal 1, 3, Anirban Kundu 2, 3, Abhay Singh 1, Raj Shekhar 1, Kunal Sinha 1 1 College of Engineering and Management,

More information

TOWARD A FRAMEWORK FOR DATA QUALITY IN ELECTRONIC HEALTH RECORD

TOWARD A FRAMEWORK FOR DATA QUALITY IN ELECTRONIC HEALTH RECORD TOWARD A FRAMEWORK FOR DATA QUALITY IN ELECTRONIC HEALTH RECORD Omar Almutiry, Gary Wills and Richard Crowder School of Electronics and Computer Science, University of Southampton, Southampton, UK. {osa1a11,gbw,rmc}@ecs.soton.ac.uk

More information

Module 2. Software Life Cycle Model. Version 2 CSE IIT, Kharagpur

Module 2. Software Life Cycle Model. Version 2 CSE IIT, Kharagpur Module 2 Software Life Cycle Model Lesson 3 Basics of Software Life Cycle and Waterfall Model Specific Instructional Objectives At the end of this lesson the student will be able to: Explain what is a

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

BUSINESS RULES AND GAP ANALYSIS

BUSINESS RULES AND GAP ANALYSIS Leading the Evolution WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Discovery and management of business rules avoids business disruptions WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Business Situation More

More information

Efficient De-Identification of Electronic Patient Records for User Cognitive Testing

Efficient De-Identification of Electronic Patient Records for User Cognitive Testing 2012 45th Hawaii International Conference on System Sciences Efficient De-Identification of Electronic Patient Records for User Cognitive Testing Kenric W. Hammond Department of Veterans Affairs kenric.hammond@va.gov

More information

Whitepapers on Imaging Infrastructure for Research Paper 1. General Workflow Considerations

Whitepapers on Imaging Infrastructure for Research Paper 1. General Workflow Considerations Whitepapers on Imaging Infrastructure for Research Paper 1. General Workflow Considerations Bradley J Erickson, Tony Pan, Daniel J Marcus, CTSA Imaging Informatics Working Group Introduction The use of

More information

Digital Asset Manager, Digital Curator. Cultural Informatics, Cultural/ Art ICT Manager

Digital Asset Manager, Digital Curator. Cultural Informatics, Cultural/ Art ICT Manager Role title Digital Cultural Asset Manager Also known as Relevant professions Summary statement Mission Digital Asset Manager, Digital Curator Cultural Informatics, Cultural/ Art ICT Manager Deals with

More information

RE: Comments on Discussion Draft Ensuring Interoperability of Qualified Electronic Health Records.

RE: Comments on Discussion Draft Ensuring Interoperability of Qualified Electronic Health Records. April 8, 2015 The Honorable Michael Burgess, MD 2336 Rayburn House Office Building Washington, DC 20515 RE: Comments on Discussion Draft Ensuring Interoperability of Qualified Electronic Health Records.

More information

Information Discovery on Electronic Medical Records

Information Discovery on Electronic Medical Records Information Discovery on Electronic Medical Records Vagelis Hristidis, Fernando Farfán, Redmond P. Burke, MD Anthony F. Rossi, MD Jeffrey A. White, FIU FIU Miami Children s Hospital Miami Children s Hospital

More information

Achieving Value from Diverse Healthcare Data

Achieving Value from Diverse Healthcare Data Achieving Value from Diverse Healthcare Data Paul Bleicher, MD, PhD Chief Medical Officer Humedica Boston MA The Information Environment in Healthcare Pharma, Biotech, Devices Hospitals Physicians Pharmacies

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Association Technique on Prediction of Chronic Diseases Using Apriori Algorithm

Association Technique on Prediction of Chronic Diseases Using Apriori Algorithm Association Technique on Prediction of Chronic Diseases Using Apriori Algorithm R.Karthiyayini 1, J.Jayaprakash 2 Assistant Professor, Department of Computer Applications, Anna University (BIT Campus),

More information

Masters in Information Technology

Masters in Information Technology Computer - Information Technology MSc & MPhil - 2015/6 - July 2015 Masters in Information Technology Programme Requirements Taught Element, and PG Diploma in Information Technology: 120 credits: IS5101

More information

The Use of Patient Records (EHR) for Research

The Use of Patient Records (EHR) for Research The Use of Patient Records (EHR) for Research Mary Devereaux, Ph.D. Director, Biomedical Ethics Seminars Assistant Director, Research Ethics Program & San Diego Research Ethics Consortium Abstract The

More information

Network Working Group

Network Working Group Network Working Group Request for Comments: 2413 Category: Informational S. Weibel OCLC Online Computer Library Center, Inc. J. Kunze University of California, San Francisco C. Lagoze Cornell University

More information

Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper CAST-2015 provides an opportunity for researchers, academicians, scientists and

More information

DeMISTifying Deidentification of PHI in Free-formatted Text

DeMISTifying Deidentification of PHI in Free-formatted Text DeMISTifying Deidentification of PHI in Free-formatted Text Cathy Petrozzino March 2016 Approved for Public Release; Distribution Unlimited. Case Number 16-0670 2016 The MITRE Corporation. All rights reserved.

More information

Introduction to Systems Analysis and Design

Introduction to Systems Analysis and Design Introduction to Systems Analysis and Design What is a System? A system is a set of interrelated components that function together to achieve a common goal. The components of a system are called subsystems.

More information

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania oananicolae1981@yahoo.com

More information

Anonymizing Unstructured Data to Enable Healthcare Analytics Chris Wright, Vice President Marketing, Privacy Analytics

Anonymizing Unstructured Data to Enable Healthcare Analytics Chris Wright, Vice President Marketing, Privacy Analytics Anonymizing Unstructured Data to Enable Healthcare Analytics Chris Wright, Vice President Marketing, Privacy Analytics Privacy Analytics - Overview For organizations that want to safeguard and enable their

More information

Clinical and research data integration: the i2b2 FSM experience

Clinical and research data integration: the i2b2 FSM experience Clinical and research data integration: the i2b2 FSM experience Laboratory of Biomedical Informatics for Clinical Research Fondazione Salvatore Maugeri - FSM - Hospital, Pavia, italy Laboratory of Biomedical

More information

SOCIS: Scene of Crime Information System - IGR Review Report

SOCIS: Scene of Crime Information System - IGR Review Report SOCIS: Scene of Crime Information System - IGR Review Report Katerina Pastra, Horacio Saggion, Yorick Wilks June 2003 1 Introduction This report reviews the work done by the University of Sheffield on

More information

Information Access Platforms: The Evolution of Search Technologies

Information Access Platforms: The Evolution of Search Technologies Information Access Platforms: The Evolution of Search Technologies Managing Information in the Public Sphere: Shaping the New Information Space April 26, 2010 Purpose To provide an overview of current

More information

How To Design An Information System

How To Design An Information System Information system for production and mounting of plastic windows MARCEL, MELIŠ Slovak University of Technology - Faculty of Material Sciences and Technology in Trnava, Paulínska 16 street, Trnava, 917

More information

English Grammar Checker

English Grammar Checker International l Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 English Grammar Checker Pratik Ghosalkar 1*, Sarvesh Malagi 2, Vatsal Nagda 3,

More information

Table of Contents. Chapter No. 1 Introduction 1. iii. xiv. xviii. xix. Page No.

Table of Contents. Chapter No. 1 Introduction 1. iii. xiv. xviii. xix. Page No. Table of Contents Title Declaration by the Candidate Certificate of Supervisor Acknowledgement Abstract List of Figures List of Tables List of Abbreviations Chapter Chapter No. 1 Introduction 1 ii iii

More information

Personalized Medicine: Humanity s Ultimate Big Data Challenge. Rob Fassett, MD Chief Medical Informatics Officer Oracle Health Sciences

Personalized Medicine: Humanity s Ultimate Big Data Challenge. Rob Fassett, MD Chief Medical Informatics Officer Oracle Health Sciences Personalized Medicine: Humanity s Ultimate Big Data Challenge Rob Fassett, MD Chief Medical Informatics Officer Oracle Health Sciences 2012 Oracle Corporation Proprietary and Confidential 2 3 Humanity

More information