Pseudonymisation Implementation Project (PIP) Reference Paper 4
Pseudonymisation Technical White Paper - Design and MS-SQL
FV2, 24th March 2010
Without Prejudice
Programme: NPFIT
Sub-Prog / Project: Pseudonymisation Implementation Project (PIP)
Prog. Director: J Thorp
Document Record ID Key: NPFIT-GUIDANCE-4
Version: FV2
Status: Final
Authors: Wally Gowing / John Nickson
Version Date: 24/03/2010

Document Status: This is a controlled document. Whilst this document may be printed, the electronic version maintained in FileCM is the controlled copy. Any printed copies of the document are not controlled.

Related Documents: These documents will provide additional information.
Ref 1: NPFIT-FNT-TO-BPR - PIP Implementation Guidance (FV1.1)
Ref 2: NPFIT-FNT-TO-BPR - Reference Paper 1 - Terminology (FV1.1)
Ref 3: NPFIT-FNT-TO-BPR - Reference Paper 2 - Business Processes and Safe Havens (FV1.1)
Ref 4: NPFIT-FNT-TO-BPR - Reference Paper 3 - De-identification (FV)
Ref 5: PIP Planning Template and Guidance
Ref 6: TBA - SQL Code Examples as a Standalone Set of SQL
CONTENTS

1 Introduction
   1.1 Purpose and scope
   1.2 Context
2 The Design Process
   2.1 Introduction
   2.2 Step 1 - Establish the project
   2.3 Step 2 - Review the Business Requirements
   2.4 Step 3 - Identify Technical Requirements
   2.5 Step 4 - Review the Options
   2.6 Step 5 - Determine the technical solution
   2.7 Step 6 - Plans
   2.8 Step 7 - Review the risks
   2.9 Step 8 - Iterate requirements, risks and the technical solution
   Tables in support of Figure 1
3 System Design - MS SQL Server
   3.1 Introduction
   3.2 Code Samples - Summary of Testing Approach
   3.3 Design issues
   3.4 Pseudonymisation methods
   3.5 Other considerations
4 Implementation with SQL Server
   Preparing data for pseudonymisation
   Pseudonymisation methods
   Extracting a value from a hash to create a public pseudonym
   Putting it together
   Pseudonymising dates
   Indexes
Appendix 1 - Other useful sources
Appendix 2 - Mechanisms for Creating Random Numbers
Appendix 3 - List of code samples

TABLES
Table 1 - Weaknesses in provider operational systems
Table 2 - Identification of Individuals at Risk
Table 3 - Secondary data to General Practice
Table 4 - Clinical Audit
Table 5 - Data Quality
Table 6 - Tracing and validation of Practice/PCT
Table 7 - Spatial analysis
Table 8 - Range and length of extracted varbinary data
Table 9 - Extract lengths to support postcode and NHS Number

FIGURES
Figure 1 - Sample uses of identifiable data
Figure 2 - Pseudonymisation design
Figure 3 - Different input formats can give rise to different pseudonyms
Figure 4 - The hashbytes function
Figure 5 - Example of code to create a salted pseudonym
Figure 6 - Extracting a fragment from a hashed string as a candidate public pseudonym
Figure 7 - Beware of binary data types when extracting fragments
Figure 8 - Pseudonymisation process
Figure 9 - Approaches to generating pseudo-random numbers compared
Figure 10 - Distribution of integer subset of hash function over 1,000,000 cases
1 Introduction

1.1 Purpose and scope

This paper provides guidance on technical aspects to support local implementation of de-identification for the Pseudonymisation Implementation Project (PIP) and links to the paper on de-identification (Ref 4). The purpose of the paper is to:

- Provide a design and development context for implementation of technical solutions
- Set out basic principles for technical solutions
- Provide a toolkit for undertaking technical implementation
- Provide sample code for enabling existing systems to be modified to facilitate use of pseudonymised data

The focus of this paper is on practical approaches to the design of a system that allows the creation and maintenance of pseudonyms. This is in the context of local organisations that need to support a mix of primary and secondary uses of patient level data with data drawn from a range of sources. Whilst it is intended to be a standalone document for supporting technical implementation, references to other PIP guidance documents are included for readers to access additional relevant material.

The first two sections set the context for the technical implementation, whilst the later sections are aimed principally at technical staff involved in the implementation of solutions for de-identification. It should be noted that this paper is specifically targeted at NHS organisations and potentially their suppliers; it is not intended for more general consumption. The paper complements PIP Reference Paper 3, Guidance on De-Identification (Ref 4).

It is also important that both the technical approach to implementation and the associated access controls are considered in the context of the business and associated Information Governance requirements that the application must support. Sections one and two of this paper are generic; sections three and four of this version support implementation for MS-SQL Server users. An equivalent document is currently in production to provide an Oracle version of sections three and four.

The paper contains technical language and a multiplicity of acronyms, not all of which may be defined, but which are expected to be understood by NHS IT staff.

1.2 Context

The aim of de-identification through pseudonymisation is to allow data to be assembled and analysed at person level without the need to reveal identity. An effective pseudonym will destroy any structure within a relevant field which might allow it to be reconstituted or otherwise allow an individual to be identified (other than through a secure and pre-defined mechanism). However, the pseudonym must maintain sufficient information from the original identifiable text to provide a consistent basis for discriminating different cases and associating those cases which are the same.

The implementation of pseudonymisation is particularly useful when information crosses system boundaries and domains, so that data is not being handled within a single security framework. This is common even within a single organisation, for example where data from multiple operational systems is gathered into a single data repository/warehouse for secondary analysis. Outside the organisation, there may be a need to provide data to third parties who need to be able to distinguish individuals without the ability to identify them.

It should be added that this paper is targeted at users of medium sized databases and data repositories of the size supported by NHS Trusts, PCTs and shared Health Informatics Services.
While some reference is made to performance issues, these are of greater significance for very large databases (such as those at national level) and may require some change to the relatively simple techniques identified here.

It cannot be emphasised too strongly that pseudonymisation is not an alternative to the maintenance of rigorous security across all the layers of the solution. A security failure at any one of these layers can mean failure for an entire solution; areas of concern include:

- Security policies and human behaviour
- Network security (firewalls, ports, encryption)
- Operating system security
- Server level security (endpoints, server level logins, ports, protocols and other surface area configurations)
- Database level security (granting rights to logins/roles/schemas, encryption options, and determining what rights are appropriate when)
- All solution components (such as ETL and, for SQL sites, whether using SQL Server Integration Services (SSIS) or otherwise, the SQL Server database, SQL Server Analysis Services (SSAS) and the products used to support the reporting layer, including third party products, SQL Server Reporting Services, MS-Excel etc)

A full consideration of these issues is outside the scope of this document, although some useful references are identified in the bibliography. These references are not intended to be all-encompassing.

This document provides guidance and suggested techniques and code that organisations can adopt to support pseudonymisation. Code examples are provided to illustrate the narrative only and no responsibility is taken for the correctness or reliability of their operation; it is the absolute responsibility of the user to test the functionality and operation of any proposed changes to the systems for which they are responsible and to ensure that such testing has been undertaken in respect of the systems they use.
2 The Design Process

2.1 Introduction

Few NHS providers or commissioners have substantial in-house capability to develop and sustain complex IT solutions. Where the need for change to support de-identification does exist, it is likely that organisations will look to third party solution suppliers, multi-organisation collaborations or in-house services to provide suitable depth and continuity of expertise.

With possible limitations of skills and knowledge in mind, the general approach to implementing de-identification is to keep it simple. There is a significant risk of over-engineering that can be avoided through a careful review of requirements. In particular, a requirement to maintain data both in the original clear identifiable form and within (or accessible from) a given database significantly increases both the functional requirements and the attention which must be given to the security infrastructure. The need for and scope of any such requirement should therefore be the subject of explicit review. The conclusions of the review on the approach to be taken will require sign-off by the relevant Senior Information Risk Officer (SIRO).

A second key aspect of the overall approach is the need to take a strategic view of where pseudonymisation fits with business needs and how this is best handled given the systems or services available to the organisation. The starting point is to stand back and clarify the requirements and then to review the options available to implement them.

A series of logical steps that should be followed is set out below. Whilst this list may seem lengthy, all of the steps have been included to act as an aide-memoire in preparing a solution. The list and the guidance are a means to an end and are not intended to generate unnecessary work or take up inordinate amounts of time.

It should be noted that the author of this text is versed in MS Windows and MS SQL. Other operating systems and software packages are available and may use different terminology. An Oracle-specific version of Sections 3 and 4 is in preparation.

2.2 Step 1 - Establish the project

The existing Implementation Guidance (Ref 1) and the associated planning templates have already set out the need for a well defined, sponsored project to implement the changes required to effectively deal with Local NHS Data Usage and Governance for Secondary Uses.

The overall change involved in revising arrangements for data usage and governance and the introduction of de-identification techniques is a classic project of moving from point A to point B. Organisations must have established a formal project to achieve this and, once they have identified a sponsor and a team, they can plan the process based on the 12 essential steps identified in the previously submitted planning template and maturity model. In effect, the technical implementation should be a component of a wider implementation project.

2.3 Step 2 - Review the Business Requirements

Prior to the technical implementation stage of the project, the organisation should be clear about the requirements in relation to the use of patient data in both identifiable and de-identified forms in operating and undertaking its business and associated processes.

The default position should be that staff needing access to patient level data should be accessing de-identified data unless there is justifiable cause to access identifiable data. For instance, is risk stratification analysis to be undertaken to support targeting of primary care services?
If so, identifiable data will only need to be made available to a known group of clinicians with a legitimate right of access to that data for that purpose; those undertaking the general analysis do not have such a right.
Access to identifiable data should be documented and signed off by the organisation's Caldicott Guardian.

As indicated in the PIP Planning Guidance, organisations must have policies and procedures in place for determining who has access to identifiable data and for what reason, together with a process of approval of such access, by the time the revised access to identifiable data is implemented.

After the technical requirements have been developed (see below), the interaction of the business and technical requirements should be reviewed to ensure consistency and, where possible, to seek simplification in the technical design.

2.4 Step 3 - Identify Technical Requirements

The technical requirements should derive from the business requirements. This section sets out a list of some of the issues to be considered when determining how de-identification and pseudonymisation will be implemented in light of those requirements. While the checklist below is relatively long, the approach should always be to look for options that avoid the need for complexity, and not all cases will apply to all organisations.

Step 3a - Identify flows of patient level data

The need to identify existing flows and review their purpose and content against business needs has been indicated in Section 2.3 and is considered in more detail in the guidance documents (Ref 1, Ref 2, Ref 3, Ref 4). Knowledge about the flows and their usage is the starting point for developing the technical requirements.

Step 3b - Identify and confirm who needs to access personal identifiers and why

The complexity of the approach adopted to implement de-identification within a local reporting system will depend on:

- Whether there is a need to access personal identifiers. Cases where the requirement is to de-identify data consistently with no requirement for re-identification are inherently simpler.
- The drivers underlying those needs, since these have wide variation in impact.
- The alternative approaches that are available to meet those needs.

The examples in Figure 1 below indicate some of the considerations that arise from different drivers which have come to the attention of the PIP Team. They bear directly on a number of key considerations:

- Who does the need to see data in identifiable form apply to?
- In what circumstances does the need arise and what is the requirement for re-identification?
- Which identifiers (i.e. identifiable data items) need to be accessible as identifiable data?
- How up to date does the reporting system need to be, and what are the implications of this for the volume of data to be processed and the time available for processing?

The entries in Figure 1 refer to supporting tables which, to aid readability, have been grouped together at the end of this section.
Figure 1 - Sample uses of identifiable data

Weaknesses in operational systems in provider organisations: As well as supporting secondary use, the reporting solution supports the front line of delivery of care by providing access to locally developed reports which are not available from the operational system. Organisations should consider weaknesses that exist within current operational systems. (Table 1)

Identification of individuals at risk: Analysis of the population using personal data drawn from primary care, interventions by primary care, interaction with secondary care and other sources, to identify persons where intervention will reduce the risk of transition to a worsened state of health. (Table 2)

Provision of apparent secondary use data to General Practice: To meet the requirement from GPs to be able to identify data relating to those patients with whom they have a legitimate relationship, so that they can check data quality and cross-reference to their own expectations around the delivery of care, e.g. whether activity recorded in secondary care was consistent with expectations. (Table 3)

Support for clinical audit: To enable detailed audit (e.g. by pulling patient records) following analysis of a large sample of data (including linked records). (Table 4)

To enable data quality issues to be addressed: To enable primary sources to be checked when analysis of the data indicates inconsistencies. This includes checking anomalous cases which are identified in audit validation. (Table 5)

To support tracing: To support the use of tracing services to confirm NHS Number status, practice and PCT. However, please note that the Secondary Uses Service now includes practice and PCT derivation against NHS Number via PDS, avoiding the need to trace to confirm these items. Maintaining data in identifiable form for tracing purposes rather than making use of this embedded derivation will therefore require particular justification. (Table 6)

Spatial analysis: To allow the allocation of data to geographic areas to support mapping and other spatial analysis. (Table 7)

Linking diverse data sources: A key reason for receiving data in identifiable form rather than pre-pseudonymised is the need to be able to link diverse datasets.

Unknown future requirements: Attempting to cater for unknown unknowns is not an acceptable justification for maintaining personal data or building functionality with no obvious rationale. However, two specific considerations are mentioned here; these relate to the proxy personal identifiers of date of birth and postcode. In the first case, there may be a case for retaining the ability to access the date in identifiable form when new events are to be added and age at event is to be maintained. In the case of postcode, there may be a need to ensure the ability to map to revised 2011 Census boundaries to support analysis in Public Health. It is to be emphasised that these cases should be the subject of specific consideration and justification.
Step 3c - Identify any sensitive data flows

Does any data require special treatment because of its sensitivity? For example, sexual health, GUM or addiction services may all require special treatment to avoid identification under any scenario, or to ensure that there is no possibility of visibility outside the New Safe Haven. This is a particular concern in an environment where there is a stated requirement for patients associated with other datasets to be identifiable to a sub-set of users.

The simplest and safest approach is to anonymise such cases completely or not to publish them in data structures visible outside the environment of the New Safe Haven. It is also possible to identify and implement approaches which deliver internal linkage within the sensitive dataset based on non-reversible pseudonyms that are unique to that set.

The existence of a true need to go beyond this adds significant complexity and requires a very robust solution. Specific issues include:

- Strengthened authentication for any users able to see data in identifiable form
- The ability to maintain multiple pseudonymisations
- Encryption requirements
- The stage of processing where de-identification must take place
- Targeting of audit functionality

Step 3d - Identify whether there is a need to support multiple pseudonymisations

Applying different pseudonymisations to the same data is a way of segmenting the ability to link it, and of reducing both the risk that pseudonymisation will be broken by the creation of maps and the potential impact of any such breach. It is a major consideration in the design of national systems which, for example, are intended to meet the needs of researchers.

However, it adds complication to design and will not normally be a requirement for a single organisation. (Later discussion identifies the case for maintaining distinct root and output pseudonyms; the comment here relates to the need to maintain many output pseudonyms.) Nonetheless, it is important that any requirement is identified as part of the review, because such requirements are easier to incorporate into an initial solution than to retrofit.

Any requirement to maintain multiple pseudonyms is most likely to arise from the need to handle especially sensitive data or from the need to send data to third parties. It is important to recognise that these requirements are likely to differ in detail:

- The first relates to the need to pseudonymise on or prior to input to the reporting solution, so that there is no point of contact between the data of special concern and other data where linkage might disclose sensitive information.
- The second case relates to output pseudonymisation. While this can be handled on a one-off basis using the techniques set out below, the creation of general functionality is significantly more complex and outside the scope of this paper. (For example, it not only involves the implementation and maintenance of some form of key store, but consideration must also be given to the mechanisms available to associate the applied pseudonymisation with the relevant data set as metadata.)
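To make the idea of multiple pseudonymisations concrete, the sketch below (illustrative only; the salt values and names are invented here, and real salts must be randomly generated and protected) applies SQL Server's HASHBYTES function to the same NHS Number with two different salts, producing two pseudonyms that cannot be linked to one another without knowledge of the salts:

    -- Two independent pseudonymisations of the same clear value.
    -- The salt values are placeholders; real salts must be generated
    -- randomly and protected, as discussed under Step 5 below.
    DECLARE @NHSNumber varchar(10);
    DECLARE @SaltA varchar(40);
    DECLARE @SaltB varchar(40);
    SET @NHSNumber = '9434765919';
    SET @SaltA = 'salt-for-internal-linkage';
    SET @SaltB = 'salt-for-third-party-extract';

    SELECT HASHBYTES('SHA1', @SaltA + @NHSNumber) AS PseudonymA,
           HASHBYTES('SHA1', @SaltB + @NHSNumber) AS PseudonymB;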
Step 3e - Identify the data derivations that need to be supported

This issue relates to the two proxy personal identifiers: date and postcode. In these cases the primary approach to de-identification is abstraction, using derivations which reduce information content, although the maintenance of true pseudonyms is relevant where the data is to be used to support record matching. The first requirement is therefore to identify which derivations are required and ensure that these are pre-calculated.

Otherwise, the strategic issue relates to whether there is a need to maintain the identifiable form of the data in full once derivations have been undertaken and how, if maintained, these data will be kept secure.

It also needs to be recognised that some derivations involve such a low level of abstraction that, if maintained, they have the potential to act as effective proxy identifiers and should be subject to the same degree of security and confidentiality. Specific cases include:

- Grid references and other geocodes
- Census output areas (for the 2001 Census, the minimum Output Area size was 40 resident households and 100 resident persons)
- Age derivations at the level of days, weeks or months

As far as the potential need to maintain full postcodes is concerned, the next Census is due on 27 March 2011. While it is intended that Super Output and Output Areas will remain relatively stable, some change will be inevitable and some users/uses may need to remap data to match these.

Step 3f - Identify and address data quality issues which would give inconsistent pseudonymisation

Inconsistently formatted data will generate inconsistent pseudonyms; this is discussed in detail later in this paper. The need is therefore twofold:

- Identify issues with historic data which must be addressed before and while implementing pseudonymisation.
- Identify current risks and issues and ensure that these are handled during processing on a continuing basis.

Sample code to address some of the above issues is included within the Appendices to this paper.
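As an indicative sketch of the kind of formatting fix involved (the function name is invented; the paper's own samples are in its appendices), a routine to normalise postcodes before pseudonymisation might look like this:

    -- Normalise a postcode so that 'sw1a1aa', 'SW1A 1AA' and 'SW1A  1AA'
    -- all present the same input to the pseudonymisation routine.
    CREATE FUNCTION dbo.fn_NormalisePostcode (@Postcode varchar(10))
    RETURNS varchar(8)
    AS
    BEGIN
        DECLARE @p varchar(10);
        -- Remove all spaces and force upper case
        SET @p = UPPER(REPLACE(@Postcode, ' ', ''));
        -- Re-insert the single space before the three-character inward code
        RETURN STUFF(@p, LEN(@p) - 2, 0, ' ');
    END;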
2.5 Step 4 - Review the Options

As indicated in Section 2.3, business and technical requirements should be reviewed with the aim of ensuring that the approach adopted offers the most effective way to meet requirements within the required timescale (which will inevitably be the simplest). Cases for specific consideration include:

- Is the reporting system being asked to maintain information which should and could be managed by operational systems? A case in point is the identification of individuals at risk who are identified by analysis of large populations. Once the cases have been identified, the better and simpler approach is to identify the individuals within the reporting system through a process of specific de-identification and then undertake further management through the use of an operational patient system (e.g. via a virtual ward; see en.wikipedia.org/wiki/virtual_wards). The impact of adopting this approach is to significantly reduce the requirements the reporting system must support.
- Can any of the requirements be better addressed by enhancing operational systems? This is particularly relevant where the reporting system is being stretched to encompass identifiable data to address weaknesses in the reporting functionality of operational systems. Again, the impact of adopting this approach will be to significantly reduce requirements around the reporting system.
- Requirements which involve the need to support cross-linkage of less sensitive and especially sensitive data flows, or the de-identification of the latter, require specific justification.

2.6 Step 5 - Determine the technical solution

The key need is to develop and document revisions to the existing systems and process design in light of the technical requirements, piloting if necessary, even where the proposed changes appear minor. The development of the technical solution should be used as an opportunity to check that the solution fully meets security requirements and should not be limited to the implementation of pseudonymisation alone. Areas that should receive some consideration include the following.

Structure

SQL Server and Oracle have a wide range of sophisticated data management facilities and tools (e.g. analysis services, Internet Information Services). The structure of the revised overall solution needs to utilise the benefits of these facilities and tools in a coherent fashion.

Security model

The security facilities inherent in SQL Server and Oracle need to be used as the basis of the overall security model, enabling different users to access the functionality and data relevant to them, and only them. In addition to the need to assure the overall security of the solution at all layers, specific consideration should be given to the following issues:

- Which additional objects and principals (groups and users) are to be created to support it, and what are the rights of each?
- Is encryption required and, if so, of which fields?
- How are input data (e.g. .csv files) to be secured?
- How will the values of salts (a salt comprises random bits that are used as one of the inputs to a key derivation function; a salt can also be used as part of a key in a cipher or other cryptographic algorithm) and keys be protected?
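As an indicative sketch only (all object and schema names invented), one common pattern is to deny analyst roles direct access to the schema holding clear values while granting access to views that expose only pseudonymised columns:

    -- Analysts see only pseudonymised data; clear values live in a schema
    -- reserved for the New Safe Haven.
    CREATE ROLE AnalystRole;
    GO
    CREATE VIEW Reporting.vEpisode
    AS
    SELECT p.PublicPseudonym1, e.AdmissionDate, e.DiagnosisCode
    FROM   SafeHaven.PseudonymMaster AS p
           JOIN SafeHaven.Episode AS e ON e.RootPseudonym = p.RootPseudonym;
    GO
    -- Ownership chaining allows the view to read the underlying tables,
    -- while the DENY blocks direct queries against the clear data.
    GRANT SELECT ON Reporting.vEpisode TO AnalystRole;
    DENY  SELECT ON SCHEMA::SafeHaven TO AnalystRole;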
Data quality checks

The need to undertake data discovery and to address data quality issues, both at initial implementation and during load processing, is discussed in detail in the sections below. Inconsistently formatted data will cause different pseudonyms to be created, and pseudonymisation may obscure data quality issues. Such issues therefore need to be identified and rectified and/or flagged prior to pseudonymisation.

Specific and General Audit

As indicated in PIP Reference Paper 3, logs should be kept of access to identifiable data for audit purposes. Some of these logging and audit requirements may be met by features intrinsic to the software (e.g. SQL Server or Oracle), or additional logs (e.g. for use of any re-identification facility) may be required, or possibly combinations of both. A log of key events, such as an explicit call to a routine to undertake de-identification, is a particularly valuable mechanism for ensuring that Information Governance requirements are not being breached.

Robustness of ETL and operational processes

The consequences of process failure or database corruption may be more severe when data is pseudonymised because of the absolute importance of maintaining referential integrity. There is a need therefore to review existing processes for robustness and ensure that they follow best practice:

- Are ETL processes sufficiently robust?
- What risks are associated with the new processes introduced to support pseudonymisation, e.g. new risks that an attempt will be made to insert a duplicate value on a primary key, causing processes to fail? How will the design avoid these and what additional checks are required?
- Is there a need to define transactions, improve error checking and/or strengthen update logs? (A further reason for using SQL Server 2005 or later is the implementation of the TRY...CATCH construction to allow improved error checking in T-SQL processes; see the sketch at the end of this step.)
- Is there a need to review the approach to database logging, which may have consequent impact on the management of transaction logs?
- Is there a need to review policy around the back-up of cryptographic keys and certificates to avoid data loss through the corruption of either?
- How will application defined pseudo-keys such as salts be maintained and protected?
- Is there a need to review operational processes around routine database integrity checking? (For users migrating from SQL Server 2000 this includes a decision on whether to move to CHECKSUM rather than TORN_PAGE_DETECTION as an integrity check.)
- Is there a need to change operational procedures in light of the above?

Features required to support transition

There is a need to identify the facilities required to support the transition from current processes and operations with identifiable data to operating with both de-identified and identifiable data:

- Are development and test environments available?
- Is there an intention to support dual running, and what are the implications of this for the solution?
- Is there a need to be able to change certificates, keys and salts between test and production versions?
- Is it intended to undertake phased implementation, and what are the implications of this?
- Is the infrastructure (and particularly disk space) sufficient to support the processing required to restructure the database, should this be required?
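A minimal sketch of the TRY...CATCH construction referred to above (table names invented), wrapping a pseudonym load in a transaction so that a failure such as a duplicate key does not leave the cross-reference table half updated:

    BEGIN TRY
        BEGIN TRANSACTION;

        -- Add entries for any clear values not already present
        INSERT INTO SafeHaven.PseudonymMaster (ClearValue)
        SELECT DISTINCT s.NHSNumber
        FROM   Staging.Episode AS s
        WHERE  NOT EXISTS (SELECT 1
                           FROM   SafeHaven.PseudonymMaster AS m
                           WHERE  m.ClearValue = s.NHSNumber);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        -- Log and re-raise so that the ETL job is flagged as failed
        DECLARE @msg nvarchar(2048);
        SET @msg = ERROR_MESSAGE();
        RAISERROR (@msg, 16, 1);
    END CATCH;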
2.7 Step 6 - Plans

As part of a wider project, the technical implementation must be planned appropriately. An important component is ensuring that there are comprehensive test plans in order to minimise risk, including some form of dual running to check that outputs are consistent using the different forms of data.

2.8 Step 7 - Review the risks

The penultimate stage is to review the risks that arise from the proposed design and approach to implementation and to ensure that they are mitigated and controlled. Key items for consideration include:

- Risks of irretrievable data loss, e.g. inconsistent pseudonymisation, incomplete update processes
- Failure to maintain pseudonymisation, e.g. a mapping between pseudonyms and clear values being revealed
- Unintended release of identifiable data through failures in the security model, e.g. through accidental grant of implicit permissions, source data not secured, transient tables not cleared etc.

2.9 Step 8 - Iterate requirements, risks and the technical solution

The final stage involves iterating through the above with the aim of ensuring the simplest and most robust approach to providing a technical solution that meets the business requirements.

Tables in support of Figure 1

(In the tables below, "identifiable data retention period" relates to typical periods for which such data needs to be readily available for queries/analysis, as opposed to the legal requirements for retention of data.)

Table 1 - Weaknesses in provider operational systems
- Relevant data items: All, typically including patient name
- Scope of patients who may need to be identifiable: All or large sub-sets
- Access to identifiable information: On-line
- Number of users needing to access identifiable information: Many
- Update interval: Frequent - at least overnight
- Identifiable data retention period: Short - generally less than 12 months from last event
- Key audit event: Any access

Table 2 - Identification of Individuals at Risk
- Relevant data items: All - patient name, address and phone number will need to be accessed to support intervention
- Scope of patients who may need to be identifiable: Those identified as cases of interest only
- Access to identifiable information: By re-identification of a small subset of individuals defined through analysis of pseudonymised data
- Number of users needing to access identifiable information: Few
- Update interval: Weekly update for initial analysis
- Identifiable data retention period: Medium - last three years (may be longer, e.g. for smoking)
- Key audit event: Re-identification
Table 3 - Secondary data to General Practice
- Relevant data items: All, typically including patient name
- Scope of patients who may need to be identifiable: All with a relationship to a given GP Practice, subject to any cases (implicit or explicit) where the patient would expect information to be withheld (e.g. some sexual health interventions)
- Access to identifiable information: On-line (potentially via web) or via distribution of formatted data reports (which should be encrypted)
- Number of users needing to access identifiable information: GPs and identified individuals in GP Practices in respect of defined subsets
- Update interval: Weekly update (cases where a local reporting solution is used to provide operational data such as discharge letters or results to GPs are included in the first category, Weaknesses in Current Systems, for the purpose of the current discussion)
- Identifiable data retention period: Short - generally less than 12 months from last event
- Key audit event: Any access

Table 4 - Clinical Audit
- Relevant data items: Local Patient Identifier + DoB (to confirm)
- Scope of patients who may need to be identifiable: Those identified as cases of interest only
- Access to identifiable information: By re-identification of a small subset of individuals defined through analysis of pseudonymised data
- Number of users needing to access identifiable information: Few - typically restricted to the clinical audit team, named individuals + some clinicians
- Update interval: Weekly update
- Identifiable data retention period: The need to identify individuals will not usually extend beyond those recently receiving care, though the requirement is to link patient events over an extended period
- Key audit event: Re-identification

Table 5 - Data Quality
- Relevant data items: Local Patient Identifier + DoB (to confirm) + any personal identifiers which are subject to query; potentially SUS spell-id and pathway identifier
- Scope of patients who may need to be identifiable: Those identified as cases of interest only
- Access to identifiable information: By re-identification of a small subset of individuals defined through analysis of pseudonymised data, or access to clear data within the New Safe Haven
- Number of users needing to access identifiable information: Few - named individuals in information departments and/or medical records
- Update interval: Weekly update
- Identifiable data retention period: The need to identify individuals will not usually extend beyond those recently receiving care, though the requirement is to link patient events over an extended period
- Key audit event: Re-identification or access to New Safe Haven
- Comment: Good practice is to centralise Data Quality reporting within the New Safe Haven.
Table 6 - Tracing and validation of Practice/PCT
- Relevant data items: NHS Number + DoB + Gender
- Scope of patients who may need to be identifiable: All
- Access to identifiable information: Visibility not required except for a small subset of cases requiring manual resolution
- Number of users needing to access identifiable information: Few - named individuals in information departments and/or medical records
- Update interval: Monthly update - perhaps weekly
- Identifiable data retention period: Data requirement is transient and information could be cleared after the tracing process is complete
- Key audit event: Re-identification where required to support manual resolution
- Comment: Assumes access to NSTS-like tracing functionality via PAR. There is the potential to use the PDS derivations with SUS to validate practice and PCT for SUS data.

Table 7 - Spatial analysis
- Relevant data items: Postcode & geocode derivation
- Scope of patients who may need to be identifiable: All
- Access to identifiable information: Not required to be visible as potentially identifiable, other than when point mapping at relatively high resolution is required and in some cases with regard to communicable disease
- Number of users needing to access identifiable information: None at a level which is personally identifiable, except in the special case where mapping is used to support communicable disease tracing
- Update interval: Weekly update
- Identifiable data retention period: Extended for postcode
- Key audit event: Re-identification
3 System Design - MS SQL Server

3.1 Introduction

The focus of this paper is on practical approaches to the creation and maintenance of pseudonyms in the context of local organisations that need to support a mix of primary and secondary uses with data drawn from a range of sources; the discussion does not extend to more theoretical treatment as part of a wider whole systems architecture. The discussion encompasses two cases:

- Where there is no need to maintain personal identifiers to support analysis, but where there is a need to assemble person level data from a range of disparate sources which require linkage at person level.
- Cases where local data is accessed by a mix of users, some of whom have a legitimate interest in personally identifiable data but where others do not. (An example of the first group might include people identified as needing intervention by a PCT analysis of people at risk of readmission to hospital.)

Native encryption and hashing technologies were added to MS SQL Server as part of SQL Server 2005 and form an important element in the pseudonymisation toolkit. SQL Server 2005 also saw the implementation of user/schema separation (discussed further below), which strengthens the ability to secure systems which must meet the needs of a range of users. For these reasons:

- The discussion which follows does not apply to SQL Server 2000 and earlier products.
- Where there is a confirmed need to support pseudonymisation (which will be the norm), users who are currently using SQL Server 2000 or earlier should consider migrating to a more recent version of the product.

The process of migration from SQL Server 2000 typically involves a close consideration of database structure and processes. Users planning to migrate from SQL Server 2000 should consider whether to use this as an opportunity to implement pseudonymisation as part of the migration process.

An issue has been raised about the mechanisms available to support pseudonymisation with MS-Access. There are a number of considerations:

- MS-Access does not in itself offer an adequately secure environment and should not be used to maintain clear data. The only exception is where its use is to support the analysis of small datasets on a fully encrypted machine. In that case, the data which is maintained in clear should be restricted strictly on a need to know basis - for example, to date of birth, where there is a requirement to support multiple age related derivations, without the automatic assumption that other sensitive fields should be visible in clear. In addition, MS-Access lacks the backup, recovery and transaction logging functionality available in MS-SQL.
- Where users are seeking to pseudonymise the static content of an existing MS-Access database, or clear data maintained in other office products, it is possible to use MS-SQL Server Express as an engine to do so using the techniques set out below. An example will be found in the appendices.
While SQL Server Express supports column encryption, the fact that encryption keys are normally maintained within the SQL Server environment means that physical security remains an issue. (There are additional mechanisms available to mitigate this risk, particularly under SQL Server 2008, which provides support for external storage mechanisms; these are outside the scope of this paper.)

3.2 Code Samples - Summary of Testing Approach

A number of approaches have been taken to testing of the proposed pseudonymisation functions:

- For postcode, the PAF has been pseudonymised (2.4m records), both on the desktop and by WMCSS.
- For NHS Number, an initial check was made on approximately 24,000 CFH synthesised test cases, with WMCSS checking a further 5m+ cases.
- For dates, a table has been pre-populated with a pseudonymised look-up from 1890 to 2040 and checked.

A key consideration in checking the proposed approach is that, when correctly implemented, the table which contains the cross reference between pseudonym and clear text provides a rigorous check on the integrity of the approach. Key points of failure would be:

- The routine generates duplicate pseudonyms for differing inputs
- The routine generates more than one pseudonym for the same input
- Data is lost

The cross reference tables which link clear data and pseudonym have a primary key defined on the root key (so that it cannot contain duplicates or be null) and unique indexes declared on the public (group or output) pseudonyms (so that these cannot contain repeated values, though in this case, for the reasons set out in the discussion of null handling below, a single null is allowed). So, if:

- the table has been successfully populated, and
- it has the same count as the input (once repeating values have been removed from the input) plus 1 (to allow for the NULL case which is added explicitly when the table is first created), and
- the clear values in that table match the input set, then

data integrity has been maintained and the pseudonyms have been created successfully. These conditions were met in all cases. Further checks were undertaken on format and length.

The other risk around any process of pseudonymisation is that the input can be derived from the distribution of the pseudonymised values. Appendix 2, Figure 10, contains a check on the distribution of 1M numbers when subjected to the proposed pseudonymisation approach. This gives some reassurance, as do the known characteristics of a cryptographic hash function, which provides the basis of the approach.

The fact that the initial application of the algorithm for a short form pseudonym produces few clashes means that the process of resolving them (i.e. taking further passes through the data to remove them) is relatively simple. The need to ensure that there are no duplicates in these cases is now emphasised in the text as well as being incorporated in the code (the latter has always been the case). The checks on the production of short form pseudonyms set out above were applied to this code.
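The integrity conditions above translate directly into SQL. As a sketch (assuming a cross-reference table SafeHaven.PseudonymMaster with columns ClearValue and PublicPseudonym1; all names invented), checks of this kind might be run after each load:

    -- Both queries must return zero rows: no clear value may carry more than
    -- one pseudonym, and no pseudonym may serve more than one clear value.
    SELECT ClearValue
    FROM   SafeHaven.PseudonymMaster
    GROUP  BY ClearValue
    HAVING COUNT(DISTINCT PublicPseudonym1) > 1;

    SELECT PublicPseudonym1
    FROM   SafeHaven.PseudonymMaster
    WHERE  PublicPseudonym1 IS NOT NULL
    GROUP  BY PublicPseudonym1
    HAVING COUNT(DISTINCT ClearValue) > 1;

    -- The master table should hold one row per distinct input value,
    -- plus one for the explicit NULL row.
    SELECT (SELECT COUNT(*) FROM SafeHaven.PseudonymMaster) AS MasterRows,
           (SELECT COUNT(DISTINCT NHSNumber)
            FROM   Staging.InputData
            WHERE  NHSNumber IS NOT NULL) + 1 AS ExpectedRows;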
That said, it is clearly critical that any user implementing pseudonymisation tests rigorously. The paper documents a wide variety of cases, particularly where data quality is poor, which can derail the process if not taken into account, and this must be checked as part of the specific solution. Code examples are provided to illustrate the narrative only and no responsibility is taken for the correctness or reliability of their operation; it is the absolute responsibility of the user to test the functionality and operation of any proposed changes to the systems for which they are responsible and to ensure that such testing has been undertaken in respect of the systems they use.

3.3 Design issues

The need to support de-identification and to derive and assign unique short-form pseudonyms implies the need to maintain clear data somewhere within the application. This data must be maintained in a demonstrably secure environment. (The security and confidentiality of the source extracts, backups and the use of temporary storage to process data are also relevant issues.)

This leads to the general principle that personal data should be maintained in clear values in one, and only one, location. (This statement is made for simplicity; it is accepted that the technical needs of clustering, disaster recovery etc. may make this a single logical location. Note also the comments in relation to key length later in this paper.) Adopting this approach provides simplicity and clarity, in that there is only one source of the clear data to control. While design constraints may result in the duplication of physical tables between a relational staging area and dimensional tables in SQL Server Analysis Services, the design should ensure that these operate as a single virtual table.

The only exception is the maintenance of transient data as part of load processing, where access should be tightly limited to those directly responsible for maintaining the data.

The diagram below shows the resulting impact on design. Although the approach is common to the maintenance of any pseudonymised value, the discussion below is couched in terms of the NHS Number.

[Figure 2 shows a pseudonymisation master table holding the root pseudonym (primary key), the clear value (encrypted as required), public pseudonym(s) 1-n, an IsNull flag, data quality indicators, associated data and a last modified date etc. The root pseudonym acts as a surrogate key in a dimensional design (FACT tables) and as a foreign key to normalised transactional tables, e.g. Finished Consultant Episode.]
Figure 2 - Pseudonymisation design

As shown in Figure 2, a pseudonymisation master table:

- Holds the root pseudonym. The root pseudonym is used for internal linkage only and is never exposed to users. It will be the primary key of the master table and will act as a foreign key to other tables needing to recover a public pseudonym or clear values. In a dimensional model it acts as a surrogate key to link dimension and FACT tables.
- Maintains the clear value of the sensitive data, which may or may not be encrypted.
- Maintains one or more public (or output) pseudonyms (also known as Group Pseudonyms in the context of SUS). These are the pseudonyms which are exposed to the users who need access to a pseudonym; the ability to maintain and hold multiple public pseudonyms allows different pseudonymisations to be applied in different contexts. (This assumes the need to support a limited number of public pseudonymisations; where the total number of pseudonymisations is not limited, a normalised structure may be required.)

There are strong reasons for not using the root key as a public pseudonym since, if it is compromised (e.g. by the creation of an external map between it and the clear text), the only option is to rebuild the database tables in a way which changes the values of the key. In SQL Server, there are also design constraints which militate against the use of the randomised and relatively long values which are appropriate to an exposed pseudonym as primary/surrogate keys. These are discussed at greater length below.

The length and format of public pseudonyms is constrained by the fact that they may need to be:

- shared between users to support the identification of a problem case without the need to refer to personally identifiable information;
- shown on reports without impacting adversely on existing formats;
- handled by existing systems.

These factors imply that public pseudonyms should be consistent in length with the corresponding clear text if at all possible ('short-form pseudonyms').

As an internal value, the format and length of the root key is not a concern to users. However, where the root pseudonym will be used within a dimensional approach, for example within SQL Server Analysis Services, design considerations indicate that it should be as short as possible.

Other data elements within the pseudonymisation master table will be determined by the design and may include:

- An IsNull flag. This relates to the handling of records with NULL values for the clear data, as discussed below.
- Data quality indicators - for example, in the case of the NHS Number this might include a flag to confirm that the input (clear) NHS Number passed the modulo 11 check, and a consistency flag (set to fail if inconsistent data is identified on records for the same NHS Number, e.g. inconsistent dates of birth).
- Subject to local business requirements and system design, other data relevant to the grain and dimension - for example, there may be a case for extending the NHS Number pseudonymisation master table to include date of birth, to support consistency checking (allowance must then be made for possible inconsistencies in the date of birth, for example by taking the latest non-null date of birth and flagging the record if inconsistencies are identified; see the code samples for a practical example), and for maintaining the most recent GP practice code associated with the patient to support access control.
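A minimal sketch of such a master table for the NHS Number case (column names and types are invented; a real design would reflect local requirements; note that in SQL Server a unique index permits exactly one NULL, which matches the handling of the single NULL row discussed in Section 3.2):

    CREATE TABLE SafeHaven.PseudonymMaster (
        RootPseudonym    int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_PseudonymMaster PRIMARY KEY, -- internal linkage only
        ClearValue       varchar(10) NULL,  -- clear NHS Number; encrypt if required
        PublicPseudonym1 varchar(10) NULL,  -- short-form pseudonym exposed to users
        IsNullFlag       bit NOT NULL DEFAULT 0,
        Mod11CheckPassed bit NULL,          -- data quality indicator
        LastModified     datetime NOT NULL DEFAULT GETDATE()
    );

    -- Public pseudonyms must not repeat; a single NULL row is allowed
    CREATE UNIQUE INDEX UX_PseudonymMaster_Public
        ON SafeHaven.PseudonymMaster (PublicPseudonym1);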
Simple algorithms for deriving pseudonyms are discussed below. However, there are three broad approaches:

- Check the incoming value against the data maintained in clear and, if the incoming value is new, assign the next sequence number given by the identity function as the root pseudonym for the new value (a sketch of this check-and-assign pattern is given at the end of this section).
- Derive a pseudonym directly using a cryptographic hash.
- Derive a short form pseudonym using one of the approaches set out below.

The first two of these are most relevant to the derivation of a root pseudonym:

- A particular risk with using an identity function is that the data to which it is applied is sequenced in some way which imparts meaning to the root key; the generation of a date pseudonymisation master table and the initial load and creation of master tables are obvious examples.
- The length of a cryptographic hash (16-20 bytes) limits its direct usefulness where the result will be used as a surrogate key. Within SQL Server, a long random value is also a poor candidate for a clustered index and can give rise to poor performance on update and significant inflation in table size. However, the pseudo-random relationship of the hashed output to the clear text, together with the fact that it can be derived directly from the source, can be useful as an intermediate step when generating root pseudonyms.

For these reasons, the practical examples discussed in the next section tend to use a combination of both methods.

It is to be emphasised that where an existing system design already supports entities (such as a person dimension) which can be developed to support pseudonymisation with limited modification, where processes to maintain surrogate keys are already in place and where keys are populated, nothing in this document should be taken as indicating that there is a need to rebuild the existing database using a new set of keys. Rather, the approach should be to:

- Ensure that the keys which act as root pseudonym are not exposed to users (accepting that technical staff may need to use the root key to build efficient new reports; any such work on live data should necessarily be limited to the environment of the New Safe Haven);
- Implement public pseudonyms as additional attributes.

In these circumstances, effort should be directed to:

- Ensuring that the root pseudonym is never exposed to users and is protected by design and strong access controls.
- Implementing support for one or more public pseudonyms, where users need access to a pseudonym.
- Implementing rigorously controlled processes to support de-pseudonymisation, where de-pseudonymisation functionality is required.
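The check-and-assign pattern referred to above might look like the following sketch (names as in the earlier illustrative master table; the ORDER BY NEWID() mitigation for the sequencing risk is an assumption added here, not taken from the paper):

    -- Assign root pseudonyms to clear values not yet in the master table.
    -- Inserting in random order avoids the identity sequence mirroring any
    -- meaningful ordering in the input.
    INSERT INTO SafeHaven.PseudonymMaster (ClearValue)
    SELECT s.NHSNumber
    FROM   (SELECT DISTINCT NHSNumber
            FROM   Staging.InputData
            WHERE  NHSNumber IS NOT NULL) AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   SafeHaven.PseudonymMaster AS m
                       WHERE  m.ClearValue = s.NHSNumber)
    ORDER  BY NEWID();

    -- Recover the root pseudonym for each incoming record by joining on
    -- the clear value.
    SELECT s.NHSNumber, m.RootPseudonym
    FROM   Staging.InputData AS s
           JOIN SafeHaven.PseudonymMaster AS m ON m.ClearValue = s.NHSNumber;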
3.4 Pseudonymisation methods

A number of approaches have been considered as mechanisms for generating pseudonyms. These are summarised below; more detail will be found in the appendices.

Cryptographic hash functions

A cryptographic hash function maps strings of arbitrary length to strings of fixed length so that it is computationally infeasible:

- to discover the source text from the output (pre-image resistance);
- to discover an alternative source input which would give rise to the same output (second pre-image resistance);
- to find any two distinct inputs which give rise to the same output (collision resistance).

A number of algorithms have been developed aimed at meeting these requirements, notably MD4, MD5, SHA-1 and SHA-2 (SHA-3 is currently under development). SQL Server currently supports all but SHA-2.

Although the application of a given algorithm to the same clear text will give the same result, alternative pseudonymisations can be generated by the application of a constant (or salt) to the input stream prior to hashing.

The attractions of a cryptographic hash as a mechanism for pseudonymisation can be summarised as follows:

- The length of the output is fixed and known in advance.
- The function is relatively fast in its operation (see the discussion below).
- Since the function produces a definitive result, it can be used to obtain the pseudonym algorithmically without the need to first check for an existing pseudonym in a pseudonymisation master table, though the need to maintain entries in a master table remains if there is a need to support de-pseudonymisation.
- Because the pseudonym can be re-derived from source data, provided only that the algorithm in use and the salt are known, it offers additional recovery mechanisms in the event of data corruption.

Because the output of a hash function is stable for a given input, salt and platform, a cryptographic hash can provide a direct method of creating a pseudonym without reference to a look-up table, though such a table is required to support re-identification.

The primary disadvantage of using a cryptographic hash lies in the length of the result which, depending on the algorithm in use, is 16 or 20 bytes (i.e. 32 or 40 characters when expressed as hexadecimal). This is not consistent with a requirement for a public pseudonym of the same length as the clear text being pseudonymised, and also gives rise to issues around physical design and the suitability of the result for use as a primary key.
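Section 4 of the full paper works through this in code (Figures 4 and 5); as a minimal sketch here (the salt value is invented and for illustration only):

    -- Derive a salted SHA-1 pseudonym for an NHS Number.
    DECLARE @Salt      varchar(40);
    DECLARE @NHSNumber varchar(10);
    SET @Salt      = 'replace-with-protected-random-salt';  -- must be kept secret
    SET @NHSNumber = '9434765919';

    -- Returns a 20-byte varbinary, stable for a given input, salt and algorithm
    SELECT HASHBYTES('SHA1', @Salt + @NHSNumber) AS RootHash;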
Random sampling without replacement

In this approach each clear value is uniquely associated with a random number. (For multiple key use there may also be a need to store the key or intended target in the table as well.) The process:

- First checks the pseudonymisation master table to find whether a pseudonym has previously been identified against the clear text.
- If a pseudonym is found, that value replaces the clear text in the input record.
- If there is no entry in the pseudonymisation master table, a new pseudonym is required and is either taken from a pre-populated list of unused values or requested from a random number generator; the value is checked to see whether it is already in use and the process repeated until an unused value is returned. The new value is updated to the pseudonymisation master table together with the clear text, and the new pseudonym replaces the clear text in the input record.

The advantage of the approach is that it can support the generation of pseudonyms of any length.

The key issue is the mechanism used to produce random numbers; a comparison of the approaches available in SQL Server is set out in Appendix 2, Figure 9.

Finally, special considerations apply where the set of data to be pseudonymised has a defined set of occurrences; these include dates and postcodes. There are grounds for pre-populating tables of pseudonyms in these cases, ensuring that the pseudonyms are drawn from a very wide range of random values.

Partial extraction from a cryptographic hash

This approach combines some of the features of the previous two approaches and provides a mechanism for producing a reduced length pseudonym from a cryptographic hash function, though with a non-zero probability that the candidate pseudonym will clash with a value created by a different input. For this reason, a two-step approach is required:

- A candidate pseudonym is derived.
- There is then a need to check and resolve potential clashes in a way similar to that described above when a random number is used. That said, the probability of a clash typically varies from low to very low, allowing the implementation of efficient mechanisms to undertake the second step.

The output of the process is pseudo-random; see Appendix 2 for details.

The approach uses the fact that SQL Server will take a substring of up to 8 bytes from a binary (varbinary) field, such as that produced by the cryptographic hash function, and convert it to a corresponding integer (8 byte bigint) value, giving a candidate pseudonym of reduced length. The length of the pseudonym can be reduced further if necessary by expressing the integer in Base 16 (hexadecimal) or Base 32.

The size of the population from which the candidate pseudonym is drawn will depend on the number of bytes extracted, but can be very large; details are given in Table 8 below. In consequence, the power of the approach to obfuscate the data is high and the numbers of clashes which are likely to be generated are few. In the following cases, the number of bytes extracted was dictated by the need to provide an output of the same length as the clear text input (when expressed to Base 32):

- Pseudonymising the 2.4m entries in the postcode directory and extracting a 5 byte integer gave rise to a minimum of 2 and a maximum of 3 clashes.
- Pseudonymising over 5 million NHS Numbers and extracting a 6 byte integer gave rise to no clashes. (This check was undertaken on live data by West Midlands Commissioning Service Agency.)
    Bytes | Held as | Min value | Max value                          | Max length as integer | Length Base16 | Length Base32 | Comment
    4     | bigint  | 0         | 2^32-1 (4,294,967,295)             | 10 | 8  | 7  |
    5     | bigint  | 0         | 2^40-1 (1,099,511,627,775)         | 13 | 10 | 8  | Useful for postcode
    6     | bigint  | 0         | 2^48-1 (281,474,976,710,655)       | 15 | 12 | 10 | Useful for NHS#
    7     | bigint  | 0         | 2^56-1 (72,057,594,037,927,935)    | 17 | 14 | 12 |
    8     | bigint  | -2^63     | 2^63-1 (9,223,372,036,854,775,807) | 19 | 16 | 13 | May give rise to negative integers because negatives held as complements
    4     | integer | -2^31     | 2^31-1 (2,147,483,647)             | 10 | 8  | 7  | May give rise to negative integers because negatives held as complements

Table 8 - Range and length of extracted varbinary data

The approach was found to be the most effective for the production of short pseudonyms:
- It is relatively simple to code.
- It is relatively fast.
- Multiple pseudonymisations can be generated by specifying different salt values for the hash function.
- Because clashes are few they can be quickly resolved, which has the advantage that a near definitive output can be obtained from a given input.

Examples of the SQL code required to support the use of this approach will be found in the code samples.

Encryption as a mechanism for pseudonymisation
In principle, it is possible to use a simple stream encryption as a basis for pseudonymisation, with the encrypted value acting as the pseudonym and de-pseudonymisation effected by decryption.

The use of this approach is, however, deprecated, not least because the native functionality which would support it (the RC4 and RC4_128 algorithms) has itself been deprecated in SQL Server and will be removed by Microsoft in a future release.

Other ciphers are implemented by a form of double encryption to ensure that the encrypted value of a given clear text varies from case to case. This has two consequences:
- Encryption is not a suitable basis for the direct algorithmic creation of pseudonyms.
- Indexes on the encrypted field are not an effective mechanism to improve performance, as the output is not deterministic.

While it is possible to sidestep these constraints to some degree by using a look-up process to store the first occurrence of the encrypted value and then use this as a pseudonym, the approach is not recommended:
- The process generates a very long pseudonym of indeterminate length: for example, the generated pseudonym for a 10 character NHS number is 20 bytes using an SHA-1 hash algorithm but around 66 bytes using AES-128 or AES-256. This is wasteful of
storage and likely to give rise to less efficient processing than the use of a hash key, which is shorter in length.
- Efficient processing of both updates and patient-specific searches requires either that the look-up table is supplemented by the use of a hash function or that the source data is maintained unencrypted on the look-up table [31].

Special case - dates and other restricted sets
When the range of potential values is relatively small and bounded, the key requirement is to ensure that the pseudonyms in use are drawn from a wide set of potential values. Note there is a difference between the handling of dates and postcodes:
- Dates are static, and the list of potential pseudonyms can be created as a one-off exercise.
- Postcodes are relatively static but not completely so. While it is possible to pseudonymise the whole PAF file, allowance must be made for the fact that the Post Office routinely add 4-5,000 new postcodes a month [32], so that incoming data must be routinely checked to ensure that new additions are made to the reference table of postcode pseudonyms as required.

Other considerations
Null values
Special consideration needs to be given to the pseudonymised representation of NULL values within the clear text. The requirements can be summarised as follows:
- It should be possible to support joins between the pseudonymisation master table (containing the root pseudonym and associated fields) and other tables on the root pseudonym using a simple inner join. This implies that a record should be included in the pseudonymisation master table with a value for the root pseudonym to cover the case where the input clear value is NULL.
- It should be possible to distinguish these cases from those where the input value is populated. This implies pre-populating the root pseudonym with a special value (e.g. 0), flagging the NULL case, or both.
- In a transactional environment, user queries based on a public pseudonym (where allowed) should ensure that accidental joins are not made between records where the input value was NULL, and should ensure that null values are propagated in accordance with the rules of three-valued logic. This implies that the value of the public pseudonym should be set to NULL when the input value was NULL.
- In a dimensional (data warehousing) model, the usual approach is to use a specific label (such as 'Missing') to flag NULL values. This step can be handled when the dimension tables are built.

An example is set out below; a sketch of the corresponding SQL follows:

    Clear | Root Pseudonym/Surrogate Key | Public Pseudonym 1 (transactional) | Public Pseudonym 2 (dimensional) | IsNull Flag
    NULL  | 0                            | NULL                               | Not Known                        | 1 (0 for other cases)

32 In the past, the Post Office have re-used postcodes. More recently they have indicated that they will seek to avoid doing so, but this is not guaranteed. For this reason a pseudonymised postcode table should include applicable dates.
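A minimal sketch of seeding the special NULL row described above follows; the table and column names are assumptions for the example.

    -- Illustrative only: seed the master table with the reserved row for NULL input
    IF NOT EXISTS (SELECT 1 FROM dbo.PseudonymMaster WHERE IsNullFlag = 1)
        INSERT dbo.PseudonymMaster
               (RootPseudonym, PublicPseudonym1, PublicPseudonym2, IsNullFlag)
        VALUES (0, NULL, 'Not Known', 1)
        -- root = 0 supports simple inner joins; the transactional public pseudonym
        -- is NULL; the dimensional public pseudonym carries the 'Missing'-style label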
3.5.3 The approach can be extended to include cases where there are different classes of unknown - e.g. postcodes, where there is a range of pseudo-postcodes for unknown cases [33] - for example by flagging each case in the master table.

Data quality and internal representation
Inconsistently formatted data will generate inconsistent pseudonyms. Where it is intended to base pseudonymisation on an approach which involves a cryptographic hash function, this extends to the need to ensure a consistent internal representation of the data.

The examples set out in Figure 3 have been generated by the simple approach to creating short pseudonyms discussed below, and show that the pseudonyms can change as a result of apparently minor differences in format or in the internal representation of the data types used to present the data for pseudonymisation [34].

    Ref | Field description                                                                       | Example           | Pseudonym
    P1  | 8 chars, left/right justified (old postcode format)                                     | "CW2  5GX"        | Y6BQEB7R
    P2  | 8 chars, left/right justified, mixed case                                               | "CW2  5gx"        | G63EJCM8
    P3  | 7 chars, single space separator, no trailing space                                      | "CW2 5GX"         | 5H9BV0ZL
    P4  | 8 chars, single space separator, trailing space                                         | "CW2 5GX "        | 8T22EGGF
    D1  | Date held as character; input format to pseudonym: varchar format date                  | <date as varchar> | W5Z4VD
    D2  | Date held as datetime; input format to pseudonym: varchar date time                     | <date> 00:00:00   | DTBA
    D3  | Date held as datetime, but containing time as well as date information;
          input format to pseudonym: varchar date time                                            | <date> 00:00:01   | 00QVPHY2
    D4  | Date held as datetime; input format to pseudonym: internal representation (varbinary)   | <date> 00:00:00   | 00YPNETW
    T1  | Maintained as int                                                                       | <value>           | E6QQF6GW
    T2  | Maintained as bigint                                                                    | <value>           | Y39TK2G
    T3  | Maintained as varchar                                                                   | <value>           | HJR26Z86
    T4  | Maintained as nvarchar                                                                  | <value>           | VJ1DEHT

Figure 3 - Different input formats can give rise to different pseudonyms

33 See
34 The code used to generate the examples is referenced in Appendix 3 as Code Sample 16.

Particular problem cases include:
- Inconsistent formatting of postcodes maintained in local systems. This is a particular concern given that the standard for maintaining postcodes differs between the historic
HES format, in which the outward postcode was left adjusted and the inward code right adjusted (case P1), and that set by BS7666/GIS, in which the whole code is left adjusted with a single space as separator (case P3). (See Code Sample 1 in Appendix 3 for sample code to address the issue.)
- Cases where source systems generate times as well as dates. For example, a PAS may maintain the date of birth as in example D2, while a maternity system will maintain it as at D3. (See Code Sample 2 in Appendix 3 for sample code to address the issue.)
- Inconsistent handling of trailing blanks (cases P3 and P4).
- NHS numbers, where the NHS number check digit may or may not be provided by the source system and there is a possibility that data is maintained in external (3-3-4) rather than internal format. (See Code Sample 3 in Appendix 3 for relevant code.)
- Local Patient Identifiers, where formatting of the same identifier may differ between systems (e.g. use of blanks v. '-' as a separator).
- Inconsistent formatting of character fields: cases P1 and P2 show the difference for a postcode, but the problem can occur in other contexts [35].
- Cases D4 and T1-T4 relate to local processing rather than data quality and show that in some cases results can be dependent on data handling. A particular risk is where an implicit conversion takes place but is not as expected. The answer is to set standards, document them, and always explicitly cast or convert data before pseudonymisation. This should be the case even where the input presentation is known and no change appears to be required, to ensure that later changes do not give rise to unexpected inconsistencies.

Re-identification
There is a distinction to be made between those cases where there is a requirement to view data for all patients in clear, and those where the need is to view only a limited subset of cases which have been identified through analysis. The first case will normally be limited to information provided to the patients' GPs and others who have a direct relationship with the patients concerned. The second case includes people who have been identified as being at risk through the analysis of data (e.g. through PARR+), leading to active intervention to improve expected outcomes (e.g. through a virtual ward). It also includes cases where some personal identification is required to enable a PCT to engage in dialogue with a Trust about contracting for a patient or for a group of patients.

The preferred approach in the first case is to maintain the information in a separate environment. If this is not readily achieved, then it is important that access controls ensure that the ability to access information is strictly limited. The use of encryption as a means to provide additional controls on the ability to access clear text can be a useful supporting technology in this case and is discussed further below.

The second set should be handled through an explicit re-identification process which should be explicitly logged. Consideration should be given to limiting the number of cases which can be de-pseudonymised at any one time when there is no expectation that this will be required for more than a small subset of cases.

Encryption
It is not necessary always to encrypt the clear value held against root and secondary pseudonyms within pseudonymisation master tables. This decision should be made on the basis of an assessment of risk.
Highly specific and readily applied identifiers should always be encrypted: thus full name and address data, if held, should be encrypted, though exceptional justification would be required to maintain these at all. The combination (house number, postcode) and geocodes should also be encrypted if held.

35 Note that algorithmic approaches to the creation of pseudonyms will reflect the binary content of the field and will always give different pseudonyms where character fields contain the same text but in a differing mix of cases, irrespective of the collation used.

Internally, where a salted hash is used as the basis for pseudonymisation, the salt values should be encrypted and accessed from the routines which use them; they should not be hard coded within the application.

The following observations are relevant to the use of encryption:
- If encryption is implemented, account should be taken of the impact of encryption on index performance [36].
- An attempt to decrypt a field requires that the relevant key has previously been opened; otherwise the decryption will return a NULL value. This provides a potential mechanism to enhance access control [37], while the action of opening the key can also be used to trigger an audit entry.
- It is important to take account of this behaviour when recovering a salt or encrypting values, as a failure to open keys successfully will not necessarily generate a failed process. In these circumstances results will generally be unpredictable and will depend on the detail of the code, but potential outcomes include the return of a NULL instead of an encrypted value or the creation of inconsistent pseudonyms. The recommendation is therefore to check that keys have opened successfully before running any process which is dependent on them, and to raise an error if they have not. One simple approach is to check that a constant value, when encrypted and decrypted using the key, returns the starting value prior to undertaking operations; see the code samples, and the sketch at the end of this section, for example coding.

Other observations
The following observations have been made in early discussions around this document. It is stressed that while relevant and useful, they are not intended to be all encompassing:
- Transient tables containing clear data should be cleared at the earliest possible opportunity, and the security of source data should be specifically addressed.
- Backups and archived data should be encrypted.
- User/schema separation and database roles can be used to ring fence data and control rights; all end user access to data in clear should be through stored procedures, and other data should be accessed via views or stored procedures. As part of this, tables holding clear data, and operations relating to pseudonymisation and de-pseudonymisation, should be maintained in a distinct schema and accessed by distinct database roles [38]. That said, it is accepted that the implementation of this approach within an established system is non-trivial and other approaches can provide an alternative means of separation; for example, some users have done so by creating separate instances of the database.
- No users should have the ability to access base tables.
- No rights should be granted through the user or public roles. While rights should normally be granted through membership of database roles, some security functions (and particularly EXECUTE AS) can only be exercised by referencing an individual. Specific virtual users should be set up and maintained for this purpose rather than using accounts associated with real individuals.

36 See
37 In this context it is important that developers have an understanding of the implications of ownership chaining.
38 See Books On Line (BOL).
- Where elevated rights are required, e.g. to allow a certificate to be used to decrypt a key for later use, rights should be assigned to a user maintained for that purpose, and a stored procedure used to undertake the action using the EXECUTE AS option.
- Data recovery issues should be explicitly addressed.
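The key-verification step described under 'Encryption' above can be sketched as follows; the key, certificate and probe names are illustrative assumptions, not part of any standard setup.

    -- Minimal sketch, assuming a symmetric key SaltKey protected by certificate SaltCert.
    -- Confirms the key is usable before any dependent pseudonymisation process is run.
    DECLARE @probe varchar(20)
    DECLARE @roundtrip varchar(20)
    SET @probe = 'KEYCHECK'

    OPEN SYMMETRIC KEY SaltKey DECRYPTION BY CERTIFICATE SaltCert

    -- Encrypt and decrypt a known constant; a failed open surfaces only as a NULL result
    SET @roundtrip = CAST(DecryptByKey(EncryptByKey(Key_GUID('SaltKey'), @probe)) AS varchar(20))

    IF @roundtrip IS NULL OR @roundtrip <> @probe
        RAISERROR('Symmetric key SaltKey did not open correctly; aborting.', 16, 1)
    -- The key is left open here for use by the process which follows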
4 Implementation with SQL Server

4.1 Preparing data for pseudonymisation
As Figure 3 indicates, great care needs to be taken over the preparation of inputs to ensure that pseudonymisation produces a consistent output. In the case of a cryptographic hash, the need for consistency extends beyond external format and it is important to ensure that the internal binary representation of the data is consistent.

Issues to be considered include the following:
- Avoid converting integer and big integer values directly to varbinary, as these differ in their internal representation (see Figure 3, examples T1 and T2).
- Varchar and nvarchar differ in their internal representation, and the same data will give different outputs accordingly (see Figure 3, examples T3 and T4).
- Datetime and smalldatetime differ in their internal representation, and outputs differ accordingly (as does the SQL 2008 date type).
- The existence of leading or trailing spaces will give rise to different outputs.
- The existence of non-printable characters will give rise to different outputs.
- There is a need to decide whether or not the NHS number is to be stored with its check digit, and functionality needs to be put in place to ensure that it is maintained consistently.

The recommended approach is therefore to:
- Convert integers to nvarchar prior to input to the hashbytes function, and always left and right trim the result.
- If a hash function is to be used to pseudonymise dates, always convert dates into a standard representation such as ISO [39], removing the time component unless there is a specific requirement to use it for matching.
- Check for and remove non-printable characters if there is a risk that these are present.
- Despite the lower performance associated with the approach, build the standard conversions into a set of scalar functions and always use these to undertake pseudonymisation; a sketch is given below.

Data quality issues are not evident from data once it has been pseudonymised. For this reason data quality routines should be run prior to pseudonymisation. The decision on whether or not to create a pseudonym when a data quality check fails is a local one, but a value which fails to meet standards (e.g. an NHS number which does not meet the modulus 11 check) should always be flagged as in error.

4.2 Pseudonymisation methods
Creating random numbers
As part of the research undertaken to produce this paper, an analysis was undertaken of the mechanisms available to produce a pseudo-random distribution without resorting to external routines [40]. This is detailed in Appendix 2 and provides the background to the pseudonymisation approach set out in this paper.

39 Style value 112, e.g. CONVERT(varchar(8),@indate,112)
40 This should not prevent those capable of using external routines from doing so; users of SQL Server 2000 have few alternatives.
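By way of illustration of the scalar-function recommendation in 4.1 above, the following is a minimal sketch for the NHS number. The function name, parameter sizes and the decision to strip spaces are assumptions for the example, and data quality checks (such as the modulus 11 check) are omitted.

    -- Illustrative only: standardise an NHS number to a single agreed representation
    -- before hashing. Real implementations should add DQ checks and error flagging.
    CREATE FUNCTION dbo.fn_PrepareNhsNumber (@NhsNumber varchar(12))
    RETURNS nvarchar(10)
    AS
    BEGIN
        -- remove the separators of the external 3-3-4 format, trim,
        -- and cast to the single agreed type (nvarchar) for hashing
        RETURN LTRIM(RTRIM(CAST(REPLACE(@NhsNumber, ' ', '') AS nvarchar(10))))
    END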
The cryptographic hash function
A key component of the approaches discussed below is the use of a cryptographic hash function. The form of the function is:

    HASHBYTES ( '<algorithm>', { @input | 'input' } )

    where the potential values for <algorithm> are: MD2 | MD4 | MD5 | SHA | SHA1
    (note that the single quotes are required);
    input is varchar, nvarchar or varbinary;
    output is varbinary: 16 bytes for MD2, MD4 and MD5, and 20 bytes for SHA and SHA1.

Figure 4 - The hashbytes function

The function is fast, particularly when the input is short. For example, 850,000 randomly produced 10 character pseudo-NHS numbers were processed within 14 seconds on a standard laptop [41] using SHA1. Although theory suggests that even faster performance would be delivered by the simpler MD5 algorithm, this was not evident over the volumes checked.

The input to the hash function should always be salted by appending a static string. There are two reasons for this:
- The approach minimises the risk of dictionary attack, where values are matched against pre-computed values [42]. This is of particular importance in respect of dates and postcodes, where the range of potential values is known and not over-large. For example, if it is known that a postcode has been pseudonymised by applying a hash, then it is a simple process to create an external map by applying the same process to the postcode directory unless a salt has been applied.
- The approach allows the implementation of different pseudonymisations, both when required to support operations and to allow the production system to use a different set of pseudonyms from that generated in test and development.

In the present context, it is sufficient to append the salt to the value being pseudonymised [43], e.g.:

    DECLARE @SaltNHSNoRoot nvarchar(20)
    SET @SaltNHSNoRoot = <value of salt>

    NHS_NUMBER_PSEUDONYM_ROOT = hashbytes('SHA1',
        cast(NHS_NUMBER as nvarchar(10)) + @SaltNHSNoRoot)

    Note: this example omits DQ processing on the inbound clear text.

Figure 5 - Example of code to create a salted pseudonym

Salt values should be kept secret. Although the salt value in the above example is shown as if hard coded, this is for purposes of clarity only. Hard coding is poor practice, as it increases the risk that values can be discovered and is also less flexible. The recommended approach is to store the salts as encrypted values and recover them when needed.

41 Dell 4300, twin core 32 bit processor, 4GB memory.
42 It is easy to find a web site boasting the ability to look up against MD5 hash values which is claimed to cover all 6 character inputs plus many common passwords.
43 Where the output is integer, there is a case for using the XOR function, particularly when the input values are sequenced; see Code Sample 11.
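A minimal sketch of that recovery step follows; the table, key and certificate names are assumptions for the example, not part of any standard. The explicit NULL test matters because, as noted in 4.2.7 below, a failed key open surfaces only as a NULL result.

    -- Illustrative only: salts held encrypted in dbo.Salts(SaltName, SaltValue varbinary(256)),
    -- protected by symmetric key SaltKey (itself protected by certificate SaltCert)
    DECLARE @SaltNHSNoRoot nvarchar(20)

    OPEN SYMMETRIC KEY SaltKey DECRYPTION BY CERTIFICATE SaltCert

    SELECT @SaltNHSNoRoot = CAST(DecryptByKey(SaltValue) AS nvarchar(20))
    FROM   dbo.Salts
    WHERE  SaltName = 'NHSNoRoot'

    CLOSE SYMMETRIC KEY SaltKey

    -- A failed key open returns NULL rather than raising an error, so test explicitly
    IF @SaltNHSNoRoot IS NULL
        RAISERROR('Salt NHSNoRoot could not be recovered; aborting.', 16, 1)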
4.2.7 Note that the hashbytes function will yield a NULL value for a NULL input, and the above code will have this result if either the salt or the input value is NULL [44]. Because of the risk that problems with opening keys [45] will cause the decryption to return a NULL value for the salt, returned salt values should always be tested against this possibility prior to their use and an error raised in the event that it occurs.

Creating a root pseudonym
A cryptographic hash has a number of attractions as a mechanism for creating the root pseudonym. However, in terms of physical design, its use to generate a value for use as a primary key has a significant drawback. The problem is that while the ideal root pseudonym would be randomly related to the clear data, for performance purposes a physical SQL Server database should be built around tables where the primary keys are short (typically integer), unique values which are incremented on update [46]. The direct use of a value generated by a cryptographic hash does not meet the latter requirements. Similarly, the use of long surrogate keys is to be avoided within dimensional design [47].

Further discussion of these issues will be found in Section 4.6. The need to keep the root pseudonym short can be met in one of two ways:
- By using the extract-from-hash approach described below, with an extract length of 7 or 8 bytes to populate a bigint value [48].
- By using the cryptographic hash to randomise the sequence of data on initial load, to reduce the risk of values being inferred from sequencing, and using the identity function for later additions on the assumption that these are broadly random in their arrival.

The best approach will depend on the specific case. However:
- The use of a randomised value should be avoided when large, frequent updates are expected, because of the performance impact. Where the relevant table will be largely populated by the initial load, the approach may be acceptable, given that the systems in question are not transactional in nature [49].
- The use of the identity function should be avoided when there is either a potential need to create a consistent pseudonym across multiple systems, or a risk that data will be presented to the application in a discernible sequence. (Dates are a special case and are discussed below.)

44 More correctly, when a salt is applied by appending a string, whether a NULL in the source generates a NULL result depends on the setting of the parameter CONCAT_NULL_YIELDS_NULL. If this is set to ON, the returned value will be NULL; if OFF, the input to the hash will simply be the value of the salt. Good practice is always to set CONCAT_NULL_YIELDS_NULL to ON, as Microsoft have indicated that future releases of SQL Server will remove the option to set the value to OFF. In the event that a non-NULL result is the required behaviour, this should be handled explicitly using the ISNULL or COALESCE functions.
45 Typically caused by the user concerned having been granted insufficient rights.
46 To be strictly correct, these performance issues relate to the cluster key, which is usually, though not necessarily, the same as the table's primary key. For convenience the two have been conflated here.
47 See, for example, The Microsoft Data Warehouse Toolkit, Mundy, Thornthwaite and Kimball, Wiley 2006.
48 Note that an 8 byte extract will give a negative result around 50% of the time, as SQL Server uses two's complement to store negative values. This will normally be acceptable for a root pseudonym, where there is no requirement to transform the data further.
49 The impact may also be mitigated by the specification of a fill factor when the index is created or rebuilt. See BoL.

4.3 Extracting a value from a hash to create a public pseudonym
The basis of this approach has already been discussed above. The modification to the code example above to extract a fragment as the basis for a candidate short public pseudonym is slight:
    DECLARE @SaltNHSNoPublic1 nvarchar(20)
    SET @SaltNHSNoPublic1 = <value of salt>

    NHS_NUMBER_PSEUDONYM_PUBLIC_1 = substring(hashbytes('SHA1',
        cast(NHS_NUMBER as nvarchar(10)) + @SaltNHSNoPublic1), 1, 6)

    Note: the comments regarding the previous example apply equally to this one.

Figure 6 - Extracting a fragment from a hashed string as a candidate public pseudonym

The output of the code in Figure 6 can be converted to a bigint data type within the range set out in Table 8, according to the length of the fragment set by the parameters of the substring function [50].

It is important to note that different and unexpected results will occur if the extracted data is converted to a binary rather than a varbinary type where the field length of the binary type does not match that of the extracted substring. The reason is clear from Figure 7, which reflects alternative handling of the same extracted substring. If first converted into an over-size binary data type, the value will be right padded with zeros, which will give rise to a large or negative result. Direct conversion to a bigint, or conversion via varbinary, results in left padding and the expected result as set out in Table 8.

    Representation | A: converted to 8 byte   | B: converted to 8 byte | C: converted direct
                   | binary then bigint       | varbinary then bigint  | to bigint
    Internal       | 8E.6D.93.3E.C8.00.00.00  | 8E.6D.93.3E.C8         | 8E.6D.93.3E.C8
    External       | (large or negative value)| (expected value)       | (expected value)

Figure 7 - Beware of binary data types when extracting fragments

The number of characters in the output, which has a pseudo-random distribution (see Appendix 2), can be further compressed by using Base16 (hexadecimal) or Base32. When account is taken of this, extract lengths of 5 bytes are found to be appropriate in respect of the postcode and 6 bytes in respect of the NHS number:

    Use        | Extract length | Range Min | Range Max                    | Length at Base32
    Postcode   | 5              | 0         | 2^40-1 (1,099,511,627,775)   | 8
    NHS Number | 6              | 0         | 2^48-1 (281,474,976,710,655) | 10

Table 9 - Extract lengths to support postcode and NHS Number

Extracting a fragment from a salted hash produces a candidate pseudonym. However, because of the reduced size of the pool of potential values, there is the possibility of a clash with a pseudonym already created for a different clear value. These cases need to be identified and resolved; a sketch of the identification step follows. In the case of NHS numbers, the large size of the pool generated by extracting a six byte value implies that clashes will be very rare: none were found in a test undertaken on over 5m distinct NHS numbers.

50 Note that the substring is not constrained to start at position 1.
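The following is a minimal sketch of that identification step, assuming candidate pseudonyms have been staged in a working table; all names are illustrative.

    -- Illustrative only: find candidate public pseudonyms shared by more than one
    -- distinct clear value; these are the clashes that must be reworked
    SELECT CandidatePseudonym
    FROM   dbo.PseudonymWork
    GROUP  BY CandidatePseudonym
    HAVING COUNT(DISTINCT ClearValue) > 1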
Clashes can be expected in the case of postcodes, where the pool is smaller, though these will be few: applying the approach to the 2.4m postcodes in PAF gives rise to 2 or 3 cases of duplication [51]. Although as a point of principle code should support iterative checking and adjustment, we have yet to find a case where more than one pass is required to remove any identified clashes.

It should be noted that the use of Base32 to present the data implies that the NHS number pseudonym will include alphanumeric characters. Adjustments to the approach are required if output must be in accordance with the data dictionary standard for an NHS number, and an option addressing this case is set out in the code samples [52]. However, this option should be avoided if possible as:
- The resulting pseudonym is not readily distinguished from real NHS numbers and will overlap them.
- Pseudonyms are drawn from a far smaller pool than under the Base32 approach. In consequence there is a higher probability of collisions being created which must then be resolved: a run on 400,000 distinct pseudo NHS numbers resulted in an average of 11 duplications.

51 The number varies slightly according to the salt value.
52 See Code Sample 9.

4.4 Putting it together
The low number of clashes means that the efficient approach to processing is to create the root and a set of candidate public pseudonyms through set based processing, identify any clashes/duplicates and handle these using cursor based processing, changing the pseudonym iteratively until there are no outstanding duplicates. Where the root pseudonym is based on a full cryptographic hash, the root and public pseudonyms can be created in a single pass, leaving further (cursor based) processing to the few cases identified as duplicates.

This approach has proved to be reasonably fast: in a development environment, the creation of a root and three public pseudonyms for the postcodes in the postcode address file took 7 minutes, and the pseudonymisation of 5m distinct NHS numbers took 9 minutes. (Clearly the times required to pseudonymise incremental updates will be significantly less.) Code examples covering these cases will be found in Appendix 3.

The overall process is summarised in Figure 8:
- Clean and reformat input
- Check and flag DQ cases
- Identify values with no previously identified pseudonym
- Derive missing pseudonyms:
  - recover salts
  - derive root and candidate public pseudonyms
  - resolve any clashes
- Update master with root and public pseudonyms
- Apply root pseudonym to source data
- Remove clear data from source

Figure 8 - Pseudonymisation process

The following mechanisms were found to work efficiently in creating new pseudonyms in cases where clashes were identified:
- Increment the value of the starting point for extracting the fragment until no duplicate is found for the derived pseudonym.
- Concatenate the clear source (or elements of the clear source) which has given rise to a clash back onto itself and regenerate the pseudonym. An example is given below:

    Pass | Input               | Pseudonym
    1    | SA5 8YE             | NTHJJH15
    1    | BD12 8NW            | NTHJJH15  (clash)
    2    | SA5 8YE + SA5 8YE   | 3GG8VTDW
    2    | BD12 8NW + BD12 8NW | HGKD2C5W

In both cases the process is iterated until the clash is removed, although no cases requiring more than a second pass have occurred in testing.

The key point is that once the pseudonyms have been created, the process is then to recover existing pseudonyms via a look-up to the pseudonymisation master table. The ability to regenerate pseudonyms from source data is primarily a support to disaster recovery, although with careful management it can also be used where a common pseudonymisation is to be implemented across multiple systems.

4.5 Pseudonymising dates
Dates are a special case, because the starting values are known, ordering is sequential and the number of cases under consideration is small: there are fewer than 50,000 days between 1890 and the present. In addition, it is sometimes possible to derive key dates by inference; for example, the date of admission of a newborn will probably also be the date of birth.

For this reason, effective pseudonymisation of dates is problematic. It is important therefore that the exposure of pseudonymised dates to end users is minimised. This may require effort to extend the range of derivations available to users.

In the case of date pseudonymisation, the preferred approach is to use a pre-populated table containing (pseudo-)randomly distributed numbers drawn from a large population. The approach adopted in the code example [53] has been as follows:
- Generate a sequential list of numbers covering the required date range (from 1890).
- Use these to create four sets of random numbers.
- Delete any record where one or other of the four random numbers is a duplicate of a number in the same set, and re-sequence (working columns: RecordID, RootPseudo, SecPseudo1, SecPseudo2, SecPseudo3).

53 See Code Sample 10.
- Create the date sequence from the re-sequenced records:

    RootPseudo | PubPseudo1 | PubPseudo2 | PubPseudo3 | Date
    ...        | ...        | ...        | ...        | 1st Jan 1890
    ...        | ...        | ...        | ...        | 2nd Jan 1890
    ...        | ...        | ...        | ...        | 3rd Jan 1890
    ...        | ...        | ...        | ...        | 4th Jan 1890

Public pseudonyms can be presented as alphanumeric Base32 as an alternative presentation if required:

    RootPseudo | Date         | PubPseudo3
    ...        | 1st Jan 1890 | V0PS8J...
    ...        | 2nd Jan 1890 | LL43NT7B
    ...        | 3rd Jan 1890 | ...AFS72JJ
    ...        | 4th Jan 1890 | V0PS8J75

    Note: only PubPseudo3 shown for clarity.

Note that the actual number of clashes encountered in real-world implementation is small.

The same approach can be used to generate alternate dates as pseudonyms. Alternate dates bear no relation to the input value but are presented in date format. While this can help in specific cases where the pseudonym must replicate the format of the clear data precisely, it is to be avoided if at all possible because of the risk that attempts will be made to base derivations on the alternative date values. Alternate dates are implemented using the methodology set out above, but with the extract length restricted to binary(4) and the result cast as a datetime.

4.6 Indexes
By definition, a good pseudonym has a random relationship to the clear data which it reflects. In addition, there should be no correlation between different pseudonymisations. However, as indicated in the earlier discussion, a long and randomised variable will give rise to poor performance when used as a cluster key. There are two distinct issues:
- Because of the internal workings of SQL Server, the cluster key is a component not only of the cluster index but of all other indexes built on the table. The use of a long key can in consequence cause substantial inflation in data volumes. For similar reasons, significant problems can also be caused by the use of over-long surrogate keys to link dimension and fact tables in dimensional models.
- For SQL Server, a table with a clustered index is physically ordered on the cluster key. A randomly distributed cluster key will tend to cause the index and table to become fragmented on update.

Key length is a significant issue in these cases. The impact of fragmentation is normally a lesser concern for business intelligence systems such as those under discussion here: updates are less frequent than for transactional processing, user access is in read-only mode, and the database can be taken off-line overnight and at weekends to rebuild indexes.

That said, the best approach is generally to use an identity as the cluster key. Where requirements indicate otherwise and a randomised value is to be used for the root and as the cluster key, consideration should be given to specifying the fill factor and pad index explicitly [54].

In addition to their role in maintaining performance, correctly specified indexes are also an important mechanism for ensuring the integrity of the pseudonymisation master table(s):

54 For further details, see BoL.
- The root pseudonym will invariably be a unique and non-null key, so will not support duplicate values.
- Public pseudonyms should be covered by a unique index. This will not allow duplicates, but will allow a single NULL value unless the column is declared as NOT NULL. This setting should reflect the approach adopted in light of the discussion of null values above. A sketch of this arrangement is given at the end of this section.

The final check on the integrity of the tables when indexes are specified in this way is to ensure that the total number of new entries in the pseudonymisation master table is equal to the number of unique clear values submitted for pseudonymisation.

Although the wider impact of indexing on performance is outside the scope of this paper, foreign keys in data tables will usually need to be indexed to provide performance, though this will depend on the access path. Users should be aware of the performance benefits which can arise from using the INCLUDE clause to create a covering index in certain circumstances [55].

55 See BoL for details.
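A minimal sketch of the arrangement described above follows; all table, column and index names, and the column sizes, are assumptions for the example.

    -- Illustrative only: master table with a unique, non-null root pseudonym and a
    -- unique index on the public pseudonym (which still admits a single NULL)
    CREATE TABLE dbo.NhsNoPseudonymMaster (
        RootPseudonym    bigint        NOT NULL PRIMARY KEY,
        ClearValue       varbinary(68) NULL,          -- encrypted clear value, if held
        PublicPseudonym1 varchar(10)   NULL           -- NULL when the input value was NULL
    )

    CREATE UNIQUE INDEX UX_NhsNoPseudonymMaster_Public1
        ON dbo.NhsNoPseudonymMaster (PublicPseudonym1)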
Appendix 1 Other useful sources

Encryption and hash functions
- Cryptography in SQL Server
- Encryption hierarchy
- Practical observations on encryption
- Cryptographic hash
- Using the hash function to obscure data
- Converting binary and varbinary data

Other items referenced in this document
- Implementing Row- and Cell-Level Security in Classified Databases Using SQL Server
- Indexes: general overview of indexes
- SQL Server 2005 Security Best Practices: Operational and Administrative Tasks
- SQL Server 2005 Best Practices Analyzer
- Microsoft Baseline Security Analyzer
- Microsoft Security Assessment Tool
- Security Compliance Management Toolkit Series
- US Department of Defense Checklist and Code
- SQL Server 2005 Integration Services: Security Considerations for Integration Services
- Integration Services: Setting the Protection Level of Packages
Appendix 2 Mechanisms for Creating Random Numbers

1. Mechanisms for obtaining a value which is randomly associated with the source data include:
- the SQL Server identity function;
- the SQL Server unique identifier function;
- the rand function;
- the CRYPT_GEN_RANDOM function (SQL Server 2008 only);
- a sub-string generated by taking a segment of a cryptographic hash.

2. The characteristics of the different approaches are summarised in Figure 9 below; the basis of the analysis is set out in Code Sample 17. The figures quoted are the numbers of duplicated records generated per 1,000,000 cases.

Identity function (duplicates: 0)
Produces a value which is incremented by 1 (or another constant) for each new record. While clearly not random in itself, it may be regarded as having a pseudo-random relationship to related fields provided that records can safely be regarded as being added randomly to the table.
Comment: short length (4 or 8 byte integer); has the best characteristics for generating a primary clustered key; no need to check for and remove clashes; cannot be reliably recreated from source data; only one identity allowed for each table.

Global Unique Identifier function NEWID() (duplicates: 0)
Designed to provide an identifier which is truly unique. Implemented as a default derivation to a table column of type uniqueidentifier.
Comment: no need to check for and remove clashes; relatively long (16 bytes) and produces a value rather than a number; pseudo-random; cannot be recreated from source data.

Rand() function, unseeded, via set processing (duplicates: 1,000,000)
Produces a quasi-random value between 0 and 1 each time it is called. If called without a record-specific seed as part of set processing, it will assume the same value for all records.
Comment: included to demonstrate that Rand() is not useful within set processing.

Rand() function, seeded at record level (duplicates: 518)
The rand function is applied to an integer seed which varies for each value to be pseudonymised. For a static input, duplicates can be removed by brute force before the random set is used. Where new pseudonyms are required case by case, there is a need to test that the random value has not been previously issued and to obtain a new value if it has.
Comment: will not automatically generate a unique number, and the number of clashes will depend in part on the characteristics of the seed; need to check for and remove clashes before use; length is determined by the processing of the random number; the quality of randomness has been questioned; very low chance of recovery from source data unless the integer used to seed is available.

CRYPT_GEN_RANDOM function
A cryptographically strong random number generator introduced in SQL Server 2008. Pseudo-random each time called and cannot be seeded to predetermine the numbers which will be issued.
Comment: will not automatically generate a unique number; need to check for and remove clashes before use; length is determined by the input parameter; no ability to recover the pseudonym from the source value.

4 byte subset of hash function, convertible to integer (duplicates: 262)
Four bytes are recovered from the hash of a unique input value and can be cast as an integer. Because the hash is a one-way function which can be applied to a text string, and the distribution is believed random, there is the potential to use this as a random number.
Comment: will not automatically generate a unique number; need to check for and remove clashes before use; length is four bytes and may encompass positive and negative values in the absence of additional processing; potential to recover the pseudonym from source data when input and salt are known, though in the example considered here it would not be possible to do so for 0.03% of cases.

5 byte subset of hash function (duplicates: 2)
As above; gives an 8 character output if presented using Base32.
Comment: will not automatically generate a unique number; need to check for and remove clashes before use; length at Base32 is 8 characters, suitable for a postcode short pseudonym; the potential to recover the pseudonym from source data when input and salt are known is very high, as very few clashes are created and, depending on processing, these may be capable of resolution.

8 byte subset of hash function converted to big integer (duplicates: 0)
Eight bytes are recovered from the hash of an input value and can be cast as bigint. As above, but the numbers are generated across a wider range [56].
Comment: not guaranteed to generate a unique number, though the probability appears very good for other than very large sets; the need to check for and remove clashes before use depends on context; length is eight bytes and may encompass positive and negative values in the absence of additional processing, which translates to a 20 character number that can be shrunk to 14 characters using Base32; the potential to recover the pseudonym from source data when input and salt are known is very good; the internal representation is guaranteed consistent within a given version, but the impact of a change of version must be checked (note: no checks have been made comparing 64 and 32 bit implementations).

6 byte subset of hash function converted to big integer (duplicates: 0)
As above, except that only 6 bytes of the output from the hash are stored as bigint, limiting the maximum value.
Comment: see above; clearly the reduction in range increases the risk of clash, and checks should always be made where this is used as a root; if presented using Base32 the output has a length of 10, suitable for NHS number pseudonymisation; very good prospect of recovering the pseudonym provided the input values and salt are known.

Figure 9 - Approaches to generating pseudo-random numbers compared

3. A check was made on the distribution of integers extracted as a sub-string from the varbinary generated by a hash function when applied to a sequence running from 1 to 1,000,000, with conversion to integer from a 6 byte extract.

4. Figure 10 showed the results when 20 bins of equal range were created to cover the output values of the pseudonymisation, recording the minimum and maximum input values which generated each bin and the number of input cases falling into each bin:
- there is no obvious correlation between the distribution of input and output;
- there is a broadly even distribution of input cases in each output bin.

Figure 10 - Distribution of integer subset of hash function over 1,000,000 cases (20 bins of equal output range: min/max output values, min/max input values and case count per bin). The code to produce this histogram is referenced as item 18 in Appendix 3.

56 See the distribution analysis at paragraphs 3 and 4 for the specific characteristics of this approach.
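By way of illustration of the duplicate counts reported in Figure 9, the following is a minimal sketch of the kind of check described above, not the published Code Sample 17; the numbers table dbo.Numbers(n), assumed to cover 1 to 1,000,000, is an assumption for the example.

    -- Illustrative only: hash each value in a 1..1,000,000 sequence, extract 6 bytes,
    -- and count how many extracted values occur more than once
    SELECT COUNT(*) AS DuplicatedValues
    FROM (
        SELECT Extract6
        FROM (
            SELECT CAST(SUBSTRING(HASHBYTES('SHA1', CAST(n AS nvarchar(10))), 1, 6) AS bigint) AS Extract6
            FROM dbo.Numbers
        ) AS h
        GROUP BY Extract6
        HAVING COUNT(*) > 1
    ) AS dup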
Appendix 3 List of code samples
Code samples are listed below and are available separately (Ref 6) to allow access to the detail of the SQL code. Code examples are provided to illustrate the narrative only and no responsibility is taken for the correctness or reliability of their operation; it is the absolute responsibility of the user to test their operation in the context of a proposed solution.

Data quality and cleaning
1 Postcode formatting and validation: code used to validate and consistently reformat full and part postcodes (3 routines) to BS7666/GIS
2 Cleaning date fields: code to clean datetime fields
3 NHS Number check digit: code to calculate the NHS Number (modulus 11) check digit

Encryption
4 Encryption setup
5 Code to maintain and recover encrypted salt values

Pseudonymisation
6 Exemplification: a generic pseudonymisation function
7 Exemplification: pseudonymising the postcode directory and subsequent maintenance when new postcodes are presented
8 NHS Number exemplification
9 Special case pseudonyms: where the NHS number pseudonym must be numeric
10 Dates: code to pre-populate an extended date table

Presentation of pseudonyms
11 Function to format varbinary as text hexadecimal
12 Functions to format big integer to Base32
13 Generic function: pseudonymisation and Base32 presentation

MS-Access
14 Using SQL Server Express to pseudonymise MS-Access data

Re-identification
15 Code to recover secondary pseudonym by reference to root pseudonym

Code used to support the analysis in the document
16 Example: different input formats give rise to different pseudonyms
17 Analysis of alternative approaches to creating pseudo-random numbers
18 Code to create histogram of output v input distribution
Electronic Trading Information Template Preface This Electronic Trading Information Template (the "Template") has been created through the collaborative efforts of the professional associations listed
Professional Diploma in Marketing
Professional Diploma in Marketing 543 Project Management in Marketing Work-Based Project Brief and Mark Scheme September 2013 Candidates are required to choose ONE out of the following TWO project briefs:
DATABASE ANALYST I DATABASE ANALYST II
CITY OF ROSEVILLE DATABASE ANALYST I DATABASE ANALYST II DEFINITION To perform professional level work in designing, installing, managing, updating, and securing a variety of database systems, including
Information Security Policy September 2009 Newman University IT Services. Information Security Policy
Contents 1. Statement 1.1 Introduction 1.2 Objectives 1.3 Scope and Policy Structure 1.4 Risk Assessment and Management 1.5 Responsibilities for Information Security 2. Compliance 3. HR Security 3.1 Terms
Improving information to support decision making: standards for better quality data
Public sector November 2007 Improving information to support decision making: standards for better quality data A framework to support improvement in data quality in the public sector Improving information
Methods to increase search performance for encrypted databases
Available online at www.sciencedirect.com Procedia Economics and Finance 3 ( 2012 ) 1063 1068 Emerging Markets Queries in Finance and Business Methods to increase search performance for encrypted databases
Business Intelligence (BI) Data Store Project Discussion / Draft Outline for Requirements Document
Business Intelligence (BI) Data Store Project Discussion / Draft Outline for Requirements Document Approval Contacts Sign-off Copy Distribution (List of Names) Revision History Definitions (Organization
Information Governance Strategy
Information Governance Strategy Document Status Draft Version: V2.1 DOCUMENT CHANGE HISTORY Initiated by Date Author Information Governance Requirements September 2007 Information Governance Group Version
Information Governance Policy. 2 RESPONSIBLE PERSON: Steve Beeho, Head of Integrated Governance. All CCG-employed staff.
Information Governance Policy 1 SUMMARY This policy is intended to ensure that staff are fully aware of their Information Governance (IG) responsibilities, so that they can effectively manage and best
Data Security and Governance with Enterprise Enabler
Copyright 2014 Stone Bond Technologies, L.P. All rights reserved. The information contained in this document represents the current view of Stone Bond Technologies on the issue discussed as of the date
B.Sc (Computer Science) Database Management Systems UNIT-V
1 B.Sc (Computer Science) Database Management Systems UNIT-V Business Intelligence? Business intelligence is a term used to describe a comprehensive cohesive and integrated set of tools and process used
Supplier Information Security Addendum for GE Restricted Data
Supplier Information Security Addendum for GE Restricted Data This Supplier Information Security Addendum lists the security controls that GE Suppliers are required to adopt when accessing, processing,
IBM Campaign Version-independent Integration with IBM Engage Version 1 Release 3 April 8, 2016. Integration Guide IBM
IBM Campaign Version-independent Integration with IBM Engage Version 1 Release 3 April 8, 2016 Integration Guide IBM Note Before using this information and the product it supports, read the information
Model risk mitigation and cost reduction through effective design*
Model Risk Management Group Model risk mitigation and cost reduction through effective design* *connectedthinking pwc Table of contents The heart of the matter...1 What this means for your business...2
SecureDoc Disk Encryption Cryptographic Engine
SecureDoc Disk Encryption Cryptographic Engine FIPS 140-2 Non-Proprietary Security Policy Abstract: This document specifies Security Policy enforced by SecureDoc Cryptographic Engine compliant with the
Assessment of Software for Government
Version 1.0, April 2012 Aim 1. This document presents an assessment model for selecting software, including open source software, for use across Government, and the wider UK public sector. 2. It is presented
Data Protection Act 1998. Guidance on the use of cloud computing
Data Protection Act 1998 Guidance on the use of cloud computing Contents Overview... 2 Introduction... 2 What is cloud computing?... 3 Definitions... 3 Deployment models... 4 Service models... 5 Layered
SUS R13 PbR Technical Guidance
SUS R13 PbR Technical Guidance Published 2nd April 2013 We are the trusted source of authoritative data and information relating to health and care. www.hscic.gov.uk [email protected] Contents Introduction
Using Managed Services As A Software Delivery Model In Canadian Health Care
Using Managed Services As A Software Delivery Model In Canadian Health Care September 9, 2005 Authors: Darren Jones Darcy Matras INTRODUCTION... 3 MANAGED SERVICES DEFINED... 4 MANAGED SERVICES OVERVIEW...
Product Guide. Sawmill Analytics, Swindon SN4 9LZ UK [email protected] tel: +44 845 250 4470
Product Guide What is Sawmill Sawmill is a highly sophisticated and flexible analysis and reporting tool. It can read text log files from over 800 different sources and analyse their content. Once analyzed
Second Clinical Safety Review of the Personally Controlled Electronic Health Record (PCEHR) June 2013
Second Clinical Safety Review of the Personally Controlled Electronic Health Record (PCEHR) June 2013 Undertaken by KPMG on behalf of Australian Commission on Safety and Quality in Health Care Contents
INFORMATION GOVERNANCE AND SECURITY 1 POLICY DRAFTED BY: INFORMATION GOVERNANCE LEAD 2 ACCOUNTABLE DIRECTOR: SENIOR INFORMATION RISK OWNER
INFORMATION GOVERNANCE AND SECURITY 1 POLICY DRAFTED BY: INFORMATION GOVERNANCE LEAD 2 ACCOUNTABLE DIRECTOR: SENIOR INFORMATION RISK OWNER 3 APPLIES TO: ALL STAFF 4 COMMITTEE & DATE APPROVED: AUDIT COMMITTEE
