Test Data Protection
Providing Secure Representative Data Sets

By Dr. Ron Indeck
VelociData Inc. - www.velocidata.com

World Headquarters
321 North Clark Street, Suite 740
Chicago, IL 60654
Telephone: 312-600-4422

Research & Development
349 Marshall Ave, Suite 302
St Louis, MO 63119
Telephone: 314-499-8984
Contents

1. VelociData Enterprise Streaming Compute Appliance (ESCA)
2. Test Data
3. Creating a Representative Model Copy
   3.1 Challenges in Creating the Model Copy
4. VelociData TDP
5. Format Preserving Masking
6. Deterministic Masking
7. TDP Use Cases
   7.1 Use Case 1: Creating a secure, HIPAA-compliant full production dataset from Microsoft SQL Server
   7.2 Use Case 2: Secure data for insertion into an Azure cloud
   7.3 Use Case 3: Securing test data for off-shore developers
   7.4 Use Case 4: Creating daily datasets for Development, QA, and Test Integration
8. Summary
9. Let Us Help You

All Content 2015 VelociData Inc.
1. VelociData Enterprise Streaming Compute Appliance (ESCA)

The VelociData Enterprise Streaming Compute Appliance (ESCA) is the result of over two decades of development and the deployment of hundreds of systems in the most demanding IT environments. The system is a unique combination of components dedicated to high-performance processing of streaming and serial information.

Figure 1: The first enterprise streaming compute appliance. (Diagram labels: Cloud; Production Databases; Mainframe; Application Servers; Enterprise Data Warehouse; Hadoop HDFS; Sensitive Data Protection; Streaming Data Ingestion; Batch Process Delegation; Streaming Data Masking, Encryption, Transformation & Distribution.)

This white paper focuses on using ESCA to protect sensitive data when it is used for testing software applications. To do this, the sensitive values must be rendered unusable, yet the data must still retain their format (e.g., obfuscated telephone numbers will still be 10 ASCII digits), their volume (no specific subsetting is required), and their relations (fields will still join properly). These processes can be applied as the data move from source to target, and representative model copies can differ for different targets without slowing the data down.

2. Test Data

Development, testing, and quality assurance groups need access to data to build and test applications. For better, more rapid development, that data needs to look and feel like real production data. Many organizations achieve that look and feel by copying production data over directly.
This is acceptable for some data sets, but when the production environment holds PHI, PCI, or any other PII data, this exposes the company to unnecessary risk, including:

- Exposing sensitive data to a (drastically) broader set of users provides greater opportunity for breaches due to social engineering
- IT organizations need to manage and secure more user accounts, more data centers or network segments, and more copies of data at rest
As an alternative, organizations could offer anyone who doesn't truly need the production data access to a Model Copy that holds the key characteristics of the production data yet doesn't carry any true personally identifying information. To offer this effectively, it's important to differentiate between systems or users that need actual production data and those that can work with a representative model copy:

Table 2: Example Data Needs

Production Data:
- Transactional Systems
- Billing Systems
- Fraud Detection Applications
- Reporting (user specific)

Characteristic Data (Model Copy):
- Analytics
- Application Development
- Testing / QA
- Reporting (general reports)
- Proof of Concept / Evaluation Projects

The key characteristics of model copies are that they must be representative in data character, distribution, and volume, and they must be fast and easy to generate. When model copies can be generated quickly and easily, administrators can strictly limit access to raw production data while still safely providing representative data to a broad set of users. This provides several benefits, including:

- Less need for limiting user access, compensating controls, securing environments, etc.
- Less pressure to expose production data to different development groups (especially when the model copy very closely mirrors the production data)
- Faster, more productive development, QA, integration testing, etc.

3. Creating a Representative Model Copy

One of the best ways to generate a truly representative model copy is to perform a selective, deterministic, format-preserving masking operation on the raw production data to generate a derived output. This ensures that the test data very closely mirror production for many different purposes.
- Representative: The test data are derived table for table, row for row from the production data
- Selective: Any sensitive fields (e.g., PHI) within those tables are masked using a NIST standard algorithm
- Deterministic: Identical input values always map to the same masked output value, so correlations and joins still match on the same keys
- Format-Preserving: Output records must maintain the same data format (text, phone numbers, social security numbers, dates, etc.)

When all of these conditions are met, testing environments can use the same database schemas and the same testing algorithms, run the same processing operations, and observe the same volumes and capacities that will be observed in the production environment.
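The deterministic and format-preserving properties listed above can be illustrated with a small sketch. This is not the NIST algorithm or VelociData's implementation; it is a one-way stand-in built on HMAC-SHA256 from the Python standard library, and the key, table contents, and function names are all hypothetical. Unlike true format-preserving encryption, this toy mapping is not guaranteed to be collision-free.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key, illustration only


def mask_digits(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministically replace each digit; keep length and punctuation in place."""
    digits = "".join(c for c in value if c.isdigit())
    digest = hmac.new(key, digits.encode(), hashlib.sha256).digest()
    stream = int.from_bytes(digest, "big")  # keyed pseudo-random integer
    out = []
    for c in value:
        if c.isdigit():
            stream, d = divmod(stream, 10)  # peel off one decimal digit
            out.append(str(d))
        else:
            out.append(c)  # hyphens and other separators are preserved
    return "".join(out)


def claims_per_patient(patients, claims):
    """Join claims to patients on the ssn key and count matches per patient."""
    return [sum(1 for c in claims if c["ssn"] == p["ssn"]) for p in patients]


# Hypothetical production tables that share the SSN as a join key.
patients = [{"ssn": "123-45-6789", "name": "John"},
            {"ssn": "987-65-4321", "name": "Mary"}]
claims = [{"ssn": "123-45-6789", "amount": 120.0},
          {"ssn": "987-65-4321", "amount": 75.5},
          {"ssn": "123-45-6789", "amount": 42.0}]

# Selective masking: only the sensitive ssn field is changed.
masked_patients = [{**p, "ssn": mask_digits(p["ssn"])} for p in patients]
masked_claims = [{**c, "ssn": mask_digits(c["ssn"])} for c in claims]

print(claims_per_patient(patients, claims))
print(claims_per_patient(masked_patients, masked_claims))
```

Because the same input always yields the same output, the join on the masked key returns the same match counts as the join on the original key, while every masked SSN keeps the NNN-NN-NNNN layout.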
Figure 2: Test Data Protection

3.1 Challenges in Creating the Model Copy

Several shortcomings of the solutions currently on the market make it challenging to create a true model copy effectively:

1. Formatting or Schema Changes: Many masking solutions require changes to the format of the data elements when encrypting or masking the data.
2. Lack of Deterministic Behavior: Many simple masking solutions perform pseudo-random operations on the data to mask it, breaking the ability to perform correlations, aggregations, etc.
3. Limited Performance: Most software vendors that provide format-preserving encryption only transform a few hundred fields per second, which makes large data copies infeasible given typical time windows.
4. Lack of Tool Integration: Many masking solutions are not integrated into data movement / data transformation components, requiring users to create complex multi-product, multi-step jobs.
5. Hard-to-Use Interfaces: Most solutions require complicated tools to access masking functionality.
6. Discovery Challenges: Identifying PHI / PII elements is often a time-consuming chore.
7. Insufficient Throughput: The inability to perform daily refreshes or offer production-sized volumes for stress and performance testing often results in data subsetting rather than full model copies.
4. VelociData TDP

VelociData offers a solution that can perform format-preserving masking while facilitating the data movement and data transformations required to move data between production and test / development environments. This solution includes:

Table 2: VelociData TDP

Feature: Description
- Format Preserving Masking (static and dynamic): Ability to de-identify data without changing its characteristics (permanent and reversible). Note that both static and dynamic operations are fully deterministic.
- Hashing (MD5, SHA-2): Combine multiple input fields into a hashed surrogate key that can be used for tokenization.
- Field Redaction: Ability to remove / clear sensitive data elements that are not required for the model copy.
- Data Transformation: Ability to connect to a wide variety of data sources and to transform data formats in between (e.g., mainframe EBCDIC to ASCII).
- Lookup / Replace: Ability to perform lookup-based replacements of sensitive terms with non-sensitive values.

5. Format Preserving Masking

VelociData offers a format-preserving masking or format-preserving encryption option that conforms to the NIST 800-38G standard. This solution can mask or encrypt data without changing the format of the fields. This means that a credit card number that is stored as 16 ASCII numeric digits can be deterministically masked into 16 ASCII numeric digits. A varchar name field in the database can be masked or encrypted into an equivalent number of alphabetic characters.
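To make the 16-digits-in, 16-digits-out idea concrete, the sketch below builds a toy reversible cipher over decimal strings using a Feistel network, the general construction on which the NIST 800-38G modes are based. It is emphatically not FF1 or FF3 and should not be used to protect real data: the round function (HMAC-SHA256), the round count, the key, and all function names are assumptions chosen for brevity, and the paper does not describe how the appliance implements the standard.

```python
import hashlib
import hmac

ROUNDS = 10  # assumption: round count picked arbitrarily for this toy


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Keyed pseudo-random value derived from one half of the field."""
    msg = f"{round_no}:{half}".encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _check(digits: str) -> None:
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2:
        raise ValueError("expected an even-length string of decimal digits")


def fpe_encrypt(digits: str, key: bytes) -> str:
    """Map a decimal string to another decimal string of the same length."""
    _check(digits)
    h = len(digits) // 2
    mod = 10 ** h
    left, right = digits[:h], digits[h:]
    for r in range(ROUNDS):
        # Feistel step: (L, R) -> (R, (L + F(R)) mod 10^h)
        new_right = (int(left) + _round_value(key, r, right, mod)) % mod
        left, right = right, f"{new_right:0{h}d}"
    return left + right


def fpe_decrypt(digits: str, key: bytes) -> str:
    """Invert fpe_encrypt by running the rounds backwards."""
    _check(digits)
    h = len(digits) // 2
    mod = 10 ** h
    left, right = digits[:h], digits[h:]
    for r in reversed(range(ROUNDS)):
        # Undo one step: previous R is the current L; recover previous L.
        prev_left = (int(right) - _round_value(key, r, left, mod)) % mod
        left, right = f"{prev_left:0{h}d}", left
    return left + right


key = b"demo-key-not-for-production"  # hypothetical key
card = "4111111111111111"             # well-known test card number
token = fpe_encrypt(card, key)
print(len(token), token.isdigit(), fpe_decrypt(token, key) == card)
```

The output has the same length and alphabet as the input, so it fits the original column definition, and the same key recovers the original value when reversible processing is required.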
Figure 3: Example Masking

This format-preserving characteristic allows users to fully secure their data without needing to change the database schema of development or testing systems. Below are the field types currently supported or in development by VelociData:

Table 3: VelociData Masking Data Types

Value: Description
- name: All alphabetic characters and hyphens
- numeric: ASCII numeric digits: 0-9
- alphabetic: Upper and lowercase characters: a-z and A-Z
- alphabetic_uppercase: All upper case alphabetic characters: A-Z
- alphabetic_lowercase: All lower case alphabetic characters: a-z
- alphanumeric: All alphabetic characters and base 10 digits: a-z, A-Z, 0-9
- alphanumeric_uppercase: All upper case alphabetic characters and base 10 digits: A-Z and 0-9
- alphanumeric_lowercase: All lower case alphabetic characters and base 10 digits: a-z and 0-9
- hex_uppercase: ASCII numeric digits 0-9 and letters A-F
- hex_lowercase: ASCII numeric digits 0-9 and letters a-f
- date: Dates in ASCII numbers, in the format YYYYMMDD
- printable: All printable ASCII characters
- everything: The full set of ASCII characters
- mailing_address: In development. Ability to mask addresses into valid USPS mailing address output
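One way to read Table 3 is as a set of alphabets that a masked value must stay within. The sketch below encodes that reading as a simple checker; the type names come from the table, but the exact membership of "printable" (taken here as ASCII 0x20 to 0x7E) and "everything" (ASCII 0x00 to 0x7F) is an assumption, mailing_address is omitted because it is still in development, and the function names are hypothetical.

```python
import string
from datetime import datetime

# Alphabets keyed by the Table 3 type names (membership partly assumed).
ALPHABETS = {
    "name": set(string.ascii_letters + "-"),
    "numeric": set(string.digits),
    "alphabetic": set(string.ascii_letters),
    "alphabetic_uppercase": set(string.ascii_uppercase),
    "alphabetic_lowercase": set(string.ascii_lowercase),
    "alphanumeric": set(string.ascii_letters + string.digits),
    "alphanumeric_uppercase": set(string.ascii_uppercase + string.digits),
    "alphanumeric_lowercase": set(string.ascii_lowercase + string.digits),
    "hex_uppercase": set(string.digits + "ABCDEF"),
    "hex_lowercase": set(string.digits + "abcdef"),
    "printable": {chr(c) for c in range(0x20, 0x7F)},
    "everything": {chr(c) for c in range(0x80)},
}


def conforms(value: str, type_name: str) -> bool:
    """Return True if every character of value belongs to the named type."""
    if type_name == "date":
        # Dates must be exactly YYYYMMDD and denote a real calendar day.
        if len(value) != 8:
            return False
        try:
            datetime.strptime(value, "%Y%m%d")
        except ValueError:
            return False
        return True
    return all(ch in ALPHABETS[type_name] for ch in value)


def format_preserved(original: str, masked: str, type_name: str) -> bool:
    """Check that a masked value kept the original length and alphabet."""
    return len(original) == len(masked) and conforms(masked, type_name)


print(format_preserved("4111111111111111", "9920574318860245", "numeric"))
print(conforms("Smith-Jones", "name"), conforms("2015-01-31", "date"))
```

A check like this could be run over a sample of masked output in a test environment to confirm that no field has drifted outside the type its schema expects.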
Also note that VelociData's performance allows data to be masked or encrypted at 10 million fields per second, whereas competing solutions can handle hundreds or thousands of fields per second. Because many fields are typically encrypted in each record of a data set, this is the difference between trickling records through the system at dozens per second and moving data through at hundreds of thousands of records per second. When production data sets contain millions or billions of records, this could mean the difference between being forced to mask only a small subset of your data and being able to mask the entire data set in a matter of minutes.

6. Deterministic Masking

The deterministic nature of masking is critical in ensuring that data in the model copy are truly representative of your source data set. To clarify what that means, consider the diagram below:

Figure 4: Deterministic Masking

Notice in this case that John is masked to id Hw each time it is observed in the data, and that the patient's SSN is masked to the same output value every time, even across multiple different tables. This allows data sets to be joined and correlated even when the join keys themselves are masked. This is a strong feature to consider when choosing a masking solution.

Another feature of the VelociData system is the choice between one-way obfuscation and reversible processing. For most applications involving model copies for test environments, there is no need to ever reverse the process and recover the original information. In the rare circumstances where the original data need to be recovered, VelociData works with key management systems to enable reversible processing when required. These methods and modes can all be accommodated on data in flight passing through the network or on static data at rest headed for data stores, including data warehouses and HDFS.

Table 4: VelociData Data Masking Processing Types

Form of Obfuscation: Description
- Redaction/removal: Removing original information in its entirety (no spaces or other characters left); in some instances a single character, e.g., *, may denote a point of redaction
- Scrambling/shuffling: No fixed algorithm; information is replaced with a series of (pseudo-)random characters; non-deterministic
- Replacement/substitution: A fixed character pattern (usually a single character) replaces sensitive information; e.g., phone # may become: (xxx) xxx-xxxx
- Hashing: MD5 and the NIST standard SHA families; deterministic with the same salt; non-reversible
- Encryption: NIST standard (AES and derivatives); block-oriented; deterministic and reversible with the same key
- Format-preserving Encryption: NIST standard under consideration; field-oriented; retains field character; deterministic; reversible or non-reversible is user-selectable

7. TDP Use Cases

VelociData offers an extremely valuable format-preserving data masking mode. This data security process conforms to the NIST 800-38G specification and allows users to encrypt (reversibly) or mask (irreversibly) data without changing its schema or field specifications (lengths and dictionaries are preserved). This enables downstream applications to run without any changes. Use cases include local targets, private and public clouds, and targets where data cross geographic, company, or regulatory boundaries. A data set containing 10 million records with ten sensitive fields in each record can be secured in seconds using VelociData, rather than in a day using conventional approaches.
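The "seconds versus a day" comparison follows from simple arithmetic on the rates quoted earlier in this paper. The sketch below works it through; the 10 million fields per second figure is the paper's, while 1,000 fields per second is an assumed value taken from the upper end of the "hundreds or thousands" range cited for competing solutions. The function name is hypothetical, and the calculation ignores extract and load time.

```python
def masking_seconds(records: int, fields_per_record: int,
                    fields_per_second: float) -> float:
    """Time to mask every sensitive field at a sustained field rate."""
    return records * fields_per_record / fields_per_second


RECORDS = 10_000_000       # from the example in the text
SENSITIVE_FIELDS = 10      # sensitive fields per record, from the text

appliance = masking_seconds(RECORDS, SENSITIVE_FIELDS, 10_000_000)
conventional = masking_seconds(RECORDS, SENSITIVE_FIELDS, 1_000)  # assumed rate

print(f"appliance:    {appliance:.1f} seconds")          # 10.0 seconds
print(f"conventional: {conventional / 3600:.1f} hours")  # 27.8 hours
```

At the lower end of the quoted range (hundreds of fields per second), the conventional figure grows by another order of magnitude.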
Figure 5: Schematic for Creating Secure Model Copies. (Diagram labels: Mainframe Data Sources: IMS, DB2, VSAM; Non-Mainframe Data Sources: RDBMS, Log Files, CSV Files; Sensitive Data; Data Center; Regulatory, Company or Geographic Boundary; Masked Data (Model Copy); Application Test Environment; QA Database; POC / Test; Development Database.)

7.1 Use Case 1: Creating a secure, HIPAA-compliant full production dataset from Microsoft SQL Server

A large health benefits provider needs to create a model copy of a full production dataset for access by its developers. All 18 PHI data field types need to be de-identified for HIPAA/HITECH audit compliance. The production data amount to about 400 GB loaded into Microsoft SQL Server. Following the outline of Figure 5, a workflow is established that:

1. extracts data out of SQL Server;
2. secures the data through the VelociData appliance using format-preserving masking (to ensure data integrity and application usability); and
3. performs a bulk load of the model data into a development set of tables.

As an example, one of the tables contains 1 million records, each comprising 34 fields. For HIPAA Final Rule compliance, 14 of the fields in each record need to be de-identified (totaling 14 million fields). The dataset included a number of different field types (names, SSNs, ...) requiring the following dictionaries:

- Names
- Numbers
- Dates
- Numerics
- hex_uppercase
- hex_lowercase
- alphanumerics
- alphanumeric_uppercase
- alphanumeric_lowercase
- printable characters
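The three-step extract, mask, and bulk-load workflow described above can be sketched in a few lines. In this illustration an in-memory SQLite database stands in for SQL Server, a keyed-hash placeholder stands in for the appliance's format-preserving masking, and the table names, column names, key, and helper functions are all hypothetical.

```python
import hashlib
import hmac
import sqlite3
import string

KEY = b"demo-key-not-for-production"  # hypothetical key


def mask_text(value: str, key: bytes = KEY) -> str:
    """Placeholder mask: digits stay digits, letters keep their case, rest unchanged."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        else:
            out.append(ch)
    return "".join(out)


def build_model_copy(conn, source, target, sensitive_columns):
    """Extract rows, mask the sensitive columns, bulk load into the target table."""
    cur = conn.execute(f"SELECT * FROM {source}")          # 1. extract
    columns = [d[0] for d in cur.description]
    masked = [
        tuple(mask_text(v) if c in sensitive_columns else v
              for c, v in zip(columns, row))                # 2. mask selectively
        for row in cur
    ]
    slots = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {target} VALUES ({slots})", masked)  # 3. load
    conn.commit()
    return len(masked)


conn = sqlite3.connect(":memory:")
schema = "(member_id INTEGER, name TEXT, ssn TEXT, plan TEXT)"
conn.execute(f"CREATE TABLE members {schema}")
conn.execute(f"CREATE TABLE members_dev {schema}")  # same schema as production
conn.executemany("INSERT INTO members VALUES (?, ?, ?, ?)",
                 [(1, "John Smith", "123-45-6789", "GOLD"),
                  (2, "Mary Jones", "987-65-4321", "SILVER")])

loaded = build_model_copy(conn, "members", "members_dev", {"name", "ssn"})
dev_rows = conn.execute("SELECT * FROM members_dev ORDER BY member_id").fetchall()
print(loaded, dev_rows)
```

The development table uses the production schema unchanged, and only the columns named as sensitive are altered; non-sensitive columns pass through untouched.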
The overall processing time for the one-million-record table, including all database queries, masking operations, and insertion into the resulting database, was just over one minute (65 seconds), with the database insert being the longest-running step.

7.2 Use Case 2: Secure data for insertion into an Azure cloud

A retail company must de-identify PII data in records it needs to share with its business partners. This sensitive data contains names, addresses, phone numbers, and other personally identifying data. The retailer wants to put the data into a hosted environment but cannot let unprotected data leave its firewall. For this reason it has chosen to use VelociData to de-identify the data in its datacenter before the data leave for the cloud. The data contain a large volume of daily transactions. The business associates require the freshest data to assess the immediate results of campaigns, implement changes for agile app development, and prepare model reports.

Figure 6: Schematic for Securing Data to a Cloud Datastore. (Diagram label: Corporate Firewall.)

As shown in Figure 6, data move through the VelociData appliance, which de-identifies the PII found within the data flow. These records are then allowed to move to the cloud-based storage for access by business associates of the retailer. Since no sensitive data remain, there is no risk to the company or the individuals should unauthorized access be gained or a data breach occur.
7.3 Use Case 3: Securing test data for off-shore developers

A major Telco would like to move production data to India to leverage faster, round-the-clock development and lower costs. To remove audit deficiencies, it would like to generate a model copy of the data to send off-shore. While de-identification removes the risk of leaking sensitive customer and corporate data, the developers require access to a dataset that closely mirrors fresh production data in characteristics such as volume, distribution, and relation. The dataset comprises 30 million records, each with 12 fields that need to be de-identified. VelociData can provide a fresh test dataset for the off-shore partners in a minute, where the alternate solution takes almost a week before the data are available in test... by then, the developers have a new application built to be tested!

7.4 Use Case 4: Creating daily datasets for Development, QA, and Test Integration

A large financial institution needs to provide model datasets with de-identified data to different parts of the development process. While all data need to be fully de-identified for every user, not all data need to go to all groups; as an example, Web Development may not need a field relating to fraud, but Test Integration may need it to complete processing. The VelociData Test Data Protection solution can route different dataset builds to different end users. Routing is fast and efficient and provides the right data, in the right form, to the right individuals. Only the data that are needed arrive at each location, saving on storage and maintenance of terabytes of needlessly replicated data.

8. Summary

The VelociData appliance offers an easy-to-deploy, easy-to-use solution for test data protection. The system does not require any coding for integration and operation in the existing software and database environment. Rather, it operates as a simple network resource for automatically masking sensitive data at wire speed.
The appliance can communicate with all kinds of systems, including mainframes, commodity servers, and cloud services, and can work with relational data, flat files, logs, and XML data. It requires no additional software or hardware to operate. The VelociData Test Data Protection solution reduces regulatory exposure and hacker risk, and it improves software testing speed and agility.
9. Let Us Help You

For reducing hacker risk and regulatory exposure in test data protection, VelociData offers the fastest time to safety. If you are using custom coding or packaged software for test data protection, VelociData would like to show you how our unique appliance-based solution can significantly reduce your cost and increase the speed of your test data protection workflow. If you are testing software with sensitive data unprotected, you are taking a huge risk and should consider adopting some remedy immediately, either VelociData's or some other. We would like to show you how quickly you can make this problem go away. Please contact us at info@velocidata.com to see what we can do for you.

Author: Ron Indeck

Ron Indeck is the President & CTO of VelociData and has over 25 years of industry and academic experience, most recently as a founder and CTO of Exegy. He was a professor at Washington University in St. Louis, where he was the Das Family Distinguished Professor and Director of the Center for Security Technologies. Among his distinguished professional affiliations, Dr. Indeck was also the President of the Institute of Electrical and Electronics Engineers (IEEE) Magnetics Society. Dr. Indeck has been named the Bar Association Inventor of the Year.