Anonymizing Unstructured Data to Enable Healthcare Analytics
Chris Wright, Vice President Marketing, Privacy Analytics

Privacy Analytics - Overview
For organizations that want to safeguard and enable their personal information for secondary use
- Purpose-built software that automates the de-identification and masking of data, using a risk-based approach to anonymize personal information in compliance with HIPAA requirements
- Integrated capabilities to anonymize structured and unstructured data from multiple sources
- Peer-reviewed methodologies and value-added services that certify data for secondary use

Secondary Use for Healthcare Data
Definition
Secondary use of health data applies personal health information (PHI) for uses outside of direct health care delivery. It includes such activities as analysis, research, quality and safety measurement, public health, payment, provider certification or accreditation, marketing, and other business applications, including strictly commercial activities. [1]
[1] Definition sourced from the white paper "Toward a National Framework for the Secondary Use of Health Data: An American Medical Informatics Association White Paper," J Am Med Inform Assoc 2007;14:1-9, doi:10.1197/jamia.m2273

The Proliferation of Unstructured Data
According to IBM, Ovum and other researchers, 80-90 percent of all medical data today is unstructured, and that volume is doubling every five years. [1]
- Electronic health records, where personal information resides in XML as free-form text and needs to be anonymized for analysis
- Medical devices, where unstructured data or free-form text from machine dumps (e.g., x-ray machines or CAT scans) is sent to one or more databases for analysis
- Online forums, where patients or providers discuss their conditions or cases, requiring anonymization to facilitate sentiment analysis and other forms of information analysis
[1] http://ovum.com/2012/05/11/unlocking-the-potential-of-unstructured-medical-data/
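To make the electronic health record case concrete, here is a minimal Python sketch of masking identifiers in free-form text held inside an XML record. It is an illustration only, not how PARAT works: the element name `note`, the placeholder tags, and the three regular expressions are all assumptions chosen for the example. Pattern matching of this kind catches only rigidly formatted identifiers; names, addresses and other free-form identifiers need much more than regular expressions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns for a few rigidly formatted identifiers (illustrative only).
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),    # e.g. 2012-05-11
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),   # e.g. 613-555-0100
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # e.g. 123-45-6789
}


def mask_text(text):
    """Replace every pattern match with a bracketed placeholder such as [DATE]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def mask_record(xml_string, free_text_tag="note"):
    """Mask the text of every <free_text_tag> element; other elements are untouched."""
    root = ET.fromstring(xml_string)
    for element in root.iter(free_text_tag):
        if element.text:
            element.text = mask_text(element.text)
    return ET.tostring(root, encoding="unicode")


# Made-up record: the <note> element stands in for a free-form clinical note.
record = "<record><id>17</id><note>Seen on 2012-05-11; call 613-555-0100.</note></record>"
masked = mask_record(record)
print(masked)
```

Only the free-text element is rewritten, so structured fields such as `id` remain available for analysis.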

PARAT Software
Providing organizations with a scalable set of capabilities to automate the anonymization of structured and unstructured data
- Automate masking, de-identification and measurement of re-identification risk
- Configure anonymization depending on the sensitivity of the data
- Maintain data consistency by matching structured values to corresponding unstructured data
- Measure the overall quality of anonymized data to ensure that the re-identification risk is very small and the analytic value is high
Stronger Safeguards. Richer Analysis. Integrated Solution.
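As a rough illustration of what measuring re-identification risk can mean, the Python sketch below uses a common textbook measure: each record's risk is taken as 1 divided by the number of records sharing its quasi-identifier values. This is an assumption made for illustration and is not presented as PARAT's methodology; the field names (`age`, `sex`, `region`), the helper functions and the sample records are all hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (maximum risk, average risk) for a non-empty list of records.

    A record's risk is 1 / (number of records with the same quasi-identifier values).
    """
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    class_sizes = Counter(keys)
    risks = [1 / class_sizes[key] for key in keys]
    return max(risks), sum(risks) / len(risks)


def generalize_age(record, width=10):
    """Return a copy of the record with the exact age replaced by an age band."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Made-up records for illustration.
patients = [
    {"age": 34, "sex": "F", "region": "East"},
    {"age": 36, "sex": "F", "region": "East"},
    {"age": 52, "sex": "M", "region": "West"},
    {"age": 57, "sex": "M", "region": "West"},
]
fields = ["age", "sex", "region"]

raw_max, raw_avg = reidentification_risk(patients, fields)
generalized = [generalize_age(p) for p in patients]
gen_max, gen_avg = reidentification_risk(generalized, fields)

print(raw_max, gen_max)
```

With exact ages every record is unique, so the maximum risk is 1.0. After ages are grouped into ten-year bands, each record shares its values with one other record and the maximum risk falls to 0.5, at the cost of coarser age data. That trade-off between risk and analytic value is what a configurable de-identification step has to manage.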

How We Anonymize Unstructured Data

PARAT: Before De-identification

PARAT: Discovery and Annotate

PARAT: Discovery and Annotate (continued)

PARAT: After De-identification

Side-by-Side Comparison: Data Utility Achieved

Balancing Privacy and Utility for Secondary Use
1. Data Quality: Ensuring de-identified data has analytic usefulness by determining the relative risk associated with its disclosure, sharing and re-sale
2. Analytic Granularity: Allowing users to configure de-identification for patient-level data without compromising privacy or risking costly breaches
3. Depth of Insight: Enabling analysis of the total patient health experience by compiling a complete picture of this experience from multiple data sources and types

PARAT: National Institutes of Health
Challenge
- Wants to anonymize unstructured text data from more than 400,000 patients
- Seeks to augment currently available data in de-identified format
Solution
- PARAT Text
- PARAT Text is a standalone module for PARAT
Why Privacy Analytics
De-identified unstructured data would allow researchers to:
1. Test hypotheses for new research
2. Confirm potential sample sizes for proposed research
3. Find collaborators for cross-disciplinary research studies
Customer Profile
The National Institutes of Health (NIH), a part of the U.S. Department of Health and Human Services, is the nation's medical research agency, making important discoveries that improve health and save lives.

Learn More
Drop on by Booth 13