Data Privacy and Biomedicine Syllabus

Course: Data Privacy in Biomedicine (BMIF-380 / CS-396)
Instructor: Bradley Malin, Ph.D. (b.malin@vanderbilt.edu)
Semester: Spring 2015
Time: Mondays & Wednesdays, 3:10-4:25pm
Location: Featheringill Hall, Room 313
Website: http://www.hiplab.org/courses/bmif380/
Office Hours: By Appointment

DESCRIPTION

The integration of information technology into biomedical environments has enabled unprecedented advances in the collection, storage, analysis, and rapid dissemination of patient-specific data to physicians and researchers. Given the potential wealth of such detailed data for further advances in healthcare, many organizations associated with the healthcare domain share, or anticipate sharing, their collections for various purposes related to quality assurance, public health, and research.

However, in the face of today's complex networked environments, many organizations find it increasingly difficult to share biomedical data due to concerns about patient privacy. For instance, how can we share patient-specific data without revealing the identity of the patient? Security practices, such as role-based access control and encrypted communications, ensure authentication and secure communications, but they do not necessarily stem the leakage of inferences from the data after it has been accessed or transmitted. Thus, this course is concerned with the analysis and protection of data privacy, with a focus on the idiosyncrasies and regulatory framework associated with biomedical information. The goal of this course is to introduce students to the computational challenges, as well as formal privacy protection solutions, for data privacy in healthcare and biological research environments.
The topology of data privacy is a highly interdisciplinary landscape, and material in this course will touch on issues and methodologies from bioinformatics, cryptography, data mining, databases, distributed systems, law, machine learning, medical informatics, policy, and statistics.

OBJECTIVES

After this course, students will be able to analyze data privacy issues from three non-exclusive perspectives:

1. Data Detectives: Oftentimes data is shared with false beliefs about privacy and data protection. From this perspective, students will learn how seemingly private information can be learned using automated strategies.

2. Data Protectors: Students will learn how to construct privacy protection technologies that provide formal computational guarantees of privacy in data collection and sharing.

3. Technology Policy Designers: Computational models provide a basis for protection, but in order to implement such technology in the real world, it must support, and not circumvent, existing policy specification. From this perspective, students will learn how to develop privacy protection solutions that complement policy regulations.

PREREQUISITES

Required: Students are expected to have proficiency in designing and writing software programs. There is no programming language requirement for this class, though experience with object orientation is beneficial.

Recommended: Students should be comfortable with learning about basic statistics, data structures, and algorithm analysis. When appropriate, quantitative and computational methodology will be reviewed.

Knowledge of, and prior experience with, security principles is NOT a prerequisite for this course.

GRADING

Criteria                                          Percent of Grade
Project                                           50%
  Initial Proposal (due March 16)                 (5%)
  Status Report (due March 31)                    (15%)
  Final Report & Presentation (due April 27)      (30%)
Homework Assignments (3 assignments, 10% each)    30%
Reading Summaries                                 10%
Class Participation                               10%
Total                                             100%

Required Reading Assignments: There is no primary textbook for this course. Reading assignments will be selected from various periodicals. Students will be required to read and submit brief summaries of assigned readings. Your summaries should be no longer than one page in length. Readings will be made available online or as in-class handouts at least one lecture before they are due.

Your summaries will be graded on a {check-minus, check, check-plus} scale.

Check-minus (or 1 point): You skimmed the assigned reading and barely understood, or summarized, its meaning and implications.
Check (or 2 points): You demonstrated that you read the material by providing a reasonable account of its contents, its strengths, and its weaknesses.
Check-plus (or 3 points): You provided a critical assessment of the reading and showed insight regarding the reading's topic.

These summaries constitute a total of 10% of your final grade. An average score of a check (i.e., 2 points) will provide the student with the full 10%.
An average score greater than a check (i.e., greater than 2 points) will entitle the student to extra credit, with a maximum of 5 additional percentage points on their final grade. You must email your summaries to b.malin@vanderbilt.edu before the beginning of class.

Project: In lieu of a final exam, each student must complete an independent project on a data privacy issue in biomedicine. Projects should investigate a topic of interest to the student and must demonstrate analysis and critical thinking in data privacy. The project will require a significant commitment and will contribute a substantial part of the final grade. A list of sample project topics will be made available and reviewed in class.

Honesty Policy: From the Vanderbilt Student Handbook, HONESTY is a commitment to refrain from lying, cheating, and stealing. Recognizing that dishonesty undermines community trust, stifles the spirit of scholarship, and threatens a safe environment, we expect ourselves to be truthful in academic endeavors, in relationships with others, and in pursuit of personal development. You are permitted, and encouraged, to discuss homework assignments with other students. However, you must do your own work and submit your own solutions.

TOPIC AND SCHEDULE OVERVIEW (Tentative and Subject to Change)

Part 1 (Jan 5 and 7): Course Overview, a Little Policy, and a Whole Lot of Data

In the first class, we'll go over ground rules for the course and review the syllabus. If time permits, we'll begin a discussion on what data privacy is (and is not). We'll investigate how it relates to data security principles, such as authorization, access control, and authentication. In the second (though first full) class, we will survey various ideologies and legal and policy precedents for privacy in modern healthcare environments and society. Who collects medical information, and when do patients have control over their privacy? Can policy and specification of privacy protections be automated?

Part 2 (Jan 12 and 14): De-identification & Re-identification

This week we will look at how seemingly de-identified medical information can be re-identified to the individuals from whom it was collected.
In the process, students will learn statistical techniques for characterizing uniqueness in data, at both elemental and population levels of granularity. We'll look at how personal information is available in many different resources, both on- and offline. Where is this information? How do we automatically capture and organize it for privacy assessments? We will look at various information repositories, such as vital records and statistics (including birth records, death records, marriage records, and court documents) and the Social Security Death Index. We will also review how systems are set up to track people through identifiers and traceable elements. As an example, we will discuss the potential of unique numbers for persistent patient identifiers, the history of the Social Security Number in the United States, and how recent policy changes in the United States influence these opportunities.

Part 3 (Jan 19, 21, and 26): Record Linkage

There is no class on Jan 19 (Martin Luther King Day). This week we will investigate concepts and methodology associated with the linkage of data in disparate databases. Methods will be drawn from a deterministic perspective.
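
For orientation, deterministic linkage can be sketched as an exact-match join on a set of shared fields. The sketch below is only an illustration under stated assumptions: the field names (`zip`, `birth_date`, `sex`), the toy records, and the helper names `normalize` and `deterministic_link` are all hypothetical and are not drawn from the course materials.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and strip whitespace so trivially different spellings compare equal."""
    return str(value).strip().lower()


def deterministic_link(records_a, records_b, keys):
    """Return pairs (a, b) whose normalized values agree exactly on every field in keys."""
    # Index the second table by its key tuple so each lookup is a dictionary access.
    index = defaultdict(list)
    for rec in records_b:
        index[tuple(normalize(rec[k]) for k in keys)].append(rec)

    matches = []
    for rec in records_a:
        key = tuple(normalize(rec[k]) for k in keys)
        for other in index.get(key, []):
            matches.append((rec, other))
    return matches


# Hypothetical toy data: a clinical table without names and a named public roster.
clinical = [
    {"zip": "37203", "birth_date": "1970-01-15", "sex": "F", "diagnosis": "asthma"},
    {"zip": "37212", "birth_date": "1985-06-02", "sex": "M", "diagnosis": "diabetes"},
]
roster = [
    {"name": "Alice Example", "zip": "37203", "birth_date": "1970-01-15", "sex": "f"},
    {"name": "Bob Example", "zip": "37205", "birth_date": "1985-06-02", "sex": "M"},
]

links = deterministic_link(clinical, roster, keys=["zip", "birth_date", "sex"])
for clinical_rec, roster_rec in links:
    print(roster_rec["name"], "->", clinical_rec["diagnosis"])
```

Only the first clinical record links, because the second differs from every roster entry in ZIP code. Exact matching of this kind is brittle in the presence of typos or missing values, which is one motivation for probabilistic approaches.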
We will also discuss how linkage methods can be automated and their application within electronic medical record systems. The second part of this section will be dedicated to more sophisticated strategies of record linkage. It will move beyond basic deterministic methods and will look at more formal probabilistic strategies, with a particular emphasis on expectation-maximization (EM) frameworks.

Part 4 (Jan 28, Feb 2, 4, and 9): Access Control Models and Auditing

This section of the course will look into how access control frameworks, and particularly roles, can be defined. We will look into formal models of access specification, how to design roles in a manner that meets organizational needs, and algorithms for automatically constructing role hierarchies through simple data mining strategies. While access control provides a framework for specifying who is entitled to access what information and when, there are many situations in which access control cannot be sufficiently specified or must be circumvented to enable timely care of patients. In this section of the course, we will also look at how information in the access logs of electronic medical records, and in the records of the patients themselves, can be mined for auditing purposes. Portions of this section of the course will be taught by Dr. You Chen.

Part 5 (Feb 11, 16, 18, 23): Anonymization

In this part of the course, we will begin to turn the tables and shift from re-identification to anonymization frameworks, which can be designed to explicitly prevent such attacks. We will begin with formal models of anonymity, in which guarantees are provided on the extent to which data can be exploited for linkage and re-identification purposes. In particular, students will learn about the k-map and k-anonymity models, as well as algorithmic approaches to transform biomedical data to satisfy a formal anonymity model.
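
As a rough illustration of the k-anonymity model named above: a table is k-anonymous with respect to a chosen set of quasi-identifying attributes when every combination of their values is shared by at least k records. The sketch below checks that property and applies one simple generalization (truncating ZIP codes). The column names, the toy rows, and the helper names are hypothetical assumptions made for this example, not the algorithms covered in class.

```python
from collections import Counter


def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k rows."""
    sizes = equivalence_class_sizes(rows, quasi_identifiers)
    return all(count >= k for count in sizes.values())


def generalize_zip(row, digits_kept):
    """Return a copy of row whose ZIP code keeps only its leading digits."""
    out = dict(row)
    zip_code = row["zip"]
    out["zip"] = zip_code[:digits_kept] + "*" * (len(zip_code) - digits_kept)
    return out


# Hypothetical toy table; "diagnosis" is the sensitive attribute, not a quasi-identifier.
rows = [
    {"zip": "37203", "age_range": "30-39", "diagnosis": "asthma"},
    {"zip": "37205", "age_range": "30-39", "diagnosis": "flu"},
    {"zip": "37212", "age_range": "40-49", "diagnosis": "diabetes"},
    {"zip": "37215", "age_range": "40-49", "diagnosis": "asthma"},
]
quasi_identifiers = ["zip", "age_range"]

before = is_k_anonymous(rows, quasi_identifiers, k=2)
generalized = [generalize_zip(row, digits_kept=3) for row in rows]
after = is_k_anonymous(generalized, quasi_identifiers, k=2)
print(before, after)
```

In the raw table every ZIP code is unique, so the check fails for k=2. After truncation to three digits, each (ZIP, age range) combination covers two rows and the check passes, at the cost of coarser geographic detail.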
We'll explore heuristic and approximation algorithms to achieve efficient anonymization. Students will also be exposed to graph-based methods for modeling the re-identification and anonymization problem. These approaches will be presented in the context of multiple types of data encountered in the biomedical realm, including relational and set-valued data types. Portions of this section of the course will be taught by Dr. Raymond Heatherly.

Part 6 (Feb 25 and March 9): Natural Language Scrubbing

In this section of the course, we'll focus on de-identification in the context of free text (e.g., doctor's notes, laboratory reports, discharge reports, and more). How can we de-identify text information? Can we ever achieve anonymized text? This section of the course will review various data-intensive methods for processing natural language to detect and remove, or scrub, personal identifiers from clinical text.

Part 7 (March 10): Ethical Reasoning

This part considers some of the ethical issues in the application of data privacy technologies. Simply because you can build a re-identification technology doesn't
mean that you should use it, does it? We'll investigate ethical reasoning and governance models for dual-use technologies.

Part 8 (March 16, 18, 23, and 30): High-Dimensionality Data and Beyond Anonymization

In the first lecture of the week, students will learn about the homogeneity attack against anonymization algorithms, also known as attribute disclosure. We will then explore frameworks and algorithms to mitigate this attack in the context of health information sharing for various secondary use cases. In the second part of this week, we'll explore how higher-dimensional data can be exploited in various settings to perform attribute disclosure. We will particularly focus on:

- High-throughput technologies that are becoming ingrained in the clinical environment, and the collection and sharing of biological information, such as DNA data. We will investigate ways in which patient identity in genomic data is protected, how it is re-identified, and how it can be formally protected.

- Geospatial technologies that are becoming standard practice in public health and epidemiology settings. These problems require geographic information regarding the presence of clinically interesting cases to detect potential outbreaks and bioterrorist activities. However, the sharing of geographic and spatiotemporal information may lead to re-identification. We will investigate various approaches by which such information may be protected during data sharing.

- Social networks that are becoming popular settings for integrating disease-based communities, interacting with support groups, and performing population-based pharmacoepidemiology studies.

Project Status Report Presentations (March 25; note this is in the middle of Part 8)

This day will be dedicated to student projects. Students will write a short summary of their problem statement and initial research design, and make a short presentation on the status of their projects for an in-class evaluation.

Part 9 (April 1, 3, and 8): Image and Mobile Privacy

This portion of the course will look into images and video streams, which are increasingly used for monitoring and surveillance in health care environments, such as managed care facilities. We will investigate several procedures and principles for removing personally identifying features (e.g., an individual's face) from video streams. We will also investigate how images (e.g., a picture of a face), which are a special case of video streams, can be protected using formal models of anonymity.

Part 10 (April 13 and 15): Privacy Preserving Data Mining

This section of the course will move beyond traditional de-identification and anonymization models. First, we will look into basic variations of secure multiparty computation (SMC). The traditional application of cryptography is framed from a two-party viewpoint in which two participants, Alice and Bob, exchange information, such as
a patient's medical record, over an unsecured channel. An extension to the traditional model is secure multiparty computation, which is concerned with the interaction of two or more participants that need to exchange information to construct a result without revealing private information. Next, we will go into further depth regarding how such protocols can be adapted to support record linkage frameworks without revealing the identities of the corresponding patients.

Week 14 (April 20): Student Final Presentations

The final lecture will be dedicated to students' presentations on their final projects.