PREDICT: A Data Repository for Cyber Security Research

Charlotte Scheper, RTI International
Manish Karir, DHS S&T

RTI International is a trade name of Research Triangle Institute. www.rti.org

What is PREDICT?

A Protected REpository for Defense of Infrastructure against Cyber Threats (PREDICT)
- A research data repository sponsored by DHS/S&T
- A trusted framework for sharing data for cyber security research
- A program for advancing tools, methods, and policies for collecting and sharing security-related Internet data

In operation since 2008, PREDICT has shared real-world datasets for cyber security research to advance the state of the art in network security research and development.

Objectives

Rationale
- Researchers had insufficient access to data and were unable to adequately test their research prototypes
- Government technology decision-makers and researchers need data to evaluate competing products
- Legal and ethical policies for Internet research are unclear
- The scientific method, which depends on repeatable tests and evaluations, needed support

Project Impetus
- National Strategy to Secure Cyberspace (February 2003) and 2009 Cyberspace Policy Review
- Expanding Public Access to the Results of Federally Funded Research; see http://m.whitehouse.gov/blog/2013/02/22/expanding-public-access-results-federally-funded-research

PREDICT Cyber Security Datasets: Collect, Archive, Share

PREDICT is the only freely available, legally collected repository of large-scale datasets containing real network traffic and system logs.

Key Issues Addressed

- Providing secure, centralized access to multiple sources of security-related data
- Assuring confidentiality
  - Privacy of individuals
  - Proprietary information
  - Security of the networks from which the data are collected
- Establishing a legal structure to reduce legal risks
- Assuring data integrity
  - Protect access to data
  - Ensure proper use of data

PREDICT Data Sharing Model

- Sensitivity assessments for inclusion and access
- Legally binding terms and conditions for data use
- Protocols for repository operations
- Data requests subject to expert review and approval
- Centralized view of and portal into the repository, and management of repository processes, through a Data Coordinating Center

Privacy Outreach Activities

- Briefed privacy advocates and obtained input
  - ACLU, Electronic Frontier Foundation (EFF), Center for Democracy and Technology (CDT), EPIC (invited)
- Prepared Privacy Impact Assessment (PIA)
  - Worked with DHS Privacy Office
- Briefed government officials, privacy advocates, participants
  - DHS S&T General Counsel
  - DHS General Counsel
  - Department of Justice

PREDICT Repository Framework

Distributed Repository
- Multiple data providers collect and prepare data for sharing
- Multiple data hosts provide computing infrastructure to store datasets and provide access
- Central coordinating center provides portal and manages repository processes

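The division of roles in the distributed framework above can be pictured as a small catalog: the coordinating center keeps a record of which provider prepared each dataset and which host serves it, while the data itself stays with the host. The following is only a minimal sketch under that reading. The class names, fields, and the example dataset, provider, and host names are hypothetical and do not describe PREDICT's actual software.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DatasetRecord:
    """Catalog entry held by the coordinating center (metadata only)."""
    name: str       # hypothetical dataset identifier
    category: str   # e.g. one of the repository's data categories
    provider: str   # organization that collected and prepared the data
    host: str       # organization that stores the data and serves access


@dataclass
class CoordinatingCenter:
    """Hypothetical central catalog; it never stores the datasets themselves."""
    catalog: Dict[str, DatasetRecord] = field(default_factory=dict)

    def register(self, record: DatasetRecord) -> None:
        # A provider/host pair announces a dataset to the central portal.
        self.catalog[record.name] = record

    def locate(self, name: str) -> str:
        # The portal tells an approved researcher which host serves the data.
        return self.catalog[name].host

    def by_category(self, category: str) -> List[str]:
        # Centralized view: list dataset names in a category across all hosts.
        return sorted(r.name for r in self.catalog.values()
                      if r.category == category)


if __name__ == "__main__":
    center = CoordinatingCenter()
    center.register(DatasetRecord("example-bgp-1", "BGP Routing Data",
                                  "provider-a", "host-a"))
    center.register(DatasetRecord("example-dns-1", "DNS Data",
                                  "provider-b", "host-a"))
    print(center.locate("example-bgp-1"))
    print(center.by_category("DNS Data"))
```

Keeping only metadata in the central catalog mirrors the slide's separation between hosts, which store datasets, and the coordinating center, which provides the portal.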
Operational Context with No Data Repository

Operational Context with PREDICT

Current Repository Holdings

Current Data Categories
- Address Space Allocation Data
- Border Gateway Protocol (BGP) Routing Data
- Blackhole Address Space Data
- Domain Name System (DNS) Data
- Intrusion Detection System (IDS) and Firewall Data
- Infrastructure Data
- Internet Topology Data
- Internet Protocol (IP) Packet Headers
- Performance and Quality Measurements
- Sinkhole Data
- Synthetically Generated Datasets
- Traffic Flow Data
- Unsolicited Bulk Email Data

407 Datasets
- Collection periods vary from hours to days to months
- Sizes vary from bytes to terabytes

Research groups that have used PREDICT
- 97 academic institutions
- 88 commercial entities
- 37 Government organizations
- 3 Foreign
- 11 non-profit organizations

Current Data Host/Providers

- UCSD/CAIDA: Topology Measurements, Network Telescope (45.4 TB)
- USC - ISI: NetFlow, Internet Topology Data, Address Allocation (42 TB)
- Colorado State University: NetFlow, Spam logs, IP Reputation lists (90 TB)
- University of Michigan/Merit Networks: NetFlow, BGP Routing, Dark Address Space Monitoring, BGP Beacon Routing (188.7 TB)
- Georgia Tech: Botnet Sinkhole Connection (0.01 TB)
- University of Wisconsin: Global Intrusion Detection Database (2.5 TB)
- Packet Clearing House: BGP Routing, VoIP Measurement, Synthetic (8.0 TB)

TOTAL = 596+ TB

Additional Research Activities

Data collection, storage, and access
- Advance the state of the art in data collection techniques, packet formats, new data types, data cataloging/annotation, cross-dataset analysis
- Develop systems for storage and processing of large volumes of data
- Continue technical and policy work on disclosure control for Internet traffic data
- Add data access methods such as VMs/Virtual Enclaves

Advancement of tools and techniques to analyze Internet datasets to extract and represent useful information
- Center for Configuration Analytics and Automation (UNCC) project
- RTI International IR&D project

Investigate and highlight legal and ethical issues in Internet data collection and analysis
- Menlo Report on Ethical Principles Guiding Information and Communication Technology Research

More than 200 research papers, journal articles, and technical reports have used PREDICT datasets within the past 3 years

Improvements on the Way

Add classes of data
- Unrestricted
- Quasi-restricted
- Restricted

Streamline processes
- Cleaner account request process
- Click-through agreements for unrestricted and quasi-restricted classes

Expand international cooperation framework
- Japan (complete), Canada (close), Australia (started)

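The three data classes and the click-through agreements listed above suggest a simple routing rule for dataset requests. The sketch below is illustrative only: the enum, the function name, and the assumption that restricted data remains under the expert review and approval described in the data sharing model are hypothetical, not a description of how PREDICT implements access.

```python
from enum import Enum


class DataClass(Enum):
    UNRESTRICTED = "unrestricted"
    QUASI_RESTRICTED = "quasi-restricted"
    RESTRICTED = "restricted"


# Classes the slide names as candidates for click-through agreements.
CLICK_THROUGH = {DataClass.UNRESTRICTED, DataClass.QUASI_RESTRICTED}


def approval_path(data_class: DataClass) -> str:
    """Return the (hypothetical) approval path for a dataset request."""
    if data_class in CLICK_THROUGH:
        return "click-through agreement"
    # Assumption: restricted data stays under expert review and approval.
    return "expert review and approval"


if __name__ == "__main__":
    for cls in DataClass:
        print(f"{cls.value}: {approval_path(cls)}")
```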
Summary

- PREDICT is addressing an acknowledged need by providing large-scale, real-world security-related datasets for cyber security research
- PREDICT is addressing the significant policy and legal issues in collecting and sharing security-related data
- PREDICT is helping to achieve DHS's goal of improving the quality of defensive cyber security technologies

PREDICT Information

https://www.predict.org

DHS Privacy Impact Assessment: http://www.dhs.gov/xlibrary/assets/privacy/privacy_pia_st_predict.pdf

Contact Information

Charlotte Scheper
Director, PREDICT Coordinating Center
RTI International, USA
cscheper@rti.org
919-485-5587

Manish Karir
Program Manager, Cyber Security Division
DHS S&T, USA
Manish.Karir@hq.dhs.gov
202-407-0690