Obfuscation of sensitive data in network flows [1]

D. Riboni (2), A. Villani (1), D. Vitali (1), C. Bettini (2), L.V. Mancini (1)
(1) Dipartimento di Informatica, Università di Roma, Sapienza. E-mail: {villani, vitali, mancini}@di.uniroma1.it
(2) Dipartimento di Informatica e Comunicazione, Università degli Studi di Milano. E-mail: {daniele.riboni,claudio.bettini}@unimi.it
20 January 2012

[1] InfoCom 2012, the 31st Annual IEEE International Conference on Computer Communications (to appear)

Table of contents Internet Infrastructure and Data set definition

Internet Actors
- IP prefix (or network prefix): a representation of a set of IP addresses, e.g. 192.168.1.0/24;
- Autonomous System (AS): a collection of connected Internet Protocol routing prefixes under the control of one or more network operators;
- Internet Service Provider (ISP): a company that provides access to the Internet;
- Internet Exchange Point (IXP): a physical infrastructure through which Internet Service Providers exchange Internet traffic between their networks.
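As a minimal sketch of how a prefix stands for a set of addresses, Python's standard `ipaddress` module can expand the example prefix 192.168.1.0/24. The two test addresses are illustrative assumptions, not taken from the slides.

```python
import ipaddress

# The example prefix from the slide: a /24 fixes the first 24 bits.
prefix = ipaddress.ip_network("192.168.1.0/24")

# Number of addresses covered by the prefix: 2 ** (32 - 24).
num_addresses = prefix.num_addresses

# Membership tests: one address inside the prefix, one outside it.
inside = ipaddress.ip_address("192.168.1.77") in prefix
outside = ipaddress.ip_address("192.168.2.1") in prefix

print(num_addresses, inside, outside)
```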

Internet Infrastructure: Border Gateway Protocol (BGP)
Hierarchical infrastructure:
- Tier 1: full mesh network
- Tier 2: national Internet providers
- Tier 3: local Internet Service Providers
- ...
Internet today: about 40,000 autonomous systems and 400,000 IP prefixes

Internet routing protocol: BGP
- AS1 announces IP prefix X;
- AS2 tells AS3: to reach IP prefix X, packets cross AS2, then AS1;
- each topology change causes new updates or prefix withdrawals.
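The announcement chain can be sketched as a toy path-vector propagation. This is a deliberate simplification and not real BGP: there are no routing policies, each AS keeps only the first path it hears, and the function name `propagate_announcement` and the three-AS topology are assumptions made for illustration.

```python
from collections import deque


def propagate_announcement(origin_as, links):
    """Flood an announcement from origin_as over the AS adjacency map.

    Returns, for every AS, the AS path it was told to use (the ASes that
    packets cross, in order). Simplification: an AS keeps the first path it
    hears and ignores later ones, which also prevents loops.
    """
    paths = {origin_as: []}  # the origin reaches its own prefix directly
    queue = deque([origin_as])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, []):
            if neighbor in paths:
                continue  # already has a path, ignore the new announcement
            # The announcing AS prepends itself to the path it advertises.
            paths[neighbor] = [current] + paths[current]
            queue.append(neighbor)
    return paths


# Hypothetical topology: AS1 -- AS2 -- AS3, with AS1 announcing prefix X.
links = {"AS1": ["AS2"], "AS2": ["AS1", "AS3"], "AS3": ["AS2"]}
paths = propagate_announcement("AS1", links)
print(paths["AS3"])
```

In this toy run AS3 learns the path AS2, AS1, which mirrors what AS2 tells AS3 on the slide.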

Data set definition: Cisco NetFlow
NetFlow is a network protocol developed by Cisco Systems for collecting IP traffic information:
- real-time collection;
- active and passive timeouts;
- lightweight representation of network traffic;
- highly representative.
NetFlow data can support traffic and attack detection, network monitoring, QoS and other network activities.

Data set fields definition
A network flow has been defined in many ways. The traditional definition uses a 7-tuple key: a flow is a unidirectional sequence of packets that all share the following 7 values:
- Source IP address
- Destination IP address
- Source port for UDP or TCP, 0 for other protocols
- Destination port for UDP or TCP, type and code for ICMP, or 0 for other protocols
- IP protocol
- Ingress interface (SNMP ifindex)
- IP Type of Service
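The 7-tuple definition can be illustrated with a small sketch that groups packets into unidirectional flows. The `FlowKey` type, the `aggregate` helper and the sample packets are assumptions made for illustration; they are not the NetFlow export format.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The 7 values that all packets of one flow share."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    ingress_if: int
    tos: int


def aggregate(packets):
    """Group (key, size) packets by their 7-tuple and count packets and bytes."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for key, size in packets:
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)


# Flows are unidirectional: the reply direction has a different key.
a_to_b = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, 6, 1, 0)
b_to_a = FlowKey("10.0.0.2", "10.0.0.1", 80, 40000, 6, 2, 0)
packets = [(a_to_b, 60), (a_to_b, 1500), (b_to_a, 40)]

flows = aggregate(packets)
print(len(flows), flows[a_to_b])
```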

ExtrABIRE project: network flows probe
A large set (more than 1 year) of network flows gathered from the BGP routers of commercial and institutional Internet Service Providers.
Data set expressiveness: 2 GBytes of full NetFlow entries contain 110 million flows and 2 billion packets, corresponding to 5 TBytes of exchanged data.

The role of network flow data sets in network communities
Logs of network flows are a fundamental tool for modeling network behavior, identifying security attacks, and validating research results.
Security and privacy concerns inhibit the release of network data.
Research experiments and evaluations of proposed algorithms therefore use:
- synthetic data: random network data generated from stochastic distributions often differs from real data;
- old and short data sets: new protocols, new network paradigms and new attack strategies do not appear in these data sets.

Effects of the lack of shared network flows
Dark side effects:
- research results become hard to evaluate;
- research results are inconsistent;
- experiments are not reproducible;
- applying a proposed strategy to real data produces unexpected results;
- ...

Anonymity means "without a name", or "namelessness"; it typically refers to the state of an individual's personal identity, or personally identifiable information, being publicly unknown.
Attacks on anonymized data sets aim at:
- de-anonymizing the data sets;
- inferring private information;
- obtaining useful information about the networks targeted for attack.

Taxonomy

Network flow data set attacks: grouping by precondition [2]

[2] J. King, K. Lakkaraju, and A. J. Slagell, "A taxonomy and adversarial model for attacks against network log anonymization," in Proc. of ACM SAC. ACM, 2009, pp. 1286-1293.

Network flow data set attacks: fingerprint and injection
- Fingerprint: identification is performed by matching the values of flow fields to the characteristics of the target environment, e.g. knowledge of the network topology or of the services of the target hosts;
- Injection: the adversary injects into the network to be logged a sequence of flows that are easily recognized by their specific characteristics, e.g. flows marked with uncommon TCP flags or following particular patterns.

Network flow data set attacks: Web fingerprint
"In this paper we attempt to quantify the risks of publishing anonymized packet traces. [...], we examine whether statistical identification techniques can be used to uncover the identities of users and their surfing activities from anonymized packet traces. Our results show that such techniques can be used by any Web server that is itself present in the packet trace and has sufficient resources to map out and keep track of the content of popular Web sites to obtain information on the network-wide browsing behavior of its clients." [3]

[3] D. Koukis, S. Antonatos, and K.G. Anagnostakis, "On the Privacy Risks of Publishing Anonymized IP Network Traces," in Proceedings of Communications and Multimedia Security.

Previous approaches
Previous approaches encrypt the identity fields (IP addresses) and apply different techniques to the quantitative fields (e.g. TCP flags, traffic statistics):
- permutation
- truncation
- generalization
No formal proof of the obfuscation properties of the proposed solutions is provided!
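For readers unfamiliar with these three techniques, here is a generic sketch of each. It does not reproduce any specific prior tool: the function names, the 24-bit truncation, the port classes and the seeded shuffle are all assumptions made for illustration.

```python
import ipaddress
import random


def truncate_ip(ip, keep_bits=24):
    """Truncation: zero the low-order bits of an IPv4 address."""
    network = ipaddress.ip_network(f"{ip}/{keep_bits}", strict=False)
    return str(network.network_address)


def generalize_port(port):
    """Generalization: replace an exact port with a coarse class."""
    return "well-known" if port < 1024 else "other"


def permute(values, seed=0):
    """Permutation: map each distinct value to another via a random bijection."""
    distinct = sorted(set(values))
    shuffled = distinct[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(distinct, shuffled))


truncated = truncate_ip("129.19.133.199")
port_class = generalize_port(80)
mapping = permute([6, 17, 1, 6])
print(truncated, port_class, mapping)
```

None of these transformations, taken alone, says how many hosts remain compatible with an obfuscated record, which is the gap the slide points out.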

Data anonymity approaches
Definition (fingerprint quasi-identifier, fp-qi): a field of a network flow is a fingerprint quasi-identifier (fp-qi) if its value, possibly combined with external knowledge about the characteristics of the network hosts, can reduce the cardinality of the candidate set for the source or destination IP address of the flow in L* (the obfuscated NetFlow data set).

fp-qi fields in a NetFlow entry
- Source IP address
- Destination IP address
- Source port for UDP or TCP, 0 for other protocols
- Destination port for UDP or TCP, type and code for ICMP, or 0 for other protocols
- IP protocol
- Ingress interface (SNMP ifindex)
- IP Type of Service (flags)

Data anonymity approaches: k-anonymity
k-anonymity: make every record indistinguishable within a group of at least k records, based on the values of its quasi-identifiers (QI).
Example: you try to identify a man in a released data set, and the only information you have is his birth date and gender. If k people in the release match that birth date and gender, you cannot tell which of them is your target. This is k-anonymity.
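A minimal sketch of a k-anonymity check over the birth date and gender example follows. The records, the field names and the `is_k_anonymous` helper are made up for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in >= k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


# Hypothetical released microdata: two equivalence classes of two records each.
records = [
    {"birth_date": "1985-03-02", "gender": "M", "disease": "Flu"},
    {"birth_date": "1985-03-02", "gender": "M", "disease": "Cancer"},
    {"birth_date": "1990-07-15", "gender": "F", "disease": "Flu"},
    {"birth_date": "1990-07-15", "gender": "F", "disease": "Heart Disease"},
]

qi = ["birth_date", "gender"]
two_anonymous = is_k_anonymous(records, qi, 2)
three_anonymous = is_k_anonymous(records, qi, 3)
print(two_anonymous, three_anonymous)
```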

Data anonymity approaches: k-anonymity attacks
k-anonymity does not provide privacy if:
- the sensitive values in an equivalence class lack diversity (homogeneity attack, e.g. Bob, 27 years);
- the attacker has background knowledge.
A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," ACM Transactions on Knowledge Discovery from Data (TKDD).

Data anonymity approaches: l-diversity
Each equivalence class has at least l well-represented sensitive values.
Example: an equivalence class contains ten tuples. In the Disease attribute, one value is Cancer, one is Heart Disease and the remaining eight are Flu. This satisfies 3-diversity, but the attacker can still assert that the target person's disease is Flu with 80% confidence (eight tuples out of ten).
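The arithmetic of the example can be checked with a short sketch: the class of ten tuples has three distinct disease values, and Flu accounts for eight of the ten. Counting distinct values is only the simplest reading of "well-represented" and is an assumption here.

```python
from collections import Counter

# The equivalence class from the example: ten tuples, eight of them Flu.
diseases = ["Cancer", "Heart Disease"] + ["Flu"] * 8

counts = Counter(diseases)

# Distinct l-diversity: number of different sensitive values in the class.
distinct_values = len(counts)

# Attacker's best guess and the fraction of tuples that support it.
best_guess, best_count = counts.most_common(1)[0]
confidence = best_count / len(diseases)

print(distinct_values, best_guess, confidence)
```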

Data anonymity
Drawbacks: database anonymity strategies are effective only under the assumption that each individual is the respondent of at most one record in the released microdata.
In a network flow data set, each IP address (identity) can appear many times.
Data anonymity techniques are not directly suitable!

Idea: Goal
In this work, we propose a novel obfuscation technique for network flows that provides formal guarantees under realistic assumptions about the adversary's knowledge (fingerprint or injection attacks).

Idea
[Figure: two original flows f and g are mapped to obfuscated flows f* and g* whose endpoints are IP groups, with f*[fp-qi] = g*[fp-qi]. IP group α: IP A 129.19.133.199, IP B 213.16.92.171, IP C 194.15.20.101. IP group β: IP D 66.200.181.12, IP E 72.149.130.8, IP F 194.158.20.101. IP group δ: IP G 68.120.47.25, IP H 123.163.32.80, IP I 203.48.4.172. f* goes from group α to group β, g* from group α to group δ.]
f* is indistinguishable from g* based on the hosts' fingerprint.
- Build IP groups of K addresses based on their behavior affinity;
- Group flows so that at least J flows from distinct IP addresses share the same flow values.

Algorithm details 1/3
Input:
- L: original set of network flows;
- fp-qi: set of fingerprint quasi-identifiers;
- K: minimum group size.
Output:
- L*: obfuscated data set.

Algorithm details 2/3
Input:
- L: original set of network flows;
- fp-qi: set of fingerprint quasi-identifiers;
- K: minimum group size.
Output:
- IP groups: G 1, G 2, ..., G j;
- IP group identifiers: GID 1, GID 2, ..., GID j.
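The slides do not spell out how IP groups are built, so the following is only a naive stand-in and not the authors' grouping algorithm: it sorts addresses by a single activity count (a placeholder for "behavior affinity") and cuts the sorted list into groups of at least K. The `make_groups` name, the `GIDn` labels and the activity numbers are assumptions.

```python
def make_groups(activity, k):
    """Chunk IPs, sorted by activity, into groups of at least k addresses.

    If fewer than k addresses exist in total, a single smaller group is
    returned; a real implementation would have to handle that case.
    """
    ordered = sorted(activity, key=lambda ip: (activity[ip], ip))
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # Fold a short trailing group into the previous one so sizes stay >= k.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    return {f"GID{n}": group for n, group in enumerate(groups, start=1)}


# Hypothetical per-address flow counts used as the only behavior feature.
activity = {
    "129.19.133.199": 12,
    "213.16.92.171": 15,
    "194.15.20.101": 14,
    "66.200.181.12": 900,
    "72.149.130.8": 870,
}

ip_groups = make_groups(activity, k=2)
print(ip_groups)
```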

Algorithm details 3/3
Input:
- L: original set of network flows;
- fp-qi: set of fingerprint quasi-identifiers;
- j: minimum number of fp-indistinguishable flows;
- τ: time granularity.
Output:
- L*: obfuscated data set.

Obfuscated data sets
Each non fp-qi field changes as follows:
- src, dst IP: group IP
- bytes, packets: (min, max) interval
- tos, proto: set of values
- flags: XOR-ed values
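One possible reading of this field mapping is sketched below: the flows that end up in the same bucket are merged into one generalized record. The record layout, the `generalize` helper and the group identifiers are assumptions, and so is the way flags are combined, since the slide only says "Xor-ed values".

```python
from functools import reduce
from operator import xor


def generalize(flows, src_group_id, dst_group_id):
    """Merge the flows of one bucket into a single generalized record."""
    byte_counts = [f["bytes"] for f in flows]
    packet_counts = [f["packets"] for f in flows]
    return {
        "src": src_group_id,                          # IP replaced by its group
        "dst": dst_group_id,
        "bytes": (min(byte_counts), max(byte_counts)),      # (min, max) interval
        "packets": (min(packet_counts), max(packet_counts)),
        "tos": sorted({f["tos"] for f in flows}),     # set of observed values
        "proto": sorted({f["proto"] for f in flows}),
        # Assumption: XOR of the flag bytes across the bucket.
        "flags": reduce(xor, (f["flags"] for f in flows)),
    }


# Two hypothetical flows that fall in the same bucket.
bucket = [
    {"bytes": 100, "packets": 2, "tos": 0, "proto": 6, "flags": 0x12},
    {"bytes": 300, "packets": 5, "tos": 0, "proto": 17, "flags": 0x10},
]

record = generalize(bucket, "GID1", "GID2")
print(record)
```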

Suppressed flows
[Figure: suppressed flows (%) as a function of the time granule τ (1, 2, 4, 8, 16 and 32 minutes), with one curve for each j from 2 to 7.]

Obfuscated data set quality
There are no universally accepted criteria for evaluating an obfuscated or anonymized data set. Many network data analysis techniques rely on information-theoretic approaches or on statistical information, so the evaluation is:
- entropy based;
- query based.

Experiment results: information theory based analysis
Entropy of the source IP address distribution (one hour and one week): $H(X) = -\sum_i p_i \log p_i$
[Figures: entropy on source IP addresses for the original flows and for k=5, k=10 and k=20; one panel over the time of day (12:00 to 13:00), one panel over the day of the week.]
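The entropy measure can be sketched as follows, assuming the usual Shannon definition with p_i the fraction of flows carrying source address i. The base-2 logarithm, the sample addresses and the group mapping are assumptions chosen for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Four flows from four distinct source addresses.
src_ips = ["129.19.133.199", "213.16.92.171", "66.200.181.12", "72.149.130.8"]

# Hypothetical obfuscation: each address is replaced by its group identifier.
group_of = {
    "129.19.133.199": "GID1",
    "213.16.92.171": "GID1",
    "66.200.181.12": "GID2",
    "72.149.130.8": "GID2",
}

original_entropy = entropy(src_ips)
obfuscated_entropy = entropy([group_of[ip] for ip in src_ips])
print(original_entropy, obfuscated_entropy)
```

Replacing addresses with group identifiers can only merge values, so the entropy of the obfuscated field never exceeds that of the original one.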

Experiment results: information theory based analysis
Entropy of the destination IP address distribution (one hour and one week).
[Figures: entropy on destination IP addresses for the original flows and for k=5, k=10 and k=20; one panel over the time of day (12:00 to 13:00), one panel over the day of the week.]

Experiment results: statistical analysis
We executed queries for each possible value/range, and for each minute in a one-hour time window, for a total of about 120,000 queries. For each query, we calculated the error rate by the following formula:
$e = \frac{\left| \frac{r}{t} - \frac{r^*}{t^*} \right|}{\frac{r}{t}}$
where $r$ (resp. $r^*$) is the result of the query on the original (resp. obfuscated) flows, and $t$ (resp. $t^*$) is the total number of original (resp. obfuscated) flows.
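A small sketch of the per-query error follows. The formula is garbled in the extracted slide, so this assumes it is the relative difference between the share of matching flows before obfuscation (r/t) and after it (r*/t*). The sample counts are made up.

```python
def error_rate(r, t, r_obf, t_obf):
    """Relative error between the original and obfuscated query shares.

    Assumed reading of the slide's formula: |r/t - r*/t*| / (r/t).
    Requires r > 0 and t > 0, otherwise the ratio is undefined.
    """
    original_share = r / t
    obfuscated_share = r_obf / t_obf
    return abs(original_share - obfuscated_share) / original_share


# Hypothetical query: 50 of 100 original flows match, 25 of 100 obfuscated ones.
e = error_rate(r=50, t=100, r_obf=25, t_obf=100)
print(e)
```

Normalizing by t and t* keeps the comparison meaningful when suppressed flows make the obfuscated data set smaller than the original.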

Experiment results: tos, proto and flag bucketization
[Figures: average error (%) as a function of the time granule τ (1, 2, 4, 8, 16 and 32 minutes), with one curve for each j from 2 to 7; three panels: proto field, flag field, tos field.]

Experiment results: bytes and packets bucketization
[Figures: average error (%) as a function of the time granule τ (1, 2, 4, 8, 16 and 32 minutes), with one curve for each j from 2 to 7; two panels: query on byte field, query on packet field.]

K-J obfuscation benefits
- Makes flows indistinguishable under a fingerprint attack;
- Preserves traffic diversity and data quality;
- Comes with formal guarantees (refer to the paper for more details).

Questions? Thanks