A Practical Framework for Privacy-Preserving Data Analytics




A Practical Framework for Privacy-Preserving Data Analytics

Liyue Fan, Integrated Media Systems Center, University of Southern California, Los Angeles, CA, USA. liyuefan@usc.edu (Work done while interning with Samsung.)
Hongxia Jin, Samsung R&D Research Center, San Jose, CA, USA. hongxia@acm.org

ABSTRACT

The availability of an increasing amount of user generated data is transformative to our society. We enjoy the benefits of analyzing big data for public interest, such as disease outbreak detection and traffic control, as well as for commercial interests, such as smart grid and product recommendation. However, the large collection of user generated data contains unique patterns and can be used to re-identify individuals, which has been exemplified by the AOL search log release incident. In this paper, we propose a practical framework for data analytics, while providing differential privacy guarantees to individual data contributors. Our framework generates differentially private aggregates which can be used to perform data mining and recommendation tasks. To alleviate the high perturbation errors introduced by the differential privacy mechanism, we present two methods with different sampling techniques to draw a subset of individual data for analysis. Empirical studies with real-world data sets show that our solutions enable accurate data analytics on a small fraction of the input data, reducing user privacy risk and data storage requirement without compromising the analysis results.

Categories and Subject Descriptors: H.2.7 [Database Management]: Database Administration - Security, integrity, and protection; H.2.8 [Database Management]: Database Applications - Data mining

Keywords: Data Analytics, Differential Privacy, Sampling

1. INTRODUCTION

We live in the age of big data. With an increasing number of people, devices, and sensors connected with digital networks, individual data now can be largely collected and analyzed to understand important phenomena. One example is Google Flu Trends (http://www.google.org/flutrends/), a service that estimates flu activity by aggregating individual search queries. In the retail market, individual purchase histories are used by recommendation tools to learn trends and patterns. Performing analytics on private data is clearly beneficial, such as early detection of disease and recommendation services. However, user concerns rise from a privacy perspective, with sharing an increasing amount of information regarding their health, location, service usage, and online activities. As a matter of fact, the uniqueness of each user is increased by the big collection of individual data. The AOL data release in 2006 is an unfortunate example of privacy catastrophe [1], in which the search logs of an innocent citizen were quickly identified by a newspaper journalist. A recent study by de Montjoye et al. [9] concludes that human mobility patterns are highly unique and four spatio-temporal points are enough to uniquely identify 95% of the individuals. In order to protect users from re-identification attacks, their private data must be transformed prior to release for analysis.

Figure 1: Record Distribution of Netflix Users (records per user, with a cutoff line separating the Data Loss region from the Privacy Surplus region)
The current state-of-the-art paradigm for privacy-preserving data analysis is differential privacy [10], which allows un-trusted parties to access private data through aggregate queries. The aggregate statistics are perturbed by a randomized algorithm, such that the output remains roughly the same even if any user is added or removed in the input data. Differential privacy provides a strong guarantee: given the output statistics, an adversary will not be able to infer whether any user is present in the input database. However, this indistinguishability can only be achieved at a high perturbation cost. Intuitively, the more data a user contributes to the analysis process, the more perturbation noise is needed to hide his/her presence. In some cases, a user could generate an unbounded amount of data, such as purchase or check-in history, the addition or removal of which may result in unlimited impact on the output. The challenge of enforcing differential privacy is that it incurs a surplus of privacy cost, i.e. high perturbation error, being designed to protect each user according to the highest possible data contribution. In reality, only a very small number of users generate a large amount of personal data, while the rest contribute little data each.

As shown in Figure 1, among the users from the Netflix prize competition [2], only one user generated around 17,000 data records, while the majority of users generated far less personal data each. If an upper bound is imposed on individual user data contribution, the surplus of privacy, e.g. high perturbation noise, can be reduced at the cost of data loss, i.e. part of the data from those users who contributed more than the threshold. To limit individual data contribution, some strategies have been adopted by several works [16][25]. The authors of [16] used the first d search queries submitted by each user, and the work in [25] reduced the number of items contained in each transaction to l with smart truncation. However, there has been no discussion on the choice of the bounds, i.e., d and l. Furthermore, the choice of the actual user records (or items in a single transaction) to retain remains non-trivial for generic applications.

With a rigorous privacy notion, we consider how to analyze individually contributed data to gain a deep understanding of service usage and behavior patterns, for various application domains. We would like to understand the impacts of privacy and data loss on the resulting data analytics, and design algorithms to draw private data accordingly. Example data analytical questions are: "Which places do people visit on Thursdays?" and "What are the most popular movies with female watchers under age 25?" We formally define the tasks as database queries; details are provided in Section 3.

Contributions. In this paper, we address the problem of differentially private data analytics, where each user could contribute a large number of records. We propose a generic framework to generate analysis results on a sampled database, and study two sampling methods as well as the sampling factor in order to achieve a balance between data loss and privacy surplus. We summarize the contributions of this paper as follows:

(1) We propose a generic, sampling-based framework for an important class of data analytical tasks: top-K mining and context-aware recommendation. We consider the problem of releasing a set of count queries regarding the domain-specific items of interest as well as customizable predicates to answer deep, analytical questions. The count queries are perturbed prior to release such that they satisfy differential privacy.

(2) We design two algorithms that draw a sample of user records from the raw database and generate analysis results on the sampled data. The SRA algorithm randomly samples up to l records per user. The HPA algorithm selects up to l records from each user that are most useful for the specific analytical tasks. The utility of each record can be customized based on the actual application domain. We outline each sampling method and provide pseudocode for easy implementation.

(3) We provide analysis on the accuracy of random sampling, i.e. the Mean Squared Error of released counts, with respect to the sampling factor. We conclude that the optimal value is positively correlated with the privacy constraint. We show that performing record sampling on each individual user's data does not inflict extra privacy leakage. We formally prove that both sampling algorithms satisfy differential privacy.

(4) We conduct extensive empirical studies with various real-world data sets. We compare our approaches with existing differentially private mechanisms and evaluate the accuracy of released count data with three utility metrics.
The experimental results show that although performed on a small sampled database, our methods provide comparable performance to the best existing approaches in MSE and KL-divergence, and superior performance in top-K discovery and context-aware recommendation tasks. The HPA algorithm yields higher precision, while the SRA algorithm preserves well the distributional properties of the released data. We believe that our privacy-preserving framework will enable data analytics for a variety of services, reducing user privacy cost and data storage requirement without compromising output utility.

The rest of the paper is organized as follows: Section 2 briefly surveys the related works on privacy-preserving data publishing and analytics. Section 3 defines the problem and privacy notion. Section 4 presents the technical details of the proposed framework and two sampling algorithms. Theoretical results on privacy guarantees are provided in Section 5. Section 6 describes the data sets and presents a set of empirical studies. Finally, Section 7 concludes the paper and states possible directions for future work.

2. RELATED WORKS

A plethora of differentially private techniques have been developed since the introduction of epsilon-differential privacy in [12]. Here we briefly review the most recent works relevant to our problem.

Differential Privacy. Dwork et al. [12] first proposed epsilon-differential privacy and established the Laplace mechanism to perturb aggregate queries to guarantee differential privacy. Since then, two variants have been proposed and adopted by many works as relaxations of epsilon-differential privacy. The (epsilon, delta)-probabilistic differential privacy [19] achieves epsilon-differential privacy with high probability, i.e. (1 - delta). The (epsilon, delta)-indistinguishability [11, 12] relaxes the bound of epsilon-differential privacy by introducing an additive term delta. Our work adopts the strict definition of epsilon-differential privacy and the Laplace mechanism to release numeric data for analysis.

Data Publication Techniques. A plethora of works have been proposed to publish sanitized data with differential privacy. To list a few representatives among them, there is histogram publication for range queries [7], for a given workload [24], and for sparse data [8]. The majority of data publication methods consider settings where each user contributes only one record, or affects only one released count. In contrast, we focus on those services where each individual may contribute a large number of records and could even have unbounded influence on the released count queries.

Bounding Individual Contribution. Here we review works established in a similar problem setting, i.e. where individual data contribution, and hence global sensitivity, is high. The work of Nissim et al. [21] proposed smooth sensitivity, which measures individual impact on the output statistics in the neighborhood of the database instance. They showed that smooth sensitivity allows a smaller amount of perturbation noise to be injected into released statistics. However, it does not guarantee epsilon-differential privacy. Proserpio et al. [22] recently proposed to generalize the epsilon-DP definition to weighted datasets, and scale down the weights of data records to reduce sensitivity. Rastogi and Nath [23], Fan and Xiong [13] and Chan et al. [5] studied the problem of sharing time series of counts with differential privacy, where the maximum individual contribution is T, the number of time points. The authors of [23] proposed to preserve only k discrete Fourier coefficients of the original count series.
The FAST framework in [13] reduces the sensitivity by sampling M points in a count series and predicting at the other points. The work [5] proposed the notion of p-sums to ensure that each item in the stream only affects a small number of p-sums. Two works, by Korolova et al. [16] and Hong et al. [14], addressed the differentially private publication of search logs, where each user could contribute a large search history. The work of [16] keeps the first d queries of each user, while the work of [14] explicitly removes those users whose data change the optimal output by more than a certain threshold. Zeng et al. [25] studied frequent itemset mining with differential privacy and truncated each individual transaction to contain up to l items. Recently, Kellaris and Papadopoulos [15] proposed to release non-overlapping count data by grouping similar columns, i.e. items in our definition. In their work, each user is allowed to contribute no more than one record to each column, thus the maximum individual contribution is bounded by the number of columns. However, the binary representation of user data may not truly convey information about each column, i.e. place of interest or product. For example, when the bit for a user and a location is set, we cannot distinguish whether it was an accidental check-in or whether the user went there many times due to personal preference.

Sampling and Differential Privacy. There have been a few works which studied the relationship between sampling and differential privacy. Chaudhuri and Mishra [6] first showed that the combination of k-anonymity and random sampling can achieve differential privacy with high probability. Li et al. [17] proposed to sample each record with a calibrated probability beta and then perform k-anonymity on the sampled data, to achieve (epsilon, delta)-indistinguishability. Both works adopt the random sampling technique which samples a data record with a certain probability. However, when applied in our setting, no guarantee is provided on bounding the individual data in the sampled database.

Our Competitors. After reviewing existing differentially private techniques, we identify three works that allow high individual contribution, release aggregate statistics, and satisfy epsilon-differential privacy. The first is a straight-forward application of the Laplace perturbation mechanism [12] to each released count, denoted as LPA. The second is the Fourier transform based algorithm from [23], which can be adapted to share count vectors, denoted as DFT. The third is GS, which is the best method proposed in [15].

3. PRELIMINARIES

3.1 Problem Formulation

Suppose a database D contains data records contributed by a set of n users about m items of interest. Each item could represent in reality a place or a product. Each record in dataset D is a tuple (rid, uid, vid, attr), where rid is the record ID, uid corresponds to the user who contributed this record, vid is the item which the record is about, and attr represents contextual/additional information regarding this record. In reality, various information is often available in the actual database, such as transaction time, user ratings and reviews, and user demographic information. In our problem setting, attr can be a single attribute, e.g. day-of-week, or a set of attributes, e.g. (Gender, Age), which can be customized to offer deep insight in a specific application domain. Let h denote the number of possible attr values. To be more concrete, we select two analytic tasks, i.e. top-K discovery and context-aware recommendation, to illustrate the usability of our solutions.
The first task answers questions such as "What are the most popular places in city X?", while the second task aims more specifically at "Recommend good places to visit on a Tuesday!". Moreover, we consider the problem of performing the above tasks on released count data. As in Figure 2, each V_i represents an item of interest, e.g. a restaurant, and each A_j represents a value of the context, e.g. Monday. For each V_i, the number of records containing V_i is released. For each edge connecting V_i and A_j, the number of records containing V_i and A_j is released. As a result, top-K discovery can be performed on the item counts and context-aware recommendation on the edge counts connected to any context A_j. We formally state the problem to investigate below.

Figure 2: Releasing Counts for Data Analytics (a bipartite graph between items V_1, ..., V_m and context values A_1, ..., A_h)

DEFINITION 1 (ITEM COUNTS). For each item V_i in database D, release

c_i(D) := select count(*) from D where vid = V_i.    (1)

DEFINITION 2 (EDGE COUNTS). For each edge connecting item V_i and attribute value A_j, release

c_{i,j}(D) := select count(*) from D where vid = V_i and attr = A_j.    (2)

PROBLEM 1 (PRIVATE DATA ANALYTICS). Given database D and privacy parameter epsilon, release a sanitized version of the item counts and edge counts, such that the released data satisfies epsilon-differential privacy.

Note that the problem definition, i.e. the counting queries to release, can be customized according to the analytical task to perform. For instance, to understand the correlation between items, the bipartite graph in Figure 2 can be adapted as follows: the A nodes are replaced by items, i.e. V nodes, and each edge (V_i, V_j) represents the number of times that V_j is purchased/watched/visited by users who also purchase/watch/visit V_i. Similarly, those counts can be released privately with slight adaptation of our proposed solutions below.

3.2 Privacy Definition

The privacy guarantee provided by our work is differential privacy [4]. Simply put, a mechanism is differentially private if its outcome is not significantly affected by the removal or addition of any user. An adversary thus learns approximately the same information about any individual, irrespective of his/her presence or absence in the original database.

DEFINITION 3 (EPSILON-DIFFERENTIAL PRIVACY). A non-interactive privacy mechanism A : D -> T satisfies epsilon-differential privacy if for any neighboring databases D_1 and D_2, and for any output set D~ in T,

Pr[A(D_1) = D~] <= e^epsilon * Pr[A(D_2) = D~],    (3)

where the probability is taken over the randomness of A.

The privacy parameter epsilon, also called the privacy budget [20], specifies the degree of privacy offered. Intuitively, a lower value of epsilon implies a stronger privacy guarantee and a larger perturbation noise, and a higher value of epsilon implies a weaker guarantee while possibly achieving higher accuracy. The neighboring databases D_1 and D_2 differ on at most one user.

Laplace Mechanism. Dwork et al. [12] show that epsilon-differential privacy can be achieved by adding i.i.d. noise to the query result q(D):

q~(D) = q(D) + (N~_1, ..., N~_z),    (4)
N~_i ~ Lap(0, GS(q)/epsilon)  for i = 1, ..., z,    (5)

where z represents the dimension of q(D). The magnitude of N~_i conforms to a Laplace distribution with mean 0 and scale GS(q)/epsilon, where GS(q) represents the global sensitivity [12] of the query q. The global sensitivity is the maximum L1 distance between the results of q on any two neighboring databases. Formally, it is defined as follows:

GS(q) = max_{D_1, D_2} ||q(D_1) - q(D_2)||_1.    (6)
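To make the released queries and the Laplace mechanism concrete, the following Python sketch computes the item and edge counts of Definitions 1 and 2 from a toy record list and perturbs them with calibrated Laplace noise. The record layout, the toy values, and all function names are illustrative assumptions for this sketch, not part of the paper.

from collections import Counter
import numpy as np

def item_edge_counts(records, items, contexts):
    """Compute c_i and c_{i,j} of Definitions 1 and 2 from (rid, uid, vid, attr) tuples."""
    item_c = Counter(r[2] for r in records)                  # c_i keyed by vid
    edge_c = Counter((r[2], r[3]) for r in records)          # c_{i,j} keyed by (vid, attr)
    q1 = np.array([item_c[v] for v in items], dtype=float)
    q2 = np.array([edge_c[(v, a)] for v in items for a in contexts], dtype=float)
    return q1, q2

def laplace_perturb(counts, sensitivity, epsilon, rng=None):
    """Laplace mechanism: add i.i.d. Lap(0, GS/epsilon) noise to each released count."""
    rng = rng or np.random.default_rng()
    return counts + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)

# Hypothetical toy usage:
records = [(1, "alice", "museum", "Fri"), (2, "alice", "bridge", "Sat"),
           (3, "bob",   "museum", "Fri")]
q1, q2 = item_edge_counts(records, items=["museum", "bridge"], contexts=["Fri", "Sat"])
noisy_q1 = laplace_perturb(q1, sensitivity=2, epsilon=0.5)   # GS = M = 2 here (max records per user)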

Symbol: Description
D / 𝒟: Input database / Domain of all databases
T_k: Set of records contributed by user u_k in D
D_R / 𝒟_R: SRA sampled database / Domain of D_R
D_G / 𝒟_G: HPA sampled database / Domain of D_G
D_E / 𝒟_E: HPA sampled database for popularity estimation / Domain of D_E
q_1 / q~_1: Query of all item counts / Noisy output of q_1
q_2 / q~_2: Query of all edge counts / Noisy output of q_2
p / p^: Popularity vector for all items / Estimation of p
M: Max records per user allowed in D
l: Max records per user allowed in D_R and D_G
d: Max records per user allowed in D_E
Table 1: Summary of notations

Sensitivity Analysis. Let M denote the maximum number of records any user could contribute and 𝒟 denote the domain of database D. Let q_1 = {c_1, ..., c_m} output the item counts for every V_i. Let q_2 = {c_{1,1}, c_{1,2}, ..., c_{m,h}} output the edge counts for every V_i and A_j. The following lemmas establish the global sensitivity of q_1 and q_2, in order to protect the privacy of each individual user. The proofs are quite straightforward and thus omitted here for brevity.

LEMMA 1 (ITEM COUNTS SENSITIVITY). The global sensitivity of q_1 : 𝒟 -> R^m is M, i.e.

GS(q_1) = M.    (7)

LEMMA 2 (EDGE COUNTS SENSITIVITY). The global sensitivity of q_2 : 𝒟 -> R^{mh} is M, i.e.

GS(q_2) = M.    (8)

Composition. The composition properties of differential privacy provide privacy guarantees for a sequence of computations, and can be applied to mechanisms that require multiple steps.

THEOREM 1 (SEQUENTIAL COMPOSITION [20]). Let each A_i provide epsilon_i-differential privacy. A sequence of A_i(D) over the dataset D provides (sum_i epsilon_i)-differential privacy.

4. PROPOSED SOLUTIONS

Below we describe two sampling-based solutions to privacy-preserving data analytics. The notations used in the problem definition and our proposed solutions are summarized in Table 1.

4.1 Simple Random Algorithm (SRA)

Our first solution is inspired by the fact that the maximum number of records contributed by each user, i.e. M, could be rather large in real applications. For example, the Netflix user who contributed the most data submitted around 17,000 reviews, as shown in Table 4. In fact, a user could contribute as many records as the domain size, i.e. m, as in the total number of movies on Netflix. As a result of the large magnitude of M, a very high perturbation noise is required to provide differential privacy, according to the Laplace mechanism. Furthermore, the number of records contributed by each user can be unbounded for many applications, as a user could repeatedly check in at the same location or purchase the same product. In that case, M may not be known without breaching individual user privacy.

Figure 3: Outline of the SRA Algorithm
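To see why a large M is costly, recall that a Lap(0, b) variable has standard deviation sqrt(2) * b. The following comparison is not stated in this form in the paper, but it follows directly from the Laplace mechanism above, using only the quantities already defined (with l the per-user bound that the algorithms below will enforce):

\[
\sigma_{\text{unbounded}} \;=\; \sqrt{2}\,\frac{GS(q_1)}{\epsilon_1} \;=\; \sqrt{2}\,\frac{M}{\epsilon_1}
\qquad \text{versus} \qquad
\sigma_{\text{sampled}} \;=\; \sqrt{2}\,\frac{l}{\epsilon_1},
\]

i.e. bounding each user's contribution to l records shrinks the per-count noise scale by a factor of M/l.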
In order to mitigate the effect of very large or unbounded individual data contribution, we propose to sample the raw input dataset D and allow up to l records per user in the sampled database. Therefore, the individual contribution to the sampled database is bounded by the fixed constant l. The aggregate statistics will be generated from the sampled data and then perturbed correspondingly in order to guarantee differential privacy. The sampling technique used in this solution is simple random sampling without replacement, after which the solution is named. An outline of the SRA algorithm is provided in Figure 3. Given the input database D and a pre-defined sampling factor l, the SRA method generates a sampled database D_R by randomly sampling, without replacement, at most l records for each user in the input database D. The sampled database D_R could be different every time the algorithm is run, due to the randomness of sampling. However, it is guaranteed that in every possible sample D_R, any user could have no more than l records. The following lemma establishes the sensitivity of q_1 and q_2 under such a constraint.

LEMMA 3 (SAMPLE SENSITIVITY). In the domain of D_R, it holds that GS(q_1) = l and GS(q_2) = l.

Subsequently, the SRA method computes the query answers to q_1 and q_2 from the sampled database D_R, where all individual count queries c_i and c_{i,j} are evaluated based on the data records in D_R. According to the Laplace mechanism, it is sufficient to add perturbation noise from Lap(l/epsilon_1) to each item count c_i(D_R) to guarantee epsilon_1-differential privacy. Similarly, adding perturbation noise from Lap(l/epsilon_2) to each edge count c_{i,j}(D_R) guarantees epsilon_2-differential privacy. The pseudocode of the SRA method is provided in Algorithm 1.

Algorithm 1 Simple Random Algorithm (SRA)
Input: raw dataset D, sampling factor l, privacy budget epsilon
Output: sanitized answers q~_1 and q~_2
/* Simple Random Sampling */
1: D_R <- empty
2: for k = 1, ..., n do
3:   T_k <- sigma_{uid=u_k}(D)   /* T_k: records of user u_k */
4:   if |T_k| <= l then
5:     D_R <- D_R union T_k
6:   else
7:     T'_k <- randomly sample l records from T_k
8:     D_R <- D_R union T'_k
/* Generate Private Item Counts */
9: q_1(D_R) <- compute count c_i(D_R) for every i
10: Output q~_1(D_R) = q_1(D_R) + (Lap(l/epsilon_1))^m
/* Generate Private Edge Counts */
11: q_2(D_R) <- compute count c_{i,j}(D_R) for every i, j
12: Output q~_2(D_R) = q_2(D_R) + (Lap(l/epsilon_2))^{mh}
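For concreteness, the following Python sketch mirrors Algorithm 1. The record layout ((rid, uid, vid, attr) tuples) and the function names are illustrative assumptions for this sketch, not the authors' implementation.

import random
from collections import Counter, defaultdict
import numpy as np

def sra(records, items, contexts, l, eps1, eps2, seed=None):
    """Algorithm 1 sketch: keep at most l records per user at random, then add Laplace noise."""
    rnd, rng = random.Random(seed), np.random.default_rng(seed)
    by_user = defaultdict(list)
    for rid, uid, vid, attr in records:
        by_user[uid].append((rid, uid, vid, attr))
    sampled = []
    for recs in by_user.values():
        # keep everything for users with <= l records, otherwise simple random sampling
        sampled.extend(recs if len(recs) <= l else rnd.sample(recs, l))
    item_c = Counter(vid for _, _, vid, _ in sampled)
    edge_c = Counter((vid, attr) for _, _, vid, attr in sampled)
    # GS(q1) = GS(q2) = l on the sampled domain (Lemma 3), so Lap(l/eps) noise suffices
    noisy_q1 = {v: item_c[v] + rng.laplace(0.0, l / eps1) for v in items}
    noisy_q2 = {(v, a): edge_c[(v, a)] + rng.laplace(0.0, l / eps2)
                for v in items for a in contexts}
    return noisy_q1, noisy_q2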

To sum up, SRA injects low Laplace noise into the released query results, due to the reduced sensitivity in the sampled database. However, the accuracy of the released query results is affected by only using D_R, a subset of the input data D. Intuitively, the more we sample from each user, the closer q_1(D_R) and q_2(D_R) are to the true results q_1(D) and q_2(D), respectively, at the cost of a higher Laplace perturbation error to achieve differential privacy. Below we formally analyze the trade-off between accuracy and privacy for query q_1 to study the optimal choice of l. Similar analysis can be conducted for query q_2 and is thus omitted here for brevity.

DEFINITION 4 (MEAN SQUARED ERROR). Let c~_i denote the noisy count released by q~_1(D_R) and c_i denote the real count computed by q_1(D), for each item V_i. The Mean Squared Error of the noisy count c~_i is defined as follows:

MSE(c~_i) = Var(c~_i) + (Bias(c~_i, c_i))^2.    (9)

THEOREM 2. Given that D_R is a simple random sample of D and q~_1(D_R) = q_1(D_R) + (Lap(l/epsilon_1))^m, the value of l that minimizes the MSE is a monotonically increasing function of epsilon_1^2.

PROOF. See Appendix A.

The above theorem provides a guideline to choose the value of l given the privacy budget epsilon_1: when the privacy budget is higher, we can afford to use more private data to overcome the error due to data loss; when the privacy budget is limited, a small number of data records should be taken from each user to reduce the perturbation error introduced by the differential privacy mechanism.
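For intuition, the error of a released item count under simple random sampling decomposes into three parts. This is Equation (21) of Appendix A, restated here under its assumptions (every user holds M records and item V_i has popularity p_i):

\[
\mathrm{MSE}(\tilde{c}_i) \;=\;
\underbrace{n\,l\,p_i\Big(1 - \frac{p_i l}{M}\Big)}_{\text{sampling variance}}
\;+\;
\underbrace{\frac{2\,l^2}{\epsilon_1^2}}_{\text{Laplace noise}}
\;+\;
\underbrace{\big(n\,l\,p_i - n\,M\,p_i\big)^2}_{\text{bias from data loss}},
\]

so increasing l shrinks the bias term but inflates the Laplace term, and the balance point grows with epsilon_1^2, as Theorem 2 states.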
4.2 Hand-Picked Algorithm (HPA)

Observing that a majority of data analytical tasks depend on popular places or products, such as traffic analysis and recommendation services, data related to popular items should preferably be preserved in the sampled database. In other words, some records generated by one user might be more useful for data analytics than the rest. The following example illustrates the concept of record usefulness.

rid  uid    vid                  Day-Of-Week
r1   Alice  Gym                  Monday
r2   Alice  Mary's house         Tuesday
r3   Alice  de Young Museum      Friday
r4   Alice  Golden Gate Bridge   Saturday
Table 2: Example Check-in Records

EXAMPLE 1. Table 2 illustrates Alice's check-in records in the raw database. Among the 4 places Alice has been, the de Young Museum and the Golden Gate Bridge are places of interest and attract a large number of visitors. On the other hand, the gym and Mary's house are local and personal to Alice and may not interest other people. Therefore we consider r3 and r4 more useful than r1 and r2 for data analytics. However, r1 and r2 may be chosen by SRA over r3 and r4, due to the simple random sampling procedure.

From Example 1, it can be seen that r3 and r4 should be picked by the sampling procedure over r1 and r2, in order to generate meaningful recommendation results. Therefore, we define the following popularity-based utility score for each private data record and propose to preserve the records with the highest scores for each user.

DEFINITION 5 (UTILITY SCORE). Given record r with r.vid = V_i, the utility score of r is defined as follows:

score(r) = p_i,    (10)

where p_i represents the underlying popularity of item V_i.

rid  uid    vid                  Day-Of-Week
r4   Alice  Golden Gate Bridge   Saturday
r3   Alice  de Young Museum      Friday
r1   Alice  Gym                  Monday
r2   Alice  Mary's house         Tuesday
Table 3: Example Check-in Records Sorted by Utility Score

Note that the record utility can be defined in other ways according to the target analytical questions. Our choice of the popularity-based measure is motivated by the tasks of discovering popular places or products, as well as the fact that popular items are less personal/sensitive to individual users.

In order to maximize the utility of the sampled database, we propose to greedily pick up to l records with the highest utility scores for each user. Note that a user's records with the same score will have an equal chance to be picked. The outline of HPA is provided in Figure 4. Below we describe (1) the private estimation of record utility and (2) the greedy sampling procedure.

Figure 4: Outline of the HPA Algorithm

Popularity Estimation. For each item V_i, the popularity p_i represents the probability of any record r having r.vid = V_i, which is often estimated by the relative frequency of such records. However, the estimation of the p_i values from the private user data must not violate the privacy guarantee. We present our privacy-preserving utility estimation in Algorithm 2, from Line 1 to Line 7, which is outlined in the upper half of Figure 4. The utility estimation is conducted on a sampled database D_E with sampling factor d. D_E is obtained by randomly choosing up to d records per user from the raw database D. We adopt random sampling here because we do not have prior knowledge about the database at this point. The query q_1 is computed based on D_E and each count is perturbed with Laplace noise from Lap(d/epsilon_0). The perturbed counts {c~_i(D_E)} are used to estimate the popularity of each item V_i by the following normalization:

p^_i = max(c~_i(D_E), 0) / sum_{i=1}^{m} max(c~_i(D_E), 0).    (11)

Since the Laplace perturbation noise is a random variable and therefore could be negative, we replace the negative counts with 0s in computing item popularity. The resulting p^_i is used to estimate the utility score of each record r with r.vid = V_i. The following lemma establishes the sensitivity of q_1 when each user can contribute up to d records. The proof is straightforward and is thus omitted.

LEMMA 4. In the domain of D_E, it holds that GS(q_1) = d.

Greedy Sampling. The greedy sampling procedure hand-picks up to l records with the highest utility scores among each user's data in D.
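A corresponding Python sketch of the two HPA stages is given below, under the same assumed record layout as the earlier SRA sketch; the helper names are again illustrative, and the popularity estimate follows Equation (11).

import random
from collections import Counter, defaultdict
import numpy as np

def hpa(records, items, contexts, d, l, eps0, eps1, eps2, seed=None):
    """Algorithm 2 sketch: noisy popularity estimate, then keep each user's l highest-scored records."""
    rnd, rng = random.Random(seed), np.random.default_rng(seed)
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec[1]].append(rec)                       # rec = (rid, uid, vid, attr)
    # Step 1: popularity estimation on a random sample of up to d records per user (budget eps0)
    est_sample = [r for recs in by_user.values()
                  for r in rnd.sample(recs, min(d, len(recs)))]
    est_counts = Counter(vid for _, _, vid, _ in est_sample)
    noisy = {v: max(est_counts[v] + rng.laplace(0.0, d / eps0), 0.0) for v in items}
    total = sum(noisy.values()) or 1.0                    # guard against an all-zero estimate
    pop = {v: noisy[v] / total for v in items}            # Equation (11)
    # Step 2: greedy sampling, keeping the l records with the highest estimated popularity per user
    picked = []
    for recs in by_user.values():
        ranked = sorted(recs, key=lambda r: pop.get(r[2], 0.0), reverse=True)
        picked.extend(ranked[:l])
    # Step 3: release perturbed counts from the greedily sampled database (budgets eps1, eps2)
    item_c = Counter(vid for _, _, vid, _ in picked)
    edge_c = Counter((vid, attr) for _, _, vid, attr in picked)
    noisy_q1 = {v: item_c[v] + rng.laplace(0.0, l / eps1) for v in items}
    noisy_q2 = {(v, a): edge_c[(v, a)] + rng.laplace(0.0, l / eps2)
                for v in items for a in contexts}
    return noisy_q1, noisy_q2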

Algorithm 2 Hand-Picked Algorithm (HPA)
Input: raw dataset D, sampling factor l
Output: sanitized answers q~_1 and q~_2
/* Popularity Estimation */
/* Random Sample */
1: D_E <- empty
2: for k = 1, ..., n do
3:   T_k <- sigma_{uid=u_k}(D)   /* T_k: records of user u_k */
4:   Randomly sample d records from T_k, add to D_E
/* Generate Private Item Counts */
5: q_1(D_E) <- compute count c_i(D_E) for every i
6: q~_1(D_E) = q_1(D_E) + (Lap(d/epsilon_0))^m
/* Estimate Popularity */
7: p^ <- normalize histogram q~_1(D_E)
/* Greedy Sampling */
8: D_G <- empty
9: for k = 1, ..., n do
10:   T_k <- sigma_{uid=u_k}(D)
11:   if |T_k| <= l then
12:     D_G <- D_G union T_k
13:   else
14:     for record r in T_k do
15:       assign score(r) = p^_i iff r.vid = V_i
16:     T'_k <- pick l records with highest scores from T_k
17:     D_G <- D_G union T'_k
/* Generate Private Item Counts */
18: q_1(D_G) <- compute count c_i(D_G) for every i
19: Output q~_1(D_G) = q_1(D_G) + (Lap(l/epsilon_1))^m
/* Generate Private Edge Counts */
20: q_2(D_G) <- compute count c_{i,j}(D_G) for every i, j
21: Output q~_2(D_G) = q_2(D_G) + (Lap(l/epsilon_2))^{mh}

The pseudocode of the greedy sampling is provided in Algorithm 2, from Line 8 to Line 17. Table 3 illustrates Alice's records sorted by utility score. Since the gym and Mary's house do not interest the greater public, their scores are likely to be much lower than those of the Golden Gate Bridge and the de Young Museum. The top-l records on the sorted list will then be put into the sampled database D_G. This step is performed on every user's data in the raw database D.

LEMMA 5. In the domain of D_G, it holds that GS(q_1) = l and GS(q_2) = l.

After the greedy sampling step, the results of q_1 and q_2 will be computed on the sampled database D_G. Each individual item count and edge count will be perturbed by Laplace noise from Lap(l/epsilon_1) and Lap(l/epsilon_2), respectively. We will provide the proof of the privacy guarantee in the next section.

The advantage of HPA is that it greedily picks the most valuable data records from each user, without increasing the sample data size, i.e. l records per user. The utility of each data record is estimated privately from the overall data distribution. Records with high utility have a higher chance to be picked by greedy sampling. Since the sampled data greatly depends on the relative usefulness among each user's records, it is difficult to analyze the accuracy of the released counts. We will empirically evaluate the effectiveness of this approach in Section 6.

             Gowalla   Foursquare   Netflix       MovieLens
Users        2,579     45,289       480,189       6,040
Items        5,2       7,967        17,770        3,706
|D|          739,6     1,276,988    100,480,507   1,000,209
max |T_k|    4,38      ,33          17,000        2,34
avg |T_k|    58.8      6.2          29.2          65.6
min |T_k|    1         1            1             20
Table 4: Data Sets Statistics

5. PRIVACY GUARANTEE

In this section, we prove that both the SRA and HPA algorithms are differentially private. We begin with the following lemma, which states that record sampling performed on each user does not inflict a differential privacy breach.

LEMMA 6. Let A be an epsilon-differentially private algorithm and S be a record sampling procedure which is performed on each user individually. A composed with S is also epsilon-differentially private.

PROOF. See Appendix B.

THEOREM 3. SRA satisfies (epsilon_1 + epsilon_2)-differential privacy.

PROOF. Let S_{rand,l} denote the random sampling procedure in SRA. S_{rand,l} is therefore a function that takes a raw database and outputs a sampled database, i.e. S_{rand,l} : 𝒟 -> 𝒟_R. According to the Laplace mechanism and Lemma 3, q~_1 : 𝒟_R -> R^m is epsilon_1-differentially private. By the above Lemma 6, the item counts released by SRA, i.e. q~_1 composed with S_{rand,l} : 𝒟 -> R^m, are epsilon_1-differentially private. Similarly, q~_2 : 𝒟_R -> R^{mh} is epsilon_2-differentially private, and the edge counts released by SRA, i.e. q~_2 composed with S_{rand,l} : 𝒟 -> R^{mh}, are also epsilon_2-differentially private. Therefore, the overall SRA computation satisfies (epsilon_1 + epsilon_2)-differential privacy by Theorem 1.

THEOREM 4. HPA satisfies (epsilon_0 + epsilon_1 + epsilon_2)-differential privacy.
PROOF. Let S_{rand,d} denote the random sampling procedure in HPA for popularity estimation, i.e. S_{rand,d} : 𝒟 -> 𝒟_E. Let S_{grd,l} denote the greedy sampling procedure, i.e. S_{grd,l} : 𝒟 -> 𝒟_G. According to the Laplace mechanism and Lemma 4, q~_1 : 𝒟_E -> R^m is epsilon_0-differentially private. By Lemma 6, the HPA popularity estimation step, i.e. q~_1 composed with S_{rand,d} : 𝒟 -> R^m, is epsilon_0-differentially private. Similarly, we can prove that the HPA item counts q~_1 composed with S_{grd,l} : 𝒟 -> R^m are epsilon_1-differentially private, and the HPA edge counts q~_2 composed with S_{grd,l} : 𝒟 -> R^{mh} are epsilon_2-differentially private. Therefore, by Theorem 1, the overall HPA satisfies (epsilon_0 + epsilon_1 + epsilon_2)-differential privacy.

6. EXPERIMENTS

Here we present a set of empirical studies. We compare our solutions SRA and HPA with three existing approaches: 1) LPA, the baseline method that injects Laplace perturbation noise into each count; 2) DFT, the Discrete Fourier Transform based algorithm proposed in [23], applied to a vector of counts; and 3) GS, the best method with grouping and smoothing proposed in [15], applied to count histograms. Given the overall privacy budget epsilon, we set epsilon_1 = epsilon_2 = 0.5 epsilon for the SRA method, and epsilon_0 = 0.1 epsilon and epsilon_1 = epsilon_2 = 0.45 epsilon for the HPA method. Without speculating about the optimal privacy allocation, we set epsilon_0 to a small fraction of epsilon, because it is used to protect only a small sample of private data for utility score estimation. To achieve the same privacy guarantee, we apply LPA, DFT, and GS to item counts and edge counts separately, with privacy budget 0.5 epsilon for each application.

Data sets. We conducted our empirical studies with four real-world data sets referred to as Gowalla, Foursquare, Netflix, and MovieLens, each named after its data source. The first two data sets consist of location check-in records. Gowalla was collected among users based in Austin from the Gowalla location-based social network by Berjani and Strufe [3] between June and October 2010.

Similarly, Foursquare was collected from Foursquare by Long et al. [18] between February and July 2012. In these two data sets, each record contains a user, a location, and a check-in time-stamp. Since a user can check in at one location many times, the check-in data sets can represent a class of services which value returning behavior, such as buying or browsing. The other two data sets consist of movie ratings, where a movie may not be rated more than once by a user. Netflix is the training data set for the Netflix Prize competition. MovieLens is collected from users of the MovieLens website (http://movielens.org). Each rating corresponds to a user, a movie, a rating score, and a time-stamp. Moreover, MovieLens also provides user demographic information, such as gender, age, occupation, and zipcode. The properties of the data sets are summarized in Table 4. Note that the minimum individual contribution in MovieLens is 20, as opposed to 1 for the other data sets. This is because MovieLens was initially collected for personalized recommendation, thus users with fewer than 20 records were excluded from the published data set.

Setup. We implemented our SRA and HPA methods, as well as the baselines LPA and DFT, in Java. We obtained the Java code of GS from the authors of [15]. All experiments were run on a 2.9GHz Intel Core i7 PC with 8GB RAM. Each setting was run repeatedly and the average result was reported. The default settings of parameters are summarized below: the overall privacy budget epsilon = 1.0, the sampling parameter for HPA popularity estimation d = min |T_k|, and the sampling parameter l = 10 for Gowalla, Foursquare, and Netflix and l = 30 for MovieLens. Our choice of parameter settings is guided by the analytical results and minimal knowledge about the data sets and thus might not be optimal. For LPA and DFT, we set M to be equal to max |T_k|. However, this value may not be known a priori. Strictly speaking, M is unbounded for check-in applications. In this sense, we overestimate the performance of LPA and DFT.

6.1 HPA: Private Popularity Estimation

We first examine the private popularity estimation step of the HPA method regarding its ability to discover the top-K popular items from the noisy counts q~_1(D_E). Recall that D_E is generated by randomly sampling d records per user and the output of q_1(D_E) is then perturbed with noise from Lap(d/epsilon_0) to guarantee privacy. Given a small privacy budget epsilon_0, it is only meaningful to choose a small d value for accuracy, according to Theorem 2. Therefore, we set d equal to the minimum individual contribution, i.e. min |T_k|, in every data set. In this experiment, we sort all items according to the q~_1(D_E) output and the items with the highest noisy counts are evaluated against the ground truth discovered from the raw data set. Figure 5 reports the precision results with various K values on the Foursquare and Netflix data. As can be seen, from the output of q~_1(D_E), we are able to discover more than 60% of the top-20 popular locations in Foursquare and 70% of the top-20 popular movies on Netflix. At larger K, the output of q~_1(D_E) still captures 40% of the real popular locations and almost 80% of the popular movies. We conclude that HPA popularity estimation provides a solid stepping stone for subsequent greedy sampling, at a very small cost of individual data as well as privacy.

Figure 5: Estimation of Item Popularity by HPA (top-K precision on (a) Foursquare and (b) Netflix)

6.2 Impact of Sampling Factor

Here we look at the upper bound of individual data contribution required by our solutions and study its impact on the accuracy of the q_1 and q_2 output. Mean Squared Error (MSE) is adopted as the metric for accuracy and is calculated between the noisy output of our methods and the true results of q_1 and q_2 from the raw input data D. We ran our SRA and HPA methods varying the value of l, in order to generate sampled databases D_R and D_G with different sizes. Figure 6(a) summarizes the results from the Foursquare data for item counts, i.e. q_1, and Figure 6(b) for edge counts, i.e. q_2. In both figures, when the l value increases, the MSE of the noisy output of our methods first drops as the sampled database gets larger. For example, we observe a decreasing trend of MSE as l is raised to 3 in Figure 6(a) and as l is raised to 5 in Figure 6(b). Beyond these two points, when further increasing l, the MSE grows due to the perturbation noise from Lap(l/epsilon_1). Clearly, there is a trade-off between sample data size and the perturbation error. The optimal value of l depends on the actual data distribution and the privacy parameter epsilon_1, according to Theorem 2. This set of results shows that both SRA and HPA achieve minimum MSE with relatively small l values, i.e. l = 3 for q_1 and l = 5 for q_2. Our findings in Theorem 2 are confirmed and we conclude that choosing a small upper bound on individual data contribution is beneficial especially when the privacy budget is limited.

Figure 6: Impact of l with the Foursquare Data Set ((a) MSE of q_1, (b) MSE of q_2)

6.3 Comparison of Methods

Here we compare our SRA and HPA methods with the existing approaches, i.e. LPA, DFT, and GS, on all data sets. The utility of the item counts and edge counts released by all private mechanisms is evaluated with three metrics. Note that for the Gowalla, Foursquare, and Netflix data, each edge connects an item with a day-of-week, from Monday to Sunday. For the MovieLens data set, each edge connects a movie with a (Gender, Age) pair. The domain of Gender is {M, F} and the domain of Age is {Under 25, 25-34, Above 34}. Below we review the results regarding the released item counts and edge counts, for each utility metric.

Figure 7: Utility of Released Item Counts ((a) Mean Squared Error, (b) KL-Divergence, (c) Top-K precision)
Figure 8: Utility of Released Edge Counts ((a) Mean Squared Error, (b) KL-Divergence, (c) Average Top-K precision)

Mean Squared Error (MSE). This metric provides a generic utility comparison of the different methods on the released counts. Figure 7(a) and Figure 8(a) summarize the MSE results for item counts and edge counts, respectively. As can be seen, the baseline LPA yields the highest error in both item counts and edge counts. The GS method, as studied in the original work [15], is no worse than DFT in every case except for MovieLens item counts. Our methods SRA and HPA provide the lowest MSE except in three cases, i.e. Netflix item counts and MovieLens item/edge counts. This can be explained by the high average user contribution in these two data sets, where our methods inflict more data loss by limiting individual data in the sampled database.

KL-divergence. The KL-divergence is a common metric widely used to measure the distance between two probability distributions. In this set of experiments, we consider the item/edge counts as data record distributions over the domain of items/edges. Both the released counts and the original counts are normalized to simulate probability distributions. Note that prior to that, zero or negative counts are replaced with a small positive constant for continuity, without generating many false positives. We compute the KL-divergence of the released distribution with respect to the original distribution for each query and present the results in Figure 7(b) and Figure 8(b). The distributions released by LPA are further from the original data distributions than those of the other methods, for every data set. As expected, DFT and GS preserve the count distributions well in general, because: 1) the DFT method is designed to capture major trends in data series, and 2) the GS method generates smooth distributions by grouping similar columns. However, in several cases, those two methods fail to provide similar distributions, e.g. on the Gowalla and Netflix data. We believe that their performance depends on the actual data distribution, i.e. whether a significant trend or near-uniform grouping exists and can be well extracted. On the other hand, our solutions SRA and HPA provide comparable performance to the best existing methods, although not optimized to preserve distributional similarities. Furthermore, SRA consistently outperforms HPA in approximating the true distributions, thanks to the nature of the simple random sampling technique.

Top-K Discovery. In this set of experiments, we examine the quality of the top-K discovery retrieved by all privacy-preserving mechanisms. For item counts, the top-K popular items are evaluated. For edge counts, the top-K popular items associated with each attribute value are evaluated and the average precision is reported, to simulate discoveries for each day-of-week and each user demographic group. In Figure 7(c), we observe that the existing methods fail to preserve the most popular items in any data set. The reason is that the baseline LPA suffers from high perturbation error, while DFT and GS yield over-smoothed released counts and thus cannot distinguish the most popular items from those ranked next to them. When K is large enough, their performance in top-K discovery slowly recovers, as we will see in a subsequent experiment. On the other hand, our methods SRA and HPA greatly outperform the existing approaches, and HPA even achieves 100% precision on the Netflix data. Similarly, our methods show superior performance in Figure 8(c), with the absolute precision slightly lower due to sparser data distributions. Overall, HPA outperforms SRA by preserving user records with high popularity scores.
The only exception, where SRA is better than HPA, is in finding the top-K most popular movies on MovieLens. The reason is that users who contributed fewer than 20 records were excluded from that data set, and no movies were preferred by the majority of the remaining users. As for finding the top-K movies for each demographic group, HPA greatly improves over SRA, since users within a demographic group show similar interests.

We further look at the top-K precision of the released item counts by all methods, with K ranging from 1 to 1,000. The results are provided in Figure 9. We can see that the performance of our greedy approach HPA is 100% when K = 1 and drops as K increases, since the sampling step only picks a small number of records, i.e. l records, from each user with the highest utility scores, i.e. item popularity. Our random approach SRA also shows decreasing precision as K increases, due to the data loss caused by sampling. However, the decreasing rate is much slower compared to that of HPA, because all records of a user have an equal chance to be selected by random sampling. On the contrary, LPA, DFT, and GS show 0% precision when K = 1 and higher precision as K increases. We conclude that SRA and HPA can discover the most popular items, superior to the existing approaches for K up to about 100, but do not distinguish less popular items due to the lack of information in the sampled database.
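For reference, the three utility measures used in this section can be computed along the following lines; the smoothing constant and the KL-divergence direction are illustrative assumptions for this sketch, since the paper does not spell them out.

import numpy as np

def mse(released, truth):
    """Mean Squared Error between released and true count vectors."""
    released, truth = np.asarray(released, float), np.asarray(truth, float)
    return float(np.mean((released - truth) ** 2))

def kl_divergence(released, truth, smooth=0.1):
    """KL-divergence of the released count distribution with respect to the original one.
    Non-positive counts are replaced by a small constant before normalizing."""
    q = np.maximum(np.asarray(truth, float), smooth)
    p = np.maximum(np.asarray(released, float), smooth)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def top_k_precision(released, truth, k):
    """Fraction of the true top-k items recovered by the top-k of the released counts."""
    top_rel = set(np.argsort(released)[::-1][:k])
    top_true = set(np.argsort(truth)[::-1][:k])
    return len(top_rel & top_true) / k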

Figure 9: Comparison of Methods: Top-K Mining (precision vs. K on (a) Gowalla, (b) Foursquare, (c) Netflix, (d) MovieLens)

The existing approaches fail to distinguish the most popular items, e.g. the top-10, because of the perturbation or the smoothing effect of their methods, but might provide good precision for large K, e.g. K = 1,000.

6.4 Additional Benefits

Data Reduction. One beneficial side effect of limiting individual data contribution is the reduction of data storage space by generating analytics from a sampled database. Figure 10 shows the number of records in the sampled databases used by SRA and HPA compared to that of the raw input. As can be seen, the sampled data is much smaller than the raw input for every data set. For the Netflix data set, our methods perform privacy-preserving analytics and generate useful results on sample databases with less than 5% of the original data, reducing the data storage requirement without compromising the utility of the output analytics.

Figure 10: Data Reduction (number of records in the sampled data vs. the raw data)

Weekly Distribution. We also examine the databases sampled by SRA and HPA through the weekly distribution of data records. The percentage of Foursquare check-in records on each day of the week is plotted in Figure 11. As is shown, the percentage of Friday, Saturday, and Sunday check-ins is higher in the sampled databases generated by our methods than in the original data set, while the percentage of Monday-Thursday check-ins is lower than the original. Since the majority of the users are occasional users and contribute fewer than l records, our methods preserve their data completely in the sampled databases. We may infer that the occasional users are more likely to use the check-in service on Friday-Sunday. Moreover, the SRA sampled data is consistently closer to the original data distribution, compared to HPA. We can further infer that users are more likely to check in at popular places on Friday-Sunday.

Figure 11: Weekly Distribution with Foursquare Data (percentage of check-ins per day of week: original, SRA, HPA)

Movie Recommendation. An example of context-aware, fine-grained recommendation is to suggest items based on the common interest demonstrated within a user group with similar demographics, such as age and gender. We illustrate the top-10 movie recommendation to male users under the age of 25 with the edge counts released by our solutions on the MovieLens data set. The first column in Table 5 shows the top-10 recommended movies using the original data, while the second and third columns list the movies recommended by our privacy-preserving solutions.

Top Movies            SRA Output                        HPA Output
American Beauty       Phantasm II                       American Beauty
Star Wars VI          Marvin's Room                     Star Wars VI
Star Wars V           All Dogs Go to Heaven             Terminator 2
The Matrix            In the Line of Duty 2             Star Wars V
Star Wars IV          Star Wars V                       Jurassic Park
Terminator 2          The Slumber Party Massacre III    The Matrix
Saving Private Ryan   The Story of Xinghua              Men in Black
Jurassic Park         American Beauty                   The Fugitive
Star Wars I           Shaft                             Braveheart
Braveheart            Star Wars I                       Saving Private Ryan
Table 5: Movie Recommendations to Male Users Under 25

We observe that some movies recommended by SRA may not interest the target audience, such as "Marvin's Room" and "The Story of Xinghua". Furthermore, the top movie on the SRA list, i.e. "Phantasm II", is a horror movie and not suitable for an underage audience. On the other hand, the movies recommended by HPA are quite consistent with the original top-10 except for two movies, i.e. "Men in Black" and "The Fugitive", which may interest the target audience as well.
Men in Back" and The Fugitive", which may interest the target audience as we. We beieve that HPA captures more information by greedy samping and thus can make better recommendations than, especiay when users have very diverse interests. 7. CONCLUSION AND DISCUSSION We have proposed a practica framework for privacy-preserving data anaytics by samping a fixed number of records from each user. We have presented two soutions, i.e. and HPA, which impement the framework with different samping techniques. Our soutions do not require the input data be preprocessed, such as removing users with arge or itte data. The output anaysis resuts are highy accurate for performing top- discovery and contextaware recommendations, cosing the utiity gap between no privacy and existing differentiay private techniques. Our soutions benefit from samping techniques that reduce the individua data contribution to a sma constant factor,, and thus reducing the perturbation error inficted by differentia privacy. We provided anaysis resuts about the optima samping factor with respect to the privacy requirement. We formay proved that both mechanisms satisfy ɛ-differentia privacy. Empirica studies with rea-word data sets confirm that our soutions enabe accurate data anaytics on a 39

Potential future work may include the design of a hybrid approach between SRA and HPA which could have the benefits of both. For real-time applications, we would like to consider how to dynamically sample user generated data, in order to further reduce the data storage requirement. Another direction is to apply the proposed sampling framework to solving more complex data analytical tasks, which might involve multiple, overlapping count queries or other statistical queries.

8. ACKNOWLEDGMENTS

We thank the anonymous reviewers for the detailed and helpful comments on the manuscript.

9. REFERENCES

[1] M. Barbaro and T. Zeller. A face is exposed for AOL searcher no. 4417749. The New York Times, Aug. 2006.
[2] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, 2007.
[3] B. Berjani and T. Strufe. A recommendation system for spots in location-based online social networks. In Proceedings of the 4th Workshop on Social Network Systems, SNS '11, pages 4:1-4:6, New York, NY, USA, 2011. ACM.
[4] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 609-618, New York, 2008. ACM.
[5] T.-H. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. ACM Trans. Inf. Syst. Secur., 14(3):26:1-26:24, Nov. 2011.
[6] K. Chaudhuri and N. Mishra. When random sampling preserves privacy. In Proceedings of the 26th Annual International Conference on Advances in Cryptology, CRYPTO '06, pages 198-213, Berlin, Heidelberg, 2006. Springer-Verlag.
[7] R. Chen, G. Acs, and C. Castelluccia. Differentially private sequential data publication via variable-length n-grams. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS '12, pages 638-649, 2012.
[8] G. Cormode, C. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially private summaries for sparse data. In Proceedings of the 15th International Conference on Database Theory, ICDT '12, pages 299-311, New York, NY, USA, 2012. ACM.
[9] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the Crowd: The privacy bounds of human mobility. Scientific Reports, Mar. 2013.
[10] C. Dwork. Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, Automata, Languages and Programming, volume 4052 of Lecture Notes in Computer Science, pages 1-12. Springer Berlin Heidelberg, 2006.
[11] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: privacy via distributed noise generation. In Proceedings of the 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, EUROCRYPT '06, pages 486-503, Berlin, Heidelberg, 2006. Springer-Verlag.
[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pages 265-284, Heidelberg, 2006. Springer-Verlag.
[13] L. Fan and L. Xiong. An adaptive approach to real-time aggregate monitoring with differential privacy. IEEE Transactions on Knowledge and Data Engineering, 26(9):2094-2106, Sept. 2014.
[14] Y. Hong, J. Vaidya, H. Lu, and M. Wu. Differentially private search log sanitization with optimal output utility. In Proceedings of the 15th International Conference on Extending Database Technology, EDBT '12, pages 50-61, New York, NY, USA, 2012. ACM.
[15] G. Kellaris and S. Papadopoulos. Practical differential privacy via grouping and smoothing. In Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB '13, pages 301-312, 2013.

APPENDIX
A. PROOF OF THEOREM 2
PROOF. For item $V_i$, let $c_i$ denote the true count computed by $q$ from the sample $D_R$. The noisy count $\hat{c}_i$ is then derived by adding Laplace noise to $c_i$ as follows:

$\hat{c}_i = c_i + \nu_i,$   (12)
$\nu_i \sim \mathrm{Laplace}(0,\, l/\epsilon_1).$   (13)

Let $c'_i$ denote the true count of item $V_i$ in the raw data set $D$. The MSE of $\hat{c}_i$ can be rewritten as

$\mathrm{MSE}(\hat{c}_i) = \mathrm{Var}(c_i + \nu_i) + \big(E(c_i + \nu_i - c'_i)\big)^2 = \mathrm{Var}(c_i) + \mathrm{Var}(\nu_i) + \big(E(c_i) - E(c'_i)\big)^2.$   (14)

Note that $c_i$ and $\nu_i$ are mutually independent. Let $p_i$ denote the popularity of item $V_i$, i.e., the probability of any record having $vid = V_i$. For simplicity, we assume that users are mutually independent, records are mutually independent, and every user has $M$ records in the raw data set $D$. To obtain $D_R$, $l$ records out of $M$ are randomly chosen for each user in $D$. Thus, for any item $V_i$, $c_i$ can be represented as the sum of independent random variables:

$c_i = \sum_{k=1}^{n} \sum_{r \in T_k} \delta_{r,i},$   (15)
$\delta_{r,i} = \begin{cases} 1 & \text{if } r.vid = V_i \text{ and } r \in D_R, \\ 0 & \text{otherwise.} \end{cases}$   (16)

The event $\delta_{r,i} = 1$ is equivalent to the event that record $r$ is about $V_i$ and $r$ is sampled into $D_R$ by chance:

$\Pr[\delta_{r,i} = 1] = \Pr[r.vid = V_i \ \&\ r \in D_R] = p_i \cdot \frac{l}{M}.$   (17)

Therefore, we can obtain the following expectation and variance for $c_i$:

$E(c_i) = \sum_{k=1}^{n} \sum_{r \in T_k} E(\delta_{r,i}) = \sum_{k=1}^{n} \sum_{r \in T_k} p_i \frac{l}{M} = n p_i l,$   (18)
$\mathrm{Var}(c_i) = \sum_{k=1}^{n} \sum_{r \in T_k} \mathrm{Var}(\delta_{r,i}) = \sum_{k=1}^{n} \sum_{r \in T_k} p_i \frac{l}{M}\Big(1 - p_i \frac{l}{M}\Big) = n p_i l \Big(1 - \frac{p_i l}{M}\Big).$   (19)

Similarly, we can obtain the expectation of $c'_i$:

$E(c'_i) = n M p_i.$   (20)

From the above results, we can rewrite Equation (14) as follows:

$\mathrm{MSE}(\hat{c}_i) = n p_i l \Big(1 - \frac{p_i l}{M}\Big) + \frac{2 l^2}{\epsilon_1^2} + (n p_i l - n M p_i)^2,$   (21)

and we can apply the standard least-squares method to minimize the MSE. The optimal $l$ value is thus:

$l^{*} = \dfrac{2 n^2 p_i^2 M - n p_i}{4/\epsilon_1^2 - 2 n p_i^2 / M + 2 n^2 p_i^2}.$   (22)

We conclude that the optimal $l$ value is a monotonically increasing function of $\epsilon_1^2$.

B. PROOF OF LEMMA 6
PROOF. By the definition of differential privacy, we are to prove that, for any neighboring raw databases $D_1$ and $D_2$, $A \circ S$ satisfies the following inequality for any $D \in \mathrm{Range}(A \circ S)$:

$\Pr[A \circ S(D_1) = D] \le e^{\epsilon} \Pr[A \circ S(D_2) = D].$   (23)

Without loss of generality, we assume $D_2$ contains one more user than $D_1$. Let $u$ denote the user that is contained in $D_2$ but not in $D_1$, and let $T$ be user $u$'s set of records in $D_2$. By the definition of neighboring databases, we can rewrite $D_2 = D_1 \sqcup T$, where $\sqcup$ denotes the co-product, or disjoint union, of two databases. Let $\hat{D}_1$ denote any possible sampling output of $S(D_1)$. We have:

$\Pr[A \circ S(D_1) = D] = \sum_{\hat{D}_1} \Pr[A \circ S(D_1) = D \mid S(D_1) = \hat{D}_1]\, \Pr[S(D_1) = \hat{D}_1] = \sum_{\hat{D}_1} \Pr[A(\hat{D}_1) = D]\, \Pr[S(D_1) = \hat{D}_1].$   (24)

Let $\hat{T}$ denote any possible sampling output of $S(T)$. We note that $\hat{T}$ can take values from the entire domain, and in general

$\sum_{\hat{T}} \Pr[S(T) = \hat{T}] = 1.$   (25)

Since $S$ is performed independently on each user, we can derive:

$\Pr[S(D_1) = \hat{D}_1] = \sum_{\hat{T}} \Pr[S(D_1) = \hat{D}_1]\, \Pr[S(T) = \hat{T}] = \sum_{\hat{T}} \Pr[S(D_1 \sqcup T) = \hat{D}_1 \sqcup \hat{T}].$   (26)

Note that since $D_1$ and $T$ are disjoint, the sampling outputs on $D_1$ and $T$ are also independent and disjoint. Therefore,

$\Pr[A \circ S(D_1) = D] = \sum_{\hat{D}_1} \Pr[A(\hat{D}_1) = D] \sum_{\hat{T}} \Pr[S(D_1 \sqcup T) = \hat{D}_1 \sqcup \hat{T}]$
$= \sum_{\hat{D}_1, \hat{T}} \Pr[A(\hat{D}_1) = D]\, \Pr[S(D_1 \sqcup T) = \hat{D}_1 \sqcup \hat{T}]$
$\le \sum_{\hat{D}_1, \hat{T}} e^{\epsilon}\, \Pr[A(\hat{D}_1 \sqcup \hat{T}) = D]\, \Pr[S(D_1 \sqcup T) = \hat{D}_1 \sqcup \hat{T}]$   (27)
$= e^{\epsilon} \sum_{\hat{D}_2} \Pr[A(\hat{D}_2) = D]\, \Pr[S(D_2) = \hat{D}_2]$   (28)
$= e^{\epsilon}\, \Pr[A \circ S(D_2) = D].$   (29)

Line (27) is due to the fact that $A$ is $\epsilon$-differentially private and $\hat{D}_1$ and $\hat{D}_1 \sqcup \hat{T}$ are neighboring databases. In line (28) we change notation and let $\hat{D}_2$ represent $\hat{D}_1 \sqcup \hat{T}$. The proof is hence complete.
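As a sanity check on the reconstructed expressions above, the short script below evaluates $\mathrm{MSE}(\hat{c}_i)$ from Equation (21) on a grid of $l$ values and compares the grid minimizer with the closed form in Equation (22), treating $l$ as continuous. The parameter values for n, M, p_i, and ɛ₁ are invented purely for illustration; in practice l must also be rounded to a positive integer no larger than M.

import numpy as np

# Invented parameters for illustration only.
n, M, p_i, eps1 = 100, 20, 0.05, 0.1

def mse(l):
    # MSE(c_hat_i) = n*p_i*l*(1 - p_i*l/M) + 2*l^2/eps1^2 + (n*p_i*l - n*M*p_i)^2
    return (n * p_i * l * (1 - p_i * l / M)
            + 2 * l ** 2 / eps1 ** 2
            + (n * p_i * l - n * M * p_i) ** 2)

# Closed-form minimizer from Equation (22), with l treated as continuous.
l_star = (2 * n ** 2 * p_i ** 2 * M - n * p_i) / (
    4 / eps1 ** 2 - 2 * n * p_i ** 2 / M + 2 * n ** 2 * p_i ** 2)

# Numerical check: dense grid search over l in (0, M].
grid = np.linspace(0.01, M, 200_000)
l_grid = grid[np.argmin(mse(grid))]
print(f"closed-form l* = {l_star:.3f}, grid minimizer = {l_grid:.3f}")

With these (made-up) parameters the two values agree, and increasing eps1 moves the minimizer toward larger l, consistent with the monotonicity remark at the end of the proof of Theorem 2.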