A Case Study on User Safety and Privacy Protection

Transcription

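National Chung Cheng University (國立中正大學), Department of Communications Engineering, Master's Program
Master's Thesis
A Novel Time-Obfuscated Algorithm for Trajectory Privacy
Advisor: Dr. 黃仁竑
Graduate student: 鐘浩維
July 2012 (Republic of China year 101)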

ABSTRACT

Location-based services (LBSs), which bring great convenience to daily life, have been intensively studied in recent years. In general, LBS query processing falls into two categories, snapshot queries and continuous queries, both of which search on user location information and return the results to the user. An LBS has full control over this location information, which raises a user privacy concern. A malicious LBS that tracks users' routes to their destinations in order to infer private information poses a serious problem. Most existing techniques address privacy protection mainly for snapshot queries. However, privacy protection for continuous queries is more important and more challenging, since a malicious LBS can easily obtain complete private information about a user by observing a sequence of successive query requests. In this thesis, we propose a comprehensive trajectory privacy technique that takes ambient conditions into account and cloaks location information according to the user privacy profile, preventing a malicious LBS from reconstructing a user trajectory. We first propose an r-anonymity concept, which preprocesses a set R of similar trajectories to blur the actual trajectory of a user. We then combine k-anonymity with s road segments to protect user privacy. Finally, we introduce a novel time-obfuscated technique, which breaks the sequence of query issuing times: queries are sent at random from a set of locations on the different trajectories in R, so that the LBS cannot learn the user trajectory. Despite the randomness introduced by the obfuscation process to provide strong trajectory privacy protection, the experimental results show that our technique maintains the correctness of the query results at a competitive computational cost.

Directory
Chapter 1 INTRODUCTION
Chapter 2 RELATED WORK
Chapter 3 SYSTEM OVERVIEW
    Definitions
    Privacy Factors
Chapter 4 PRIVACY PROTECTION ALGORITHM
    The r-anonymity Algorithm
    The time-obfuscated Algorithm
Chapter 5 EXPERIMENTS
    The r-anonymity Paradigm
    The k-anonymity and s-segment Paradigms
    Average Query Processing Time
    Number of Duplicate Queries
    Run Test
Chapter 6 CONCLUSIONS and FUTURE WORK
REFERENCES

Chapter 1 : INTRODUCTION

With the development of GPS equipment, wireless communication technologies, and personal mobile devices, location-based services (LBSs) provide location-aware services to users based on the locations obtained from their smartphones. For example, foursquare [1] is a location-based social network service that lets users find their friends, share information with each other, and check in at places. ShopAlerts [2], an advertising application launched by AT&T, is another example. ShopAlerts is the first location-based advertising service in the United States; it offers consumers promotional information and coupons that they can download to their mobile phones when they are near a specified geographic location. OnStar [3] is a comprehensive LBS system that provides many services, such as emergency assistance, navigation, stolen-vehicle tracking, and POI search.

LBSs collect a user's location and personal information (e.g., user id) from the query message to retrieve the corresponding answers. One major query type in location-based services is the continuous query, which consists of a sequence of point queries along the user trajectory. Since users do not know how LBSs handle their information, this poses a serious threat to users and raises a privacy concern. For example, a malicious LBS may track continuous queries to recover users' daily trajectories, from which it may infer their lifestyle, home and work addresses, or even which clinics they have visited.

To protect the trajectory privacy of a user, prior research has proposed several ways to blur the actual user trajectory. In [4], Xu et al. cloak the footprints on a user trajectory with (k-1) different footprints from historical trajectories to satisfy the k-anonymity paradigm. In [5], Kido et al. use user trajectories to generate dummy trajectories that cloak the actual user trajectory.
In [6], to protect a user's identity from being disclosed by adversaries during a continuous query, Chow et al. use a k-sharing region at every query location that contains the same k users, preventing a malicious LBS from recovering the actual user trajectory.

Unfortunately, existing techniques for protecting trajectory privacy have some weaknesses. Xu and Cai [4] adopt historical trajectories to satisfy the trajectory-based k-anonymity paradigm. However, when no users are present on an anonymous trajectory, a malicious LBS can easily guess which queries were sent by the user and track the trajectory to identify the user. Another issue is that current techniques do not consider the sequence of query issuing times in a continuous query: the point queries issued from locations on the user trajectory are received by the LBS in order, beginning at the user's starting point and ending at the user's destination. The location of each point query is transformed into a cloaked region to protect the user's identity. However, a malicious LBS can still easily track the queries and eventually recover a likely user trajectory. For example, in both cases of Figure 1, R1 contains the POIs of a park and a shop, and R2 contains a company and a mall. Each bold arrow line represents the most probable trajectory of a user. An LBS can calculate the probability of every possible path from the POIs in R1 to those in R2 based on the user's velocity (i.e., direction and speed) and guess a likely user path. In this example, the probabilities of the four paths are 0.45, 0.2, 0.25, and 0.1, so a malicious LBS can easily infer that the bold arrow line is the most probable path of the user.

Figure 1: Examples of the most possible path for a user.

In this thesis, we explore four privacy factors to solve the above-mentioned problems. First, for r-anonymity, we propose a preprocessing technique. When a user starts to plan a travel route (i.e., a user trajectory), the trusted server obtains (r-1) historical trajectories similar to the user trajectory from a database that stores all historical trajectories. Even if an LBS monitors every query issued by the user, it is then less likely to infer the actual user trajectory, because in the long term the user trajectory is indistinguishable from the other (r-1) trajectories. Second, k-anonymity follows the traditional k-anonymity paradigm, in which each query location must contain k users including the querying user, so that for each point query the user is indistinguishable from the other (k-1) users. However, k-anonymity has a serious problem in high-density areas, where the region computed by the k-anonymity technique tends to be very small and it is therefore easy to pinpoint the user. Consequently, since we use road network data, the third factor is the s-segment paradigm, which requires that each query region contain not only k users but also s road segments. Finally, the fourth factor is time obfuscation.
When a user moves toward his/her destination, the trusted anonymity server breaks the sequence of query issuing times by sending queries at random from a set of locations on the trajectories in R, so that the LBS cannot learn the user trajectory. For example, in Figure 2, we first partition the map into grid cells, use a bold line to enclose the cells covered by the r trajectories, and use an arrow line to show the user's travel route and direction. At T, T, T and T, the server uses the cells (39,73,97,131,220), (37,42,99,162,220), (70,72,97,160,221), and (41,67,99,192,250), respectively, and encloses them in the queries. As time proceeds, the query locations are selected at random both from the (r-1) trajectories and from the user trajectory. To increase cache usability, the system randomly selects query locations on the user trajectory and caches the results in the server. When processing T, T, and T, the server randomly selects locations not only on the (r-1) trajectories but also on the user trajectory, which increases both the privacy level and cache usability. In the case of T, the server uses the past location at cell 39, which has already been queried, to mislead the LBS about the user's actual location. Furthermore, at T, although the user has not issued a query at T, the system continues the r-trajectory obfuscation process to confuse the LBS.

Figure 2: Time obfuscation schematic diagram.

The main contributions of this work are as follows:
1. We introduce r-anonymity, a preprocessing technique that blurs the actual user trajectory with (r-1) historical trajectories, preventing an LBS from reconstructing it by monitoring every query the user issues over the long term.
2. We adopt k-anonymity together with the s-segment technique at every query location to avoid generating a small cloaked region. Furthermore, we consider real ambient conditions to increase the privacy level.
3. We propose a time-obfuscated approach that randomizes the sequence of query issuing times by sending queries at random from a set of locations on the trajectories in R.
4. We conduct a series of experimental evaluations to verify our privacy model. The experimental results show that the proposed model guarantees a high level of privacy.

The rest of this thesis is organized as follows. Chapter 2 highlights related work. Chapter 3 gives an overview of our system architecture and parameters. Chapter 4 presents our trajectory privacy techniques in detail. Chapter 5 reports our experimental results, and Chapter 6 concludes and discusses future work.

Chapter 2 : RELATED WORK

Previous techniques for location privacy mainly focus on snapshot queries. The most popular approach uses k-anonymity to transform a point location into a cloaked region in which the user is indistinguishable from (k-1) other users [7,8,9,10,11,12]. To address the weakness of k-anonymity, [16] and [17] take real-world conditions into account to avoid cloaking a location into a region so small that the actual user location can be easily identified.

For continuous queries, Xu et al. [4] use (k-1) historical trajectories to satisfy the trajectory-based k-anonymity paradigm. Since a trajectory is decomposed into a series of footprints, their KAT technique cloaks the footprints on the k trajectories, including the user trajectory, with the footprints kept in chronological order. However, when the user moves along the cloaked path, the LBS can still easily identify the user's actual location if no other user is on that path. Although the user trajectory is under k-anonymity protection, this work does not consider ambient conditions, which raises privacy concerns.

Another technique uses dummy trajectories to protect user location privacy. In [5], Kido et al. propose two methods, moving in a neighborhood and moving in a limited neighborhood, to generate more plausible dummy locations, placing the user's trace under k-anonymity protection. In [13], Lei et al. rotate the user trajectory, subject to a distance deviation constraint, to make the user trajectory indistinguishable from the dummy trajectories. Furthermore, a k-intersected technique was proposed to decrease the probability of disclosing the actual user trajectory. The major issue with these dummy techniques is that they do not consider whether the generated dummy locations or trajectories are realistic.
If a dummy location is generated at an implausible place, for example on a river or in the mountains, the LBS can easily filter it out, making the dummy location useless.

In [6], Chow and Mokbel propose the k-sharing region method. They argue that each query location of a continuous query must contain the same (k-1) other users, so that the LBS cannot identify the user location by observing the entire continuous query. The major issue with a k-sharing region is that, as time goes on, users who started in the same cloaked area may move in different directions. As a result, the anonymity server may generate a very large cloaked area, which degrades performance.

In [14], Meyerowitz and Choudhury design a prediction engine that predicts possible future paths from a user's past trajectories. Along the predicted path, a trusted server sends queries to an LBS and caches the results associated with the query locations. When the user approaches such a location, the system retrieves the cached results for the user. However, in [14] a malicious LBS can easily identify the actual user trajectory when no other user is on the predicted path.

Prior work does not take the time factor into consideration. As a result, when continuous queries are issued in time sequence, an LBS can track all issued queries to speculate on possible trajectories. In this thesis, we address this privacy concern and propose a new technique to solve the above-mentioned problems.

Chapter 3 : SYSTEM OVERVIEW

Figure 3 shows the system architecture. Mobile users communicate with a trusted anonymity server over an encrypted tunnel. The trusted anonymity server acts as a secure intermediary for mobile users, relaying their messages to semi-honest LBSs. The anonymity server collects user trajectory information and stores it in a database that contains the historical trajectories of all users. Before a user starts to travel, the trusted anonymity server uses the user's current location and destination to select the trajectory with the shortest path from the trajectory database as the predicted trajectory; if no trajectory is found in the database, a route planning algorithm based on Dijkstra's algorithm generates the predicted trajectory. Based on the predicted trajectory, the trusted server then obtains (r-1) further trajectories according to the r-anonymity policy.

While the user travels along his/her trajectory, the anonymity server transforms each query location of the user into a cloaked region, replaces the user id in the query message with a pseudo id, and uses the time-obfuscated technique to send queries in a random order to confuse the LBS. These random queries include point queries that the server issues before the user arrives at the corresponding locations. When the LBS returns the query results, the trusted server filters them, forwards the relevant results to the user, and caches results for future use if the user has not yet reached the corresponding query location. However, cached results may no longer be valid by the time the user arrives, so we set a time-out for every cached result. Furthermore, as the user travels to his/her destination, he/she may take a path different from the predicted trajectory.
In such a case, thanks to the r-anonymity policy, the server may still hold cached results that are useful to the user. In addition, the trusted server updates the predicted trajectory according to the user's current location, either by selecting a trajectory from the trajectory database or by running the route planning algorithm. These techniques allow the anonymity server to provide not only privacy protection but also correct query results.
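The caching behaviour described above (store results fetched ahead of the user, discard them after a time-out) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the class name `ResultCache`, its methods, and the use of plain numeric timestamps in seconds are all assumptions made for the example.

```python
class ResultCache:
    """Hypothetical cache of LBS results keyed by grid cell, with a fixed time-out."""

    def __init__(self, timeout_s):
        self._timeout = timeout_s
        self._entries = {}  # cell -> (result, expiry time)

    def put(self, cell, result, now):
        # Store a result and stamp it with its expiry time.
        self._entries[cell] = (result, now + self._timeout)

    def get(self, cell, now):
        # Return the cached result, or None if it is missing or has timed out.
        entry = self._entries.get(cell)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]

    def purge(self, now):
        # Drop timed-out entries and return their cells so they can be re-queried.
        expired = [c for c, (_, expiry) in self._entries.items() if expiry <= now]
        for c in expired:
            del self._entries[c]
        return expired
```

A server built along these lines would call `purge` at each query time and re-issue queries for the returned cells, which mirrors the re-query behaviour described for expired results.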

Figure 3: System architecture.

3.1 Definitions

The definitions and symbols used throughout the following sections are as follows.

DEFINITION 1. A Trajectory Set R: $R = \{T_1, T_2, \ldots, T_r\}$, where each $T_i$, $1 \le i \le r$, is a trajectory composed of a set of footprints (i.e., longitude-latitude positions tracked by a GPS device as the user moves toward the destination).

First, we define a set of trajectories R, which consists of the user trajectory $T_u$ and (r-1) historical trajectories obtained from a database.

DEFINITION 2. A Trajectory $T_i$: $T_i = \{loc^i_{t_1}, loc^i_{t_2}, \ldots, loc^i_{t_n}\}$, where $loc^i_{t_j}$ is a footprint on trajectory $T_i$ at time $t_j$.

Second, a trajectory consists of a set of footprints, which are the positions tracked by a GPS device. To achieve time obfuscation, the query processor randomly selects a $loc^i_{t_j}$ on $T_i$ to issue a point query at a random time $t_j$. In other words, the query processor does not issue queries periodically from sequential locations.

DEFINITION 3. A Query Message $m_i$: $m_i = (u_{id}, loc^i_{t_j}, k, s, resulttimeout, C)$, where $u_{id}$ is the id of the querying user, $loc^i_{t_j}$ is a query location on trajectory $T_i$ at query issuing time $t_j$, (k, s) is the parameter pair defined in the user privacy profile, containing the k-anonymity and s-segment values, resulttimeout is the time-out value for a valid query result, which the user can set according to the query type, and C is the content of the query message.

Third, we define a query message as $m_i$, where i indicates a trajectory id in the trajectory set R. Before sending a query message $m_i$ to an LBS, the trusted anonymity server transforms $m_i$ into an obfuscated query message $m_i'$ by replacing $u_{id}$ with a pseudo id and $loc^w_{t_j}$ with $CloRange^w_{t_j}$.

DEFINITION 4. A Continuous Message Set M: $M = \{m_1, m_2, \ldots, m_z\}$, where z is the total number of queries received by the LBS.

Finally, we define the continuous message set M received by an LBS from the query positions on the r trajectories. Each query message $m_i$ is issued from a query position (a footprint) on $T_i$, which is itself randomly selected from the trajectory set R.

3.2 Privacy Factors

We explore four privacy factors in this thesis: the r-anonymity, k-anonymity, s-segment, and time-obfuscated techniques. They are introduced below.

The r-anonymity Paradigm

When a user starts to communicate with an LBS, the r parameter is set on the trusted anonymity server. Next, the trusted server searches the historical trajectories in a database for the (r-1) trajectories nearest and most similar to the user trajectory $T_u$, blurring the actual user trajectory into r trajectories and thereby decreasing the probability that $T_u$ can be linked to the user in the long term. In addition, the trusted server can issue queries at random from query locations on the trajectories in R, which prevents an LBS from learning $T_u$ by assembling the query locations issued by user u.
The k-anonymity Paradigm

The trusted anonymity server follows the user privacy profile (k value) to generate a cloaked area that contains k users. Even if a malicious LBS infers a user location, this technique keeps the user's identity indistinguishable from the other (k-1) users.

The s-segment Paradigm

This factor addresses the potential weakness of k-anonymity. When the user is in a high-density area, searching for (k-1) other users under the k-anonymity principle often yields a very small cloaked region. As a result, the user's actual location is easily exposed even though the region contains k users. We therefore combine the s-segment and k-anonymity paradigms, which produces a cloaked region bounded by real-world conditions. In summary, the trusted anonymity server searches for k users within a cloaked region that must also contain s road segments.

The time-obfuscated Technique

When a user starts to travel, the trusted anonymity server starts to send queries to the LBS at random. Unlike previous privacy techniques, our server blurs the sequence of query issuing times and randomly chooses a query location from one of the trajectories in R. For example, when user u is located at query position $loc^u_{t_1}$, the anonymity server may issue a query from position $loc^u_{t_9}$, or from $loc^r_{t_5}$ on a trajectory $T_r \in R$, to keep the LBS from learning the current position of the user. This technique provides strong privacy protection, preventing an LBS from discovering the user's starting location, destination, and direction of travel. Furthermore, the trusted server does not send queries in chronological order. We randomize each query issuing time to reduce the possibility of reconstructing the user trajectory by putting all query locations together. When it receives a query from the anonymity server, the LBS cannot distinguish a query sent on behalf of the user from a query generated by the anonymity server from one of the trajectories in R other than $T_u$. By combining these two methods, we maximize the obfuscation level for a continuous query and, at the same time, increase cache usability.
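To make the query message of Definition 3 and its obfuscation concrete, the following sketch models the message as an immutable record and shows the replacement of the user id and location. It is only an illustration under stated assumptions: the field types, the representation of a location or cloaked range as grid cells, and the use of a random UUID as the pseudo id are choices made here, not details given in the thesis.

```python
import uuid
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple, Union

Cell = Tuple[int, int]  # assumed (row, column) grid cell


@dataclass(frozen=True)
class QueryMessage:
    u_id: str                            # user id, or pseudo id after obfuscation
    loc: Union[Cell, FrozenSet[Cell]]    # query cell, or cloaked range after obfuscation
    k: int                               # k-anonymity value from the privacy profile
    s: int                               # s-segment value from the privacy profile
    result_timeout: float                # time-out of a valid result, in seconds (assumed unit)
    content: str                         # C, the content of the query


def obfuscate(msg, cloaked_range):
    """Return a copy of msg with a pseudo id and the cloaked range as location."""
    # Assumption: a fresh random UUID serves as the pseudo id.
    return replace(msg, u_id=uuid.uuid4().hex, loc=frozenset(cloaked_range))
```

Keeping the record immutable means the original message stays available to the anonymity server for filtering results, while only the obfuscated copy is sent out.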

Chapter 4 : PRIVACY PROTECTION ALGORITHMS

The main structure of our privacy technique is shown in Figure 4. We divide the structure into three parts. First, in the CellMap component, we partition the map into grid cells of fixed length and width and store the number of road segments in each cell. After the map is partitioned, our privacy algorithm runs on top of this grid index, and query location information is retrieved through it. Second, r-anonymity is described in Section 4.1. If fewer than (r-1) trajectories are found, the anonymity server uses a virtual route planning system (e.g., Google Maps APIs) to generate the remaining trajectories needed to satisfy the user privacy profile. Finally, the trusted anonymity server uses the trajectories in R as input to the time-obfuscated algorithm, which is described in Section 4.2.

Figure 4: Main structure.

4.1 The r-anonymity Algorithm

The pseudo code of the r-anonymity paradigm is shown in Algorithm 1. The inputs are the r value from the user privacy profile, the set of historical trajectories, and the user trajectory $T_u$. We use three arrays to represent trajectory data: (1) a user list of the users registered in the system, (2) a user trajectory list, and (3) a cell-based footprint list. The UserTrajectory structure is also represented as an array. The output of the algorithm is a two-dimensional array (termed the r-trajectory array) that stores the r trajectories; the first dimension stores the trajectory index and the second stores the footprint list. Finally, we return the total number of trajectories found, so that the server can decide whether to invoke the virtual route planning mechanism to add more virtual trajectories.

At the start of the r-anonymity process, we assign the user trajectory $T_u$ to the r-trajectory array and compute its length to bound the search space for the r trajectories. Then the algorithm searches the database D for historical trajectories in the following steps.

First, in Lines 6-10, we check the length. Historical trajectories shorter than (UserLength * MIN_LENGTH) or longer than (UserLength * MAX_LENGTH) do not help to blur the user trajectory. For example, if a trajectory has too few or too many footprints, an adversary may link the queries and tell the r trajectories apart, since an extremely short or extremely long trajectory is unlikely to be travelled by a user in reality and can therefore be easily identified as synthetic. Thus, our approach aims to obtain (r-1) trajectories whose lengths are similar to that of the user trajectory.

Second, in Lines 11-14, we check for overlapping trajectories. After filtering out the trajectories whose length is unreasonable with respect to the user trajectory, we compare each candidate with $T_u$ and with the trajectories already assigned to the r-trajectory array to avoid duplicates. For example, if n trajectories in the r-trajectory array turn out to be duplicate paths, the system sends queries from locations on these duplicate trajectories. In such a case, the adversary may link these queries and reconstruct all possible trajectories, and will end up with only (r-n) distinct trajectories, which violates the user privacy profile. Therefore, we bound the overlap percentage (MAX_OVERLAP) that a candidate may have when generating the r trajectories.

Third, in Line 15, we compute the distance variance to find the trajectory nearest to the user trajectory $T_u$. This keeps the final (r-1) trajectories close to the user trajectory, so that the r trajectories occupy fewer cells.
Hence, the maintenance overhead of the grid index is reduced.

4.2 The time-obfuscated Algorithm

The pseudo code of the time-obfuscated algorithm is shown in Algorithm 2. The inputs are the two user privacy profile values k and s, the user trajectory $T_u$, the r trajectories, and the time-out for cached results. When a user starts to travel toward the destination, the anonymity server adopts the time-obfuscated technique to send queries at random times, including times at which the user does not issue a query. This continues until the user reaches the destination. At every random query time, the server first checks in Line 3 whether any cached results have timed out. Since the anonymity server randomly selects query locations on the r trajectories, every result returned from the LBS is associated with an expiration time. Once a result expires, it is useless to the user; we remove such results and re-issue queries from those locations as the system proceeds. Lines 4-14 check the query time to determine whether to issue a query for the user. If the system decides to send a query for the user, it selects a user footprint and randomly chooses a query index. At every query issuing time we send more than one query to the LBS, which makes it harder for the LBS to identify the user. If the system decides not to send a query for the user, the algorithm randomly decides whether to use a user footprint as a query location anyway.

Algorithm 1 r-anonymity
Input: ( r, D, UserTrajectory )
Output: ( r-trajectory, NumberOfTrajectory )
 1: Assign UserTrajectory to r-trajectory
 2: UserLength = LengthOfTrajectory( UserTrajectory )
 3: for ( i=0 TO r ) do
 4:   for ( j=0 TO totaluser ) do
 5:     for ( k=0 TO D ) do
 6:       TrajectoryLength = LengthOfTrajectory( D.T k )
 7:       /* The trajectory must have a proper length to avoid retrieving
 8:          few footprints on a trajectory. */
 9:       if ( TrajectoryLength > ( UserLength * MIN_LENGTH ) and
10:            TrajectoryLength < ( UserLength * MAX_LENGTH ) ) then
11:         Overlap = ComputeOverlap( r-trajectory, D.T k )
12:         /* If a historical trajectory overlaps too much with the user
13:            trajectory, it reduces the blurring effect. */
14:         if ( Overlap < MAX_OVERLAP ) then
15:           distvariance = DistVariance( UserTrajectory, D.T k )
16:         end if
17:       end if
18:     end for
19:   end for
20:   /* Find the smallest distance variance for obtaining the r trajectories. */
21:   r-trajectory = FindMinRVar( distvariance, D.T k )
22: end for
23: return NumberOfTrajectory
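As a rough, runnable reading of Algorithm 1, the sketch below selects up to (r-1) historical trajectories by applying the length filter, the overlap filter, and a nearest-first choice. Several parts are assumptions rather than the thesis method: trajectories are lists of (row, column) cells, the threshold values are placeholders, overlap is measured as the shared fraction of a candidate's cells, and the mean squared distance between index-aligned footprints stands in for the distance variance, whose exact definition is not given in the text.

```python
MIN_LENGTH = 0.5   # placeholder ratio, not a value from the thesis
MAX_LENGTH = 2.0   # placeholder ratio, not a value from the thesis
MAX_OVERLAP = 0.5  # placeholder bound, not a value from the thesis


def overlap_ratio(candidate, selected):
    """Largest fraction of the candidate's cells shared with any selected trajectory."""
    cells = set(candidate)
    return max(len(cells & set(t)) / len(cells) for t in selected)


def mean_sq_distance(user, candidate):
    """Assumed stand-in for the distance variance: mean squared footprint distance."""
    n = min(len(user), len(candidate))
    return sum((user[i][0] - candidate[i][0]) ** 2 +
               (user[i][1] - candidate[i][1]) ** 2 for i in range(n)) / n


def r_anonymity(r, database, user):
    """Greedily pick the nearest admissible trajectories until r are collected."""
    selected = [user]
    pool = list(database)
    while len(selected) < r:
        best = None
        for cand in pool:
            # Length check (Lines 6-10 of Algorithm 1).
            if not (len(user) * MIN_LENGTH < len(cand) < len(user) * MAX_LENGTH):
                continue
            # Overlap check against trajectories already chosen (Lines 11-14).
            if overlap_ratio(cand, selected) >= MAX_OVERLAP:
                continue
            # Nearest-trajectory score (Line 15).
            score = mean_sq_distance(user, cand)
            if best is None or score < best[0]:
                best = (score, cand)
        if best is None:
            break  # fewer than r found; the caller may fall back to route planning
        selected.append(best[1])
        pool.remove(best[1])
    return selected, len(selected)
```

Returning the count alongside the trajectories follows the pseudocode's NumberOfTrajectory output, which lets the server decide whether virtual trajectories are still needed.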

We purposely add a random mechanism that may choose a user footprint as a query location in order to increase cache usability. Even if the user has not yet reached a location $loc^u_{t_j}$, the anonymity server may still use $loc^u_{t_j}$ as a query location and send the query in advance. When the user arrives at $loc^u_{t_j}$, the anonymity server reuses the cached results to answer the user's query, so the total number of queries is reduced. This improves server efficiency and enhances the user privacy level.

Next, the anonymity server starts to send queries and checks the index of every query. If a query belongs to the user, the server assigns the location on the user trajectory as the query location; otherwise, it randomly chooses a location from the r trajectories. After that, the anonymity server uses the query location to search for (k-1) users in Line 23. Because a location is represented by a cell, the server only needs to count the users within that cell. If the cell does not contain (k-1) users, the anonymity server searches the neighboring cells until the k-anonymity policy is satisfied. Figure 5 shows an example of the search process. In this example, the query is issued from cell 100. Since this cell contains fewer than k users, the server sorts the neighboring cells by number of users in descending order and processes them starting with the most populated cell until k users have been gathered, as shown in Figure 5(a). If the adjacent cells together contain fewer than k users, the server expands the search range outward to more cells until the number of users reaches k, as shown in Figure 5(b). After satisfying the k-anonymity policy, the anonymity server forms these cells into a range and checks whether this range contains s road segments.
If not, in Line 31 it searches the neighboring cells in the same way as for the k users until the range contains s segments. This process is shown in Figure 6. Once the user privacy profile is satisfied, the server uses this cloaked range as the query location and replaces the user id with a pseudo id before sending the query to the LBS. When the anonymity server receives the results from the LBS, it caches them and assigns these query cells a time-out stamp in Line 35 to keep the cached search results up to date.
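The neighbor search for k users and s road segments described above can be sketched as a ring-by-ring expansion over the grid. This is an illustrative reading, not the thesis code: cells are assumed to be (row, column) pairs, the per-cell user and segment counts are plain dictionaries, the user count of the query cell is assumed to include the requester, empty cells are skipped, and `max_radius` is a hypothetical safety bound.

```python
def ring(cell, radius):
    """Cells whose Chebyshev distance from `cell` is exactly `radius`."""
    r0, c0 = cell
    return [(r0 + dr, c0 + dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if max(abs(dr), abs(dc)) == radius]


def grow(region, start, counts, target, max_radius):
    """Add neighboring cells, fullest first, until `region` holds `target` items."""
    total = sum(counts.get(c, 0) for c in region)
    radius = 1
    while total < target and radius <= max_radius:
        candidates = [c for c in ring(start, radius)
                      if c not in region and counts.get(c, 0) > 0]
        candidates.sort(key=lambda c: counts[c], reverse=True)
        for c in candidates:
            if total >= target:
                break
            region.add(c)
            total += counts[c]
        radius += 1  # nothing more at this distance: expand outward
    return total >= target


def cloak(cell, users, segments, k, s, max_radius=5):
    """Return (cloaked cells, whether both the k and s requirements were met)."""
    region = {cell}
    ok_k = grow(region, cell, users, k, max_radius)
    ok_s = grow(region, cell, segments, s, max_radius)
    return region, ok_k and ok_s
```

The same `grow` routine serves both requirements, matching the text's remark that the segment search works the same way as the user search.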

Figure 5: Process of searching k people. (a) Searching for first time. (b) Searching for second time.

Figure 6: Process of searching s road segments. (a) Searching for first time. (b) Searching for second time.

Algorithm 2 The time-obfuscated algorithm
Input: ( k, s, UserTrajectory, r-trajectory, resulttimeout )
 1: issuetime = CurTime + random( MAX_QUERY_TIME )
 2: while ( CurTime == issuetime ) do
 3:   RemoveTimeoutQuery( resulttimeout )
 4:   if ( SearchQueryLoc( UserTrajectory ) returns true ) then
 5:     /* Randomly assign a query index within MAX_QUERY_NUM. */
 6:     queryindex = RanQueryIndex( MAX_QUERY_NUM )
 7:     UserQueryCell = UserTrajectory
 8:   else
 9:     isquery = RanDecideUserQuery( UserTrajectory )
10:     if ( isquery == true ) then
11:       queryindex = RanQueryIndex( MAX_QUERY_NUM )
12:       UserQueryCell = RanChooseLoc( UserTrajectory )
13:     end if
14:   end if
15:   /* Sending the queries */
16:   for ( j = 0 TO MAX_QUERY_NUM ) do
17:     if ( queryindex == j ) then
18:       QueryCell = UserQueryCell
19:     else
20:       QueryCell = RanChooseCell( r-trajectory )
21:     end if
22:     /* Counting the number of users located at a QueryCell location. */
23:     usernum = ComputeUser( QueryCell )
24:     if ( usernum < k - 1 ) then
25:       /* Searching on the neighbor cells until the number of users reaches k. */
26:       CloakRange = SearchNeighborForK( QueryCell )
27:     end if
28:     /* Checking the number of road segments in the query range. */
29:     Seg = CheckSegmentNum( CloakRange )
30:     if ( Seg < s ) then
31:       SearchNeighborForS( CloakRange )
32:     end if
33:     SendQuery( CloakRange )
34:     /* Assigning a time-out time to the cloaked region. */
35:     AssignTimeout( CloakRange, ResultLiveTime )
36:   end for
37:   Increment CurTime
38:   issuetime = CurTime + random( MAX_QUERY_TIME )
39: end while
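Lines 4-21 of Algorithm 2 decide which cells are queried in one round. The sketch below is a simplified, assumption-laden reading of that step: it takes the user's cell (or None when no user footprint is used in this round), hides it at a random slot among otherwise random cells drawn from the r trajectories, and leaves cloaking, sending, and caching to other components. The function name and the explicit `rng` parameter are introduced here for illustration.

```python
import random


def build_query_round(user_cell, r_trajectories, max_query_num, rng):
    """Choose the cells queried in one round; at most one slot carries the user's cell."""
    # Pick the slot that carries the user's cell, if any (the query index).
    user_slot = rng.randrange(max_query_num) if user_cell is not None else None
    cells = []
    for j in range(max_query_num):
        if j == user_slot:
            cells.append(user_cell)
        else:
            # Random footprint from a random trajectory in R.
            trajectory = rng.choice(r_trajectories)
            cells.append(rng.choice(trajectory))
    return cells, user_slot
```

For example, `build_query_round((9, 9), [[(0, 0), (0, 1)], [(1, 0)]], 4, random.Random())` yields four cells, one of which is the user's. Passing the random generator explicitly makes the sketch easy to test with a fixed seed.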

Chapter 5 : EXPERIMENTS

We use a real dataset downloaded from CRAWDAD [15] as our simulation data. It contains mobility traces of taxi cabs in San Francisco, USA: GPS coordinates of approximately 500 taxis collected over 30 days in the San Francisco Bay Area, of which we use 200 taxis. Since some taxis travel outside the San Francisco Bay Area, we filter out GPS coordinates beyond the bounding box defined by the four latitude-longitude points (37.81, ), (37.81, ), (37.7, ), (37.7, ), as shown in Figure 7.

Figure 7: Filter area.

As listed in Table 1, we extract 8560 road segments from Google Maps and partition the map using five cell sizes, with the cell side length set to 100m, 200m, 300m, 400m, and 500m. Therefore we obtain cells for a 100m x 100m grid, 4819 cells for a 200m x 200m grid, 2173 cells for a 300m x 300m grid, 1240 cells for a 400m x 400m grid, and 660 cells for a 500m x 500m grid. We then set the user privacy profile as follows: the r value is set to 5, 10, 15, and 20; the k value is set to 5, 10, 15, and 20; and the s value is set to 10, 20, and 30. We run our simulations with various parameter combinations, and each simulation involves ten different users. The time-out of each cached result is set to 1, 2, 3, and 4 minutes. As described in Section 4.2, the server removes cached results once they expire. Finally, the number of POIs within each cell is randomly generated within a range of [0..n], where n depends on the cell size. Specifically, the pairs of n and cell size are (2, 100m x 100m), (8, 200m x 200m), (18, 300m x 300m), (32, 400m x 400m), and (50, 500m x 500m).
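The grid partition used in these experiments requires mapping each GPS coordinate to a cell. The thesis does not state how this conversion is done, so the sketch below is only one plausible approach: it assumes the grid origin is the south-west corner of the bounding box, uses the common approximation of about 111,320 m per degree of latitude, and scales longitude by the cosine of the origin latitude. The function name and all parameters are hypothetical.

```python
from math import cos, radians

M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude


def to_cell(lat, lon, lat_min, lon_min, cell_m):
    """Map a GPS point to a (row, column) cell, or None if south or west of the origin."""
    # Convert the offsets from the origin into approximate metres.
    dy = (lat - lat_min) * M_PER_DEG_LAT
    dx = (lon - lon_min) * M_PER_DEG_LAT * cos(radians(lat_min))
    if dx < 0 or dy < 0:
        return None
    return (int(dy // cell_m), int(dx // cell_m))
```

A full filter would also check the northern and eastern edges of the bounding box; that check is left out here because the box's longitudes are not available in the text.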

Table 1: Experimental parameter setting.

We implement two simple privacy techniques as baseline algorithms for performance comparison. The first is the (k+s)-anonymity technique, which combines the k-anonymity and s-segment paradigms; the second is the (k+s+r)-anonymity technique, which integrates (k+s)-anonymity with the r-anonymity paradigm. We compare our privacy technique with these two approaches, whose details are introduced in Section 5.3.

We then improve our original time-obfuscated method by integrating it with the controlled query-location method. The idea is to limit the distance between the locations of randomly generated queries and the user's current location, which reduces the probability that cached query results time out. To do so, we estimate the maximum travel distance of the user based on his/her current location and average traveling velocity. We use this distance to define a searching boundary, with the approximate user location as the center and the maximum travel distance as the radius. In contrast to the original method, which uses all locations residing on the trajectories R, the server obtains only the set of query locations within this searching boundary and adopts them as the random query locations. As a consequence, both the chance that a cached result times out and the number of queries can be reduced. Our experiments use several metrics to compare these algorithms.

5.1 The r-anonymity Paradigm

In this experiment, we examine the impact of the number of cells overlapped by the user trajectory. Generally, the system suffers an increasing overhead cost as the number of processed cells increases. Figure 8 illustrates the percentage of overlapped cells for 10 user trajectories as the cell size ranges from 100m to 500m.

Figure 8: Average of r trajectory cells.

Figure 9: The r-anonymity processing time.

We observe that as r increases, the number of overlapped cells increases steadily. The overlapped cells account for no more than 5% of the total number of cells, which shows that our approach does not incur a significant maintenance overhead on the server. Furthermore, we vary the cell size from 100m x 100m to 500m x 500m. For all cases, the number of overlapped cells increases as the cell size increases. Since we use a cell as the unit of location information, fewer cells overlap the user trajectory when the cell size is set to 500m.

In Figure 9, we show the CPU processing time versus the r value. Performance degrades because the overlap function must compare every user trajectory with the trajectories in the r-trajectory array to obtain a set of identical trajectories. Since this stage is executed before a user starts to travel, it does not affect the query processing time for the user.

Next, Figure 10 shows the cell distribution of the r trajectories before and after the r-trajectory cloaking. The black arrow line represents the user trajectory and its direction. We use a cell size of 200m x 200m for the partition and set k to 10 and s to 20 in the privacy profile.

Figure 10 panels: (a) r = 5, (b) r = 10, each shown before and after cloaking.

Figure 10 panels: (c) r = 15, (d) r = 20, each shown before and after cloaking.

Figure 10: The r-anonymity schematic diagram.

In Figure 10(a)(b)(c)(d), we can observe that as r increases, the cells are distributed over a broader area (the shaded area), which is more likely to conceal the actual user trajectory from a LBS. After the query processing is completed, an adversary may link all the queries to reconstruct the user trajectory. However, as we can see in Figures 10(c) and (d), the user trajectory is hidden in these cells, and it is almost impossible to reveal the original cells blurred by the r trajectories. These results indicate that our r-anonymity preserves user trajectory privacy in the long term.

5.2 The k-anonymity and s-segment Paradigms

We use three sets of results to show that our privacy technique provides strong privacy protection while respecting the user privacy profile.

5.2.1 Cloaked Regions

We use two measures to evaluate the size of a cloaked region. The first is the average region size of queries (in km²) and the second is the number of POIs. Figure 11 shows the cloaked regions obtained as we vary the cell size. We set s to 10, 20, and 30 and combine each value with k set to 5, 10, 15, and 20. We fix the number of users to 10 when running each set of simulations.

Figure 11 panel: (a) s = 10.

Figure 11: Cloaked regions. Panels: (b) s = 20, (c) s = 30.

Figure 12 panels: (a) 100m, (b) 200m.

Figure 12: Average number of POIs in cloaked region. Panels: (c) 300m, (d) 400m, (e) 500m.

As we can see in Figure 11, when k increases, the average region size becomes larger. However, the impact of s is not significant. In addition, when the map is partitioned with a small cell size, the system achieves better privacy protection. Overall, the average query area is under 1 km² and strong privacy protection is still maintained. Figures 12(a-e) support further conclusions. In Figure 12, we verify that the number of POIs within the cloaked region is affected greatly by the k value and less by s (the number of road segments). Furthermore, the number of POIs returned by the LBS increases as the cell size increases. Therefore, a smaller cell size yields better performance.

5.2.2 The k-anonymity Paradigm

As shown in Figure 13, we use different cell sizes and numbers of road segments to verify that the cloaked region contains at least the k users specified in the privacy profile.

Figure 13: Average users in cloaked region. Panels: (a) s = 10, (b) s = 20, (c) s = 30.

Figure 14: Maximum users in cloaked region. Panels: (a) s = 10, (b) s = 20, (c) s = 30.

In all cases, the average number of users exceeds k. With a cell size of 500m x 500m, the number of users is almost five more than k. Figure 14 shows the maximum number of users, where we can observe that the maximum number of users in the search results is almost 1.5 times k. Notably, when s is set to 30, the number of users in the results reaches three times k for k = 5. These results show that our approach satisfies the k-anonymity policy and supports strong privacy protection.
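The cloaking evaluated in this section expands a region around the query cell until it holds at least k users and then at least s road segments. The Python sketch below illustrates that two-stage expansion under stated assumptions: the per-cell counts are given as dictionaries, the region grows as a square ring by ring around the query cell, and the names `cloak` and `max_radius` are hypothetical. The thesis does not specify the neighbor search order, so this is one possible strategy rather than the author's method.

```python
def cloak(cell, users, segments, k, s, max_radius=50):
    """Return the list of cells in the cloaked region around `cell`.

    users, segments: dicts mapping (row, col) -> count for that cell.
    The square region grows until it holds at least k users, then
    keeps growing until it also holds at least s road segments.
    """
    def region(radius):
        r0, c0 = cell
        return [(r, c)
                for r in range(r0 - radius, r0 + radius + 1)
                for c in range(c0 - radius, c0 + radius + 1)]

    def total(counts, cells):
        return sum(counts.get(x, 0) for x in cells)

    radius = 0
    # Stage 1: satisfy k-anonymity.
    while total(users, region(radius)) < k and radius < max_radius:
        radius += 1
    # Stage 2: satisfy the s-segment requirement.
    while total(segments, region(radius)) < s and radius < max_radius:
        radius += 1
    return region(radius)
```

When the region that satisfies k already contains at least s segments, the second loop does nothing, so changing s leaves the region unchanged.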

5.2.3 The s-segment Paradigm

We show the average number of road segments found in the query location. In Figure 15, we vary the cell size and use different k values to examine the impact on road segments.

Figure 15: Average number of segments in cloaked region. Panels: (a) k = 5, (b) k = 10, (c) k = 15, (d) k = 20.

As we can see, when s increases, the number of road segments remains virtually unchanged. When a query is issued, our privacy algorithm first obtains k users and then continues the process until the cloaked region contains at least s road segments. The results are as expected, since a large cell is very likely to cover more road segments and the cloaked region obtained by k-anonymity already contains more than 30 road segments.

5.3 Average Query Processing Time

Since we use a time-obfuscated technique to improve trajectory privacy, we evaluate the performance of our privacy method by comparing it with two baseline privacy methods and one modified method that extends our original method. The (k+s)-anonymity approach considers only k-anonymity and s-segment, without using the trajectories R to blur the user trajectory. In addition, the (k+s)-anonymity approach sends queries in their original issuing order. The (k+s+r)-anonymity approach considers r-anonymity, k-anonymity, and s-segment.

However, the query issuing order is still not randomized. Finally, (k+s+r+t)-anonymity is the core method proposed in this thesis, and the (k+s+r+ct)-anonymity approach uses the controlled query-location technique to enhance the (k+s+r+t)-anonymity method. We evaluate the query processing time complexity of these four methods in Table 2, where k, s, and r are the parameters in the privacy profile specified by a user; S_n and R_n are the number of user queries and the number of queries used by the time-obfuscated method, respectively; H is the number of historical trajectories stored in a database; and F_u and F_h are the numbers of footprints on the user trajectory and on the historical trajectory, respectively. Finally, C is the number of cells overlapped with the r trajectories.

In the query processing time experiment, r is set to 10. As discussed in Section 5.2.3, the segment factor has little impact on performance, so we evaluate the four methods by varying k. At each query issuing time, we set MAX_QUERY_NUM to five. We compare the query evaluation time (i.e., the total CPU time of performing the five queries) of the four methods.

Table 2: Complexity of four methods.

Figure 16 panel: (a) s = 10.

Figure 16: Average query processing time. Panels: (b) s = 20, (c) s = 30.

Figure 16 shows that our method incurs higher CPU time, because the time-obfuscated technique additionally searches the cells covered by the r trajectories. Furthermore, since the user privacy profile must be satisfied for each query location, the server needs to maintain the cache data of the r trajectories at every query time. On the other hand, (k+s)-anonymity and (k+s+r)-anonymity send queries only when necessary; therefore S_n is less than R_n. That is, the baseline methods issue fewer queries. In our experiments, the (k+s)-anonymity and (k+s+r)-anonymity methods send fewer than five queries at each query time and fewer queries overall than the time-obfuscated technique, which gives them better performance than our approach. However, our approach provides better privacy protection. To address this performance issue, we plan in our future work to propose a novel architecture that overcomes the limits of the trusted anonymity server.

An unexpected result we observed from Figure 16 is that the performance of the controlled query-location method is not significantly better than that of the (k+s+r+t)-anonymity approach. The reason is that the server needs to compute a search boundary for every query to generate a randomized query location. As we can see in Table 2, the (k+s+r+ct)-anonymity approach additionally costs O(R_n * (F_u * C)) to search for a boundary at every query issuing time.

5.4 Number of Duplicate Queries

As the results in the previous section show, our approach requires more CPU time to process queries and causes the system to issue more queries in exchange for strong privacy protection. We first test the number of duplicate queries using the baseline methods and find that they produce none. As noted in Section 5.3, this is because the query sequence of the baseline methods follows the user query sequence and the server sends queries only when necessary. We therefore present only the results of the original time-obfuscated method and the controlled query-location method to examine the server maintenance overhead. Figure 17 illustrates the number of duplicated queries sent by the server using the original method. If there are too many duplicate queries, the performance of the anonymity server degrades. Figure 18 shows that the controlled query-location method improves the performance by reducing the number of duplicate queries. In our experiments, to test the impact of duplicate queries, the parameters r, k, and s are set to 10, 10, and 20, respectively, and we vary the cell size and expiration time. The number of users is set to 10, and we compute the average number of queries.

Figure 17: Number of duplicated queries in using original method.
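The relation between the time-out time and the number of duplicate queries can be illustrated with a small Python sketch of a result cache. The definition used here is an assumption, since the text does not define the term precisely: a query is treated as a duplicate when the same cell must be queried again because its cached result has expired. The function name `count_duplicates` and the representation of queries as (time, cell) pairs are also hypothetical.

```python
def count_duplicates(queries, result_live_time):
    """Count queries sent to the LBS and how many of them are duplicates.

    queries: list of (time, cell) pairs in issue order.
    A query is served from the cache while the cell's result is still live.
    It is counted as a duplicate when the cell was queried before and its
    cached result has expired (an assumed definition).
    Returns (sent, duplicates).
    """
    expires = {}  # cell -> time at which its cached result expires
    sent = 0
    duplicates = 0
    for t, cell in queries:
        if cell in expires and t < expires[cell]:
            continue  # cached result still valid, nothing is sent
        if cell in expires:
            duplicates += 1  # result expired, the cell is queried again
        expires[cell] = t + result_live_time
        sent += 1
    return sent, duplicates
```

For the query sequence `[(0, "A"), (30, "A"), (70, "A"), (80, "B"), (200, "A")]`, a live time of 60 gives two duplicates, while a live time of 180 gives one, which matches the trend that longer time-outs reduce duplicate queries.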

Figure 18: Number of duplicated queries in using controlled query-location method.

In Figure 17, as the time-out time is extended, the number of duplicate queries gradually decreases. Note that when the cell size is set to less than 300m x 300m, the total number of queries is only twice the number of queries without the time-obfuscated technique. The rationale is that when the cell size is relatively large, fewer cells overlap the r trajectories, so the anonymity server has fewer locations from which to randomly send queries. These results indicate that our time-obfuscated technique does not dramatically increase the overhead on the anonymity server while achieving better privacy protection.

In Figure 18, the results show that the controlled query-location technique substantially improves the performance of the original approach. The average number of duplicated queries is generally below twice the number of queries without the time-obfuscated technique. When the cell size is set to 100m x 100m and 200m x 200m, this approach sends nearly only one query during the user's travel time. Therefore, the controlled query-location method arranges queries efficiently and reduces the server maintenance overhead.

5.5 Run Test

We investigate the randomness of the query sequence produced by the time-obfuscated method. If the query sequence follows the order of the user trajectory, it may disclose the user's direction of travel and lower the level of privacy protection, making the privacy protection technique useless. Therefore, we adopt the run test (Wald-Wolfowitz test) to check the randomness of the query sequence generated by our time-obfuscated method. First, we assign a cell id sequentially to each cell on the map, from west to east and then from north to south. Second, the locations of the query sequence are recorded in order of query time. Third, a sequence of two-valued data (+ and -) is created as follows: if the query location id of the i-th query is larger than that of the (i + 1)-th query, the value is set to "+"; otherwise, it is set to "-". Finally, the number of runs in the sequence is used to test the randomness hypothesis based on the run test.

The parameters r, k, s, and the number of users are all set to 10, the time-out time for the results is set to 1 minute, and the alpha value (α) for the run test is set to 0.05 in the simulations. For each simulation, we check for each user whether the number of runs falls within the acceptance range of the run test. We have observed that for all users of all simulations, the randomness hypothesis is accepted. We use the baseline method, (k+s)-anonymity, for comparison with our time-obfuscated method and the controlled query-location method. Each simulation result is chosen from one of the 20 simulations. Table 3 shows the run values and acceptance ranges for the baseline method. We present the results of the original time-obfuscated method in Table 4 and those of the controlled query-location method in Table 5.

Table 3: Run test results of baseline method.
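The run test procedure described above can be sketched in Python as follows. The sign sequence follows the text: "+" when the cell id of one query is larger than that of the next, "-" otherwise. The acceptance range is computed with the standard normal approximation of the Wald-Wolfowitz test (z = 1.96 for α = 0.05), which is an assumption: the thesis does not state how its acceptance ranges were obtained, and exact critical-value tables are often used for short sequences. The function name and the returned dictionary are hypothetical, and a sequence with only one kind of sign, for which the test is undefined, is reported as not accepted.

```python
import math


def runs_test(cell_ids, z_crit=1.96):
    """Wald-Wolfowitz run test on the +/- sequence derived from cell ids."""
    signs = ["+" if a > b else "-" for a, b in zip(cell_ids, cell_ids[1:])]
    n1, n2 = signs.count("+"), signs.count("-")
    # A run is a maximal block of identical consecutive signs.
    runs = (1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)) if signs else 0
    if n1 == 0 or n2 == 0:
        # Only one kind of sign (e.g. a monotone sequence): test undefined.
        return {"runs": runs, "n1": n1, "n2": n2, "accepted": False}
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    half_width = z_crit * math.sqrt(var)
    lower, upper = mean - half_width, mean + half_width
    return {"runs": runs, "n1": n1, "n2": n2, "mean": mean,
            "lower": lower, "upper": upper,
            "accepted": lower <= runs <= upper}
```

A query sequence that follows the trajectory in cell-id order produces a single run and is rejected, whereas a shuffled sequence such as `[9, 7, 5, 8, 3, 4, 6, 2, 10]` yields six runs, which lies inside the approximate acceptance range.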

Table 4: Run test results of original method.

Table 5: Run test results of using controlled query-location method.

In Table 3, we can see that almost all the run values are below the lower bound of the acceptance range and that almost all of them equal one. The results indicate that the baseline method fails to randomize the query sequence, so a malicious LBS can easily link the user's actual trajectory and moving direction. Hence, this method lowers the level of privacy protection. From Table 4, we can see that all the run values are within the acceptance range. The run test results indicate that our random query sequence strongly confuses the LBS and avoids revealing the time sequence of query locations. Thus, our privacy protection method not only provides real-time correctness of the query results to the user, but also prevents the LBS from knowing the user trajectory and moving direction. In Table 5, all the run values are also within the acceptance range, which indicates that the controlled query-location method is also able to randomize the query sequence. Overall, the controlled query-location method significantly enhances the performance of the original time-obfuscated technique: it not only decreases the server maintenance overhead, as shown in Section 5.4, but also provides the same level of privacy protection as the original time-obfuscated technique.
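The searching boundary used by the controlled query-location method, a circle centered on the approximate user location whose radius is the user's maximum travel distance, can be sketched in Python as below. This is an illustration under stated assumptions: locations are planar coordinates in meters, the maximum travel distance is taken as average speed multiplied by a time horizon (for example the time-out time of cached results), and the function and parameter names are hypothetical. The thesis does not give the exact formula.

```python
import math


def candidates_within_boundary(user_xy, avg_speed_mps, horizon_s, r_cells_xy):
    """Keep only the r-trajectory locations inside the searching boundary.

    user_xy:       approximate user location (x, y) in meters.
    avg_speed_mps: average traveling velocity in meters per second.
    horizon_s:     assumed time horizon in seconds.
    r_cells_xy:    candidate locations (x, y) on the trajectories R.
    """
    radius = avg_speed_mps * horizon_s  # assumed maximum travel distance
    ux, uy = user_xy
    return [c for c in r_cells_xy
            if math.hypot(c[0] - ux, c[1] - uy) <= radius]
```

Random query locations would then be drawn from the returned list instead of from all locations on the trajectories R.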

Chapter 6 : CONCLUSIONS and FUTURE WORK

In this thesis, we have proposed a novel technique for trajectory privacy protection that considers the long-term disclosure issue as well as real-time ambient conditions. We allow users to specify their privacy profile, and our method integrates the r-anonymity, k-anonymity, and s-segment paradigms. We also introduce a time-obfuscated technique to substantially increase the level of trajectory privacy protection. As the experimental results show, our privacy technique is able to protect the user trajectory and prevent a LBS from reconstructing a user's actual trajectory and direction. Our method satisfies the user privacy profile in order to provide stronger privacy protection. In the experimental results, our technique achieves performance competitive with the traditional approaches.

A trusted anonymity server may become a single point of failure. Therefore, our future work is to develop the idea of a personal data vault, which behaves as the user's personal trusted server. Specifically, all personal sensing data or location information could be stored in the user's personal data vault, which enables each user to completely control his/her data and to implement privacy protection mechanisms such as our proposed time-obfuscated method. We plan to investigate a peer-to-peer architecture for connecting personal data vaults based on users' on-line social relations (e.g., friends on a social networking site such as Facebook), such that a group of personal data vaults behaves as distributed trusted anonymity servers.