Policy-based Pre-Processing in Hadoop

Yi Cheng, Christian Schaefer
Ericsson Research, Stockholm, Sweden
yi.cheng@ericsson.com, christian.schaefer@ericsson.com

Abstract

While big data analytics provides useful insights that benefit business and society, it also brings big privacy concerns. To allow valuable data usage while preserving user privacy, data anonymization is often a precondition for analytics applications to access sensitive user data. As one of the most popular platforms for big data storage and processing, Apache Hadoop provides access control based on file permissions or access control lists, but currently does not support conditional access. In this paper we introduce application and user privacy policies and a policy-based pre-processor to Hadoop. This pre-processor can perform anonymization, filtering, aggregation, encryption and other modifications on sensitive data before it is given to an application. By integrating policy-based pre-processing into the MapReduce framework, privacy and security mechanisms can be applied on a per-application basis to provide higher flexibility and efficiency.

Keywords: privacy; policy; authorization; conditional access; anonymization; pre-processing; Hadoop.

1. Introduction

In the era of big data, massive amounts of data are collected, processed and analyzed to provide insights that could benefit different aspects of our society, including communication, business, science, and healthcare. However, big data also brings big privacy concerns. Users' privacy is at risk if their data, or data about them, are not properly protected. To allow valuable data usage while preserving user privacy, data anonymization is often a precondition for analytics applications to access sensitive user data such as location, medical records, financial transactions, etc.

As one of the most popular platforms for big data storage and processing, Apache Hadoop [1] needs to protect data from unauthorized access. So far, Hadoop provides access control based on file permissions or access control lists (ACLs), which specify who can do what on the associated data. It is currently not possible to specify the conditions under which access can be granted. Nor does Hadoop provide native support for data pre-processing (anonymization, filtering, aggregation, encryption, etc.) in order to satisfy access conditions. In this paper we introduce application- and user-specific privacy policies and a policy-based pre-processor to Hadoop to support data pre-processing and conditional access.

The rest of the paper is organized as follows. In Section 2 we review the authorization mechanisms currently employed by Hadoop. In Section 3 we introduce application- and user-specific privacy policies to enable more flexible access control. Per-application pre-processing based on these policies is described in detail in Section 4; advantages and disadvantages of the proposal are also discussed in that section. Finally, in Section 5 we draw conclusions and indicate the direction of future work.

2. Overview of Hadoop Authorization Mechanisms

Apache Hadoop is an open source framework that provides distributed storage and processing of large data sets on clusters of commodity computers. Originally designed without security in mind, Hadoop neither authenticated users nor enforced access control. With the increased adoption of Hadoop for big data processing and analytics, security vulnerabilities and privacy risks have been pointed out and corresponding enhancements made [2], [3].
The Hadoop Distributed File System (HDFS) [4] and HBase [5], the basic data stores for MapReduce [6] batch processing, employ authorization mechanisms to protect data privacy.

2.1 HDFS

For access control, HDFS uses UNIX-like permission bits for read, write and execute operations. Each file and directory has three sets of such permission bits, associated with the owner, the group and others, respectively. Before an application can open a file for reading or writing, the NameNode checks the relevant permission bits to see whether the application (or, more precisely, the user on whose behalf the application is running) is allowed to do so.

2.2 HBase

In HBase, the AccessController coprocessor running inside each RegionServer intercepts every access request and consults an in-memory ACL table to allow or deny the requested operation. The ACL table regulates read/write/create/admin permissions at table, column family (CF) and column (qualifier) level, but not at cell level. For cell-level permission checking, HBase has recently introduced per-cell ACLs that are stored as cell tags. The AccessController evaluates the union of the table/CF ACLs and the cell-level ACL to decide whether the cell in question can be accessed.
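To make the HDFS permission model described in Section 2.1 concrete, the following sketch uses the standard org.apache.hadoop.fs API to set and read back UNIX-like permission bits on a file. The file path is a hypothetical example and the snippet only illustrates the existing mechanism; it is not part of our proposal.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path records = new Path("/data/shopping/records.csv"); // hypothetical path

        // rw- for the owner, r-- for the group, --- for others (i.e. mode 640)
        fs.setPermission(records,
                new FsPermission(FsAction.READ_WRITE, FsAction.READ, FsAction.NONE));

        // The NameNode checks these bits when a client opens the file;
        // here we simply read them back for illustration.
        FsPermission p = fs.getFileStatus(records).getPermission();
        System.out.println(records + " has permission " + p);
        fs.close();
    }
}
```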
3. Privacy Policies

As seen in Section 2, HDFS file permissions and HBase ACLs specify who can do what on the associated data. They can be used to define simple privacy policies but do not provide good support for more complex and flexible policies. One important feature that is lacking is the ability to specify the condition under which access can be granted. For example, it is not possible to allow a statistical application to access online shopping records under the condition that user identifiers have been removed.

As users are becoming increasingly privacy conscious, they should also be given the possibility to specify their own privacy policies. Continuing with the previous example, some users may choose not to participate in the statistics at all. The statistical application should therefore not be allowed to access the online shopping records of those users even if their identities are not disclosed. Assuming all users' online shopping records are stored together in one big table in which each row contains information about one transaction of one user, we cannot give the statistical application access to all data items in a column, for instance the goods name column. Thus access permissions at file, table, or column level are not enough in this case.

HBase's cell-level ACLs may be used to reflect individual users' policy settings, but they have some drawbacks. In a big table each user may have a large number of records. Since a user's policy regarding a specific type of data is unlikely to change from record to record, the same policy will be stored as a per-cell ACL repeatedly, wasting storage space. As mentioned earlier, a cell-level ACL is stored as a cell tag together with the data. It has to be read from storage before the AccessController coprocessor can make an authorization decision, which is much slower than reading the ACL table from local memory. Reading the same policy again and again from different cells further wastes system resources.

To solve the above problems we add two types of privacy policies to Hadoop: application policies and user policies.

3.1 Application Policies

As the name implies, application policies are defined per application. They should be derived from the relevant legal requirements, organizational rules and business rules, which in turn depend on the application's properties (purpose, core or non-core business, internal or external application, etc.). For example, one legal requirement could be that private data collected for core business (e.g. customer care) may not be used for non-core business (e.g. marketing) without explicit user consent. Assuming user consent has been obtained, a business rule could impose the further restriction that the same data set as used for customer care must be anonymized before it is used for marketing purposes. Accordingly, anonymization, aggregation, encryption or other data modifications should be put as preconditions in application policies for analytics applications to access sensitive user data.

For each application that will be run on the Hadoop cluster, a policy is defined that includes the following information:

- Application id
- The types of data the application is allowed to access (or not allowed to access)
- If access is allowed only under certain conditions:
  - If the condition is anonymization: the anonymization algorithm and the necessary parameters
  - If the condition is aggregation: the aggregation method and the data types to be aggregated
- A user policy flag indicating whether users can have individual privacy policies for this application
- A flag indicating whether the new data version created after pre-processing should be deleted after usage

Other information, e.g. the encryption algorithm, can also be specified in the application policy if needed. Application policies are securely stored in a policy database and loaded into the memory of the policy-based pre-processor when it starts up; an illustrative sketch of such a policy structure is given below.
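The following is a minimal sketch of how such an application policy could be represented in memory. The class, field and enum names are illustrative assumptions; the paper does not prescribe a concrete encoding, and in practice the policy would be loaded from the policy database.

```java
import java.util.List;
import java.util.Map;

/** Illustrative in-memory representation of an application policy (names are examples only). */
public class ApplicationPolicy {

    /** Pre-processing condition attached to a data type. */
    public enum Condition { NONE, ANONYMIZATION, AGGREGATION, ENCRYPTION, FILTERING }

    private String applicationId;                       // application id
    private List<String> allowedDataTypes;              // data types the application may access
    private List<String> deniedDataTypes;               // data types the application may not access
    private Map<String, Condition> accessConditions;    // data type -> required pre-processing
    private String anonymizationAlgorithm;              // e.g. "k-anonymity"
    private Map<String, String> anonymizationParams;    // e.g. k = 5
    private String aggregationMethod;                   // e.g. "count", "average"
    private List<String> aggregatedDataTypes;           // data types to be aggregated
    private boolean userPoliciesAllowed;                // user policy flag
    private boolean deletePreProcessedCopyAfterUse;     // delete the new data version after usage

    // getters and setters omitted for brevity
}
```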
3.2 User Policies

User policies are defined according to the privacy preferences of individual users and may express deviations from application policies. For instance, consider an application in which all users in the system are enrolled by default. The policy for this application allows it to access users' location data, but a few users decide to opt out of this application. These users' policies will then forbid that specific application from accessing their location data. In normal cases user policies take precedence over an application's policy, unless the application is so critical that it must have access to all users' data; in that circumstance, however, users are most likely not given the possibility to opt out in the first place.

Unlike cell-level ACLs, user policies are cached in the memory of each MapReduce TaskTracker for fast policy retrieval. The master copy of the user policies is stored in ZooKeeper [7]. Every TaskTracker obtains a copy of the policies from ZooKeeper and loads it into memory at startup time. Whenever a change is made to the user policies, ZooKeeper notifies the TaskTrackers and ensures that their local copies are updated.
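As an illustration of this mechanism, the sketch below uses the standard ZooKeeper Java client to cache the user policies and re-read them whenever the corresponding znode changes. The znode path, the raw byte[] representation of the policies and the cache class itself are assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicReference;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/** Keeps a local in-memory copy of the user policies in sync with ZooKeeper (illustrative sketch). */
public class UserPolicyCache implements Watcher {

    private static final String POLICY_ZNODE = "/policies/users"; // assumed znode path
    private final ZooKeeper zk;
    private final AtomicReference<byte[]> cachedPolicies = new AtomicReference<byte[]>();

    public UserPolicyCache(String zkQuorum) throws Exception {
        // 'this' is the default watcher; events are delivered to process()
        this.zk = new ZooKeeper(zkQuorum, 30000, this);
        reload();
    }

    /** Re-read the policies and re-register the watch. */
    private void reload() throws Exception {
        byte[] data = zk.getData(POLICY_ZNODE, this, null); // the watch fires on the next change
        cachedPolicies.set(data);                           // parsing into policy objects omitted
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                reload();  // policies changed in ZooKeeper: refresh the local copy
            } catch (Exception e) {
                // in a real TaskTracker this would be logged and retried
            }
        }
    }
}
```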
4. Per-application Data Pre-Processing

Different applications usually have different purposes, use different input data and produce different outputs, and different legal requirements and/or business rules apply to them, as discussed in Section 3.1. If the same data set is used by multiple applications, it may, for instance, have to be anonymized in different ways for different applications. Even if the same anonymization algorithm can be used, its parameters should be tuned to the application's specifics. Data pre-processing, where required by privacy policies, therefore needs to be done on a per-application basis. For applications that run periodically, the input data has presumably changed since the last run, so pre-processing has to be carried out again at each new run.

To perform data pre-processing on a per-application basis, the following steps need to be taken for each application:

1. Based on legal requirements, organizational rules and business rules, determine whether data pre-processing is required before the application is allowed to access the requested data. If so, determine the necessary algorithm and parameters.
2. Perform the determined pre-processing on the original data set and create a new version of the data set.
3. Give this new (e.g. anonymized) version of the data set, instead of the original one, as input to the application.
4. When the application is finished, delete the new version if it is no longer needed.

Manually carrying out the above process for each application will obviously incur a lot of administrative overhead. Below we introduce a policy-based pre-processor to the Hadoop framework to make per-application data pre-processing an integral part of MapReduce job execution.

4.1 Policy-based Pre-Processor (PPP)

As described in Section 3.1, application policies regulate which application can do what, on which data, under which conditions. When an application is submitted as a job to the MapReduce JobTracker, the PPP examines the application policy associated with this application and, if required by the policy, initiates anonymization and/or other types of pre-processing to create a new version of the input data set. As this new version is created specifically for the application in question, the application should have full access to its content and no one else should have any access. The file permissions or ACLs for the new version are set accordingly (the settings on the original data set may no longer be valid after pre-processing and are therefore not applied to the new data set). The PPP re-submits the job with this new version of the data set as input. If necessary, the PPP deletes the new data set after the job is done.

PPP-initiated pre-processing is itself carried out as a MapReduce job, taking advantage of the parallel processing capability of Hadoop. We call such a job a PPP job. A PPP job is given high priority to ensure that it will be picked up by the job scheduler for execution as early as possible. Running under a privileged account (e.g. the Hadoop user account), PPP jobs have unrestricted access to the data stored in the Hadoop cluster.

4.2 Types of Pre-Processing

A PPP job can perform any kind of data pre-processing, including anonymization, filtering, aggregation, encryption and other data modifications. If two or more types of pre-processing are to be done, the order in which they are applied should be carefully considered. Taking the online shopping statistics application from Section 3 as an example again, filtering of the goods name column based on user privacy policies (i.e. removing the goods name in those rows corresponding to users who do not want their data to be included in the statistics) must be performed before the user identifiers are removed or pseudo-anonymized.

Pseudo-anonymization provides a simple means of privacy protection by replacing the user identifiers in a data set with some artificial value. It has been widely used to pre-process data sets that contain sensitive information. However, with the fast development of data mining and data analytics techniques, pseudo-anonymization has proven not strong enough to prevent sophisticated re-identification attacks [8], [9]. More advanced anonymization techniques, for example k-anonymity [10], l-diversity [11] and t-closeness [12], may be needed.

We assume that the PPP has a library of MapReduce programs that implement different (pseudo-)anonymization, aggregation and encryption algorithms, and that can also filter given fields in a data set.
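As an example of what a program in this library could look like, the following is a minimal sketch of a map-only pseudo-anonymization pass that replaces the user identifier with a SHA-256 hash. It assumes, purely for illustration, that the input is plain CSV text in HDFS with the user identifier in the first column; a real PPP library program would read its parameters (columns to modify, algorithm, applicable user policies) from the job configuration.

```java
import java.io.IOException;
import java.security.MessageDigest;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * Map-only pseudo-anonymization: replaces the user identifier in the first CSV
 * column with a SHA-256 hash and passes the remaining columns through unchanged.
 */
public class PseudoAnonymizeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<NullWritable, Text> out, Reporter reporter)
            throws IOException {
        String[] cols = line.toString().split(",", 2);   // [userId, rest-of-record]
        String anonymized = sha256(cols[0]) + (cols.length > 1 ? "," + cols[1] : "");
        out.collect(NullWritable.get(), new Text(anonymized));
    }

    private static String sha256(String s) throws IOException {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(s.getBytes("UTF-8"))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new IOException("hashing failed", e);
        }
    }
}
```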
4.3 Integration into MapReduce

In Hadoop, applications can access data in an HDFS file or in a database table (HBase, or Hive, which is built on top of HDFS). For ease of description, we assume that data is stored in tables with rows and columns. To enable per-application data pre-processing, we propose to modify the MapReduce job submission procedure as shown in Figure 1. Before a client-submitted job is placed into the job queue, the JobTracker interacts with the PPP to perform the necessary pre-processing on the input data, if required by the application policy. Below we describe the modified procedure step by step (as numbered in Figure 1).

Figure 1. Modified MapReduce job submission procedure with PPP

Step (1): The client (on whose behalf the application will run) sends a request to the JobTracker for creating a new MapReduce job, including the job JAR file, the configuration file, and the input and output specifications. Table T is specified as the data input. In addition to this standard information, the application id S is also provided.

Step (2): The application job request is forwarded to the PPP, which is a new component introduced to the Hadoop framework.

Step (3): The PPP retrieves the application policy for S from memory and, based on it, identifies the data types S is allowed or not allowed to access and under which conditions. Using the table metadata of T, the PPP maps the identified data types to the columns in T.

Step (4): The PPP creates a MapReduce job to pre-process the data in T, using an existing MapReduce program, in the form of a JAR file, that can perform anonymization (pseudo-anonymization or more advanced anonymization), filtering, aggregation, encryption and other data modifications. The PPP also specifies the necessary parameters based on the policy for S, e.g. the anonymization algorithm to use, the column(s) to be pseudo-anonymized, the column(s) to be filtered out, etc. If the user policy flag is set to TRUE, the MapReduce program will also do pre-processing based on user policies.

Step (5): The PPP job is given high priority and put into execution. The PPP job outputs the modified data into a new table T'.
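To illustrate Steps (4) and (5) before continuing with the procedure, the sketch below shows how the PPP could configure and submit a high-priority pre-processing job using the classic Hadoop 1 mapred API. For simplicity it assumes that T and T' are HDFS paths rather than HBase/Hive tables, that the selected library program is a mapper such as the pseudo-anonymization sketch above, and that the driver class and its arguments are hypothetical.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;
import org.apache.hadoop.mapred.RunningJob;

/** Illustrative driver for a PPP job (Steps 4-5): pre-process T into T' with high priority. */
public class PPPDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PPPDriver.class);
        conf.setJobName("ppp-preprocess-" + args[0]);       // args[0]: application id S

        // Pre-processing program selected from the PPP library (cf. Section 4.2)
        conf.setMapperClass(PseudoAnonymizeMapper.class);
        conf.setNumReduceTasks(0);                          // map-only job
        conf.setOutputKeyClass(NullWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[1]));   // original data set T
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));  // new data set T'

        // Step (5): the PPP job is given high priority so the scheduler picks it up early
        conf.setJobPriority(JobPriority.VERY_HIGH);

        RunningJob job = JobClient.runJob(conf);            // blocks until the PPP job finishes
        System.out.println("PPP job finished, success = " + job.isSuccessful());
    }
}
```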
Step (6): When the PPP job is finished, the PPP replaces T with T' in the original job's input specification and sends the job to the job queue.

Step (7): When the application job is finished, the PPP deletes table T' if required by the policy for S.

As a simple example, assume that the application policy for a marketing service S states that S is allowed to access location data under the condition that user identifiers have been pseudo-anonymized. The policy also specifies the pseudo-anonymization algorithm to be a hash of the user identifier. We further assume there are three users: Alice, Bob and Dan. Alice and Dan do not have any specific policy for their location data, while Bob has a policy forbidding his location data to be used by service S. Figure 2 shows the original input table to service S with the subscriber names and location coordinates of all three users.

Figure 2. Original input table to service S

After policy-based pre-processing, a new table is created and used as input to S instead. As shown in Figure 3, in this new table the subscriber name column contains hash values and the location columns contain only two users' latitude and longitude coordinates (Bob's data has been removed).

Figure 3. New input table to service S

4.4 Advantages and Disadvantages of Policy-based Pre-Processing

Policy-based pre-processing allows anonymization, filtering, aggregation, encryption and other privacy/security mechanisms to be applied to sensitive data before it is accessed by applications, thereby supporting conditional access. Compared to manually arranged pre-processing, the introduction of the PPP makes policy-based pre-processing an integral part of Hadoop MapReduce, reducing administrative overhead. Because the pre-processing is performed per application, it is more dynamic and flexible in addressing application specifics. By filtering out unneeded data in advance, the amount of data the application needs to deal with is reduced and runtime access control is simplified. Furthermore, applications will not even have a chance to snoop on data they are not supposed to access, whether because of unintentional programming errors or malicious code.

An obvious disadvantage of the proposed policy-based pre-processing in MapReduce is that the execution of the analytics application is delayed. This is a price we have to pay to meet preconditions such as anonymization, encryption and aggregation, which usually cannot be performed at the time the data is accessed. If only data filtering is required by the application and user policies, it could probably be achieved by runtime access control without any pre-processing. On the other hand, since the pre-processing is itself carried out in a MapReduce job, it takes advantage of the parallel processing capability of Hadoop and therefore does not introduce very long latency.

Another disadvantage is that the introduction of the PPP affects the existing MapReduce entities. In order to support policy-based pre-processing, the JobTracker needs to interact with the PPP and keep it informed about the execution status of PPP jobs. The TaskTrackers also need additional capabilities to handle user privacy policies.

5. Conclusions

As a new source of energy for our modern society, big data could facilitate business innovation and economic growth. But there is a risk that this new energy pollutes user privacy. To allow valuable data usage while preserving user privacy, data pre-processing such as anonymization and aggregation is often needed to hide sensitive information from analytics applications.
In this paper we have introduced application- and user-specific privacy policies to the Apache Hadoop big data platform. We have also introduced a dedicated pre-processor (PPP) that evaluates those privacy policies and arranges the necessary data pre-processing before releasing the data to analytics applications. By integrating policy-based pre-processing into the MapReduce job submission procedure, privacy and security mechanisms can be applied on a per-application basis to provide higher flexibility and efficiency.

We have described the proposed policy-based pre-processing in the context of Hadoop version 1. In Hadoop version 2, a different resource management model is used and the MapReduce entities have changed. Our future work will focus on the integration of the PPP into the new MapReduce application
execution path, but we do not expect big changes to our proposal.

References

[1] Apache Hadoop, http://hadoop.apache.org/.
[2] K. T. Smith, Big Data Security: The Evolution of Hadoop's Security Model, http://www.infoq.com/articles/hadoopsecuritymodel, Aug. 2013.
[3] D. Das, O. O'Malley, S. Radia and K. Zhang, Adding Security to Apache Hadoop, Hortonworks Technical Report, Oct. 2011.
[4] K. Shvachko, H. Kuang, S. Radia and R. Chansler, The Hadoop Distributed File System, Proc. of the IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, pages 1-10, 2010.
[5] Apache HBase, http://hbase.apache.org/.
[6] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proc. of the 6th Symposium on Operating Systems Design & Implementation (OSDI), pages 137-150, 2004.
[7] Apache ZooKeeper, http://zookeeper.apache.org/.
[8] A. Narayanan and V. Shmatikov, Robust De-anonymization of Large Sparse Datasets, Proc. of the 2008 IEEE Symposium on Security and Privacy, May 2008.
[9] L. Sweeney, Uniqueness of Simple Demographics in the U.S. Population, Carnegie Mellon University Technical Report LIDAP-WP4, 2000.
[10] L. Sweeney, k-anonymity: A Model for Protecting Privacy, International Journal of Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5): 557-570, 2002.
[11] A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, l-diversity: Privacy Beyond k-anonymity, Proc. of the 22nd International Conference on Data Engineering, 2006.
[12] N. Li, T. Li and S. Venkatasubramanian, t-closeness: Privacy Beyond k-anonymity and l-diversity, Proc. of the IEEE 23rd International Conference on Data Engineering, 2007.