Policy-based Pre-Processing in Hadoop
Yi Cheng, Christian Schaefer
Ericsson Research, Stockholm, Sweden

Abstract: While big data analytics provides useful insights that benefit business and society, it also brings big privacy concerns. To allow valuable data usage while preserving user privacy, data anonymization is often a precondition for analytics applications to access sensitive user data. As one of the most popular platforms for big data storage and processing, Apache Hadoop provides access control based on file permissions or access control lists, but currently does not support conditional access. In this paper we introduce application and user privacy policies and a policy-based pre-processor to Hadoop. This pre-processor can perform anonymization, filtering, aggregation, encryption and other modifications to sensitive data before they are given to an application. By integrating policy-based pre-processing into the MapReduce framework, privacy and security mechanisms can be applied on a per-application basis to provide higher flexibility and efficiency.

Keywords: privacy; policy; authorization; conditional access; anonymization; pre-processing; Hadoop.

1. Introduction

In the era of big data, massive amounts of data are collected, processed and analyzed to provide insights that could benefit different aspects of our society, including communication, business, science, and healthcare. However, big data also brings big privacy concerns. User privacy is at risk if their data, or data about them, are not properly protected. To allow valuable data usage while preserving user privacy, data anonymization is often a precondition for analytics applications to access sensitive user data such as location, medical records, financial transactions, etc.

As one of the most popular platforms for big data storage and processing, Apache Hadoop [1] needs to protect data from unauthorized access. So far Hadoop provides access control based on file permissions or access control lists (ACLs), which specify who can do what on the associated data. It is currently not possible to specify the condition under which access can be granted. Nor does Hadoop provide native support for data pre-processing (anonymization, filtering, aggregation, encryption, etc.) in order to satisfy access conditions. In this paper we introduce application and user specific privacy policies and a policy-based pre-processor to Hadoop to support data pre-processing and conditional access.

The rest of the paper is organized as follows. In Section 2 we review the authorization mechanisms currently employed by Hadoop. In Section 3 we introduce application and user specific privacy policies to enable more flexible access control. Per-application pre-processing based on these policies is described in detail in Section 4; advantages and disadvantages of the proposal are also discussed in that section. Finally, in Section 5 we draw conclusions and indicate the direction of future work.

2. Overview of Hadoop Authorization Mechanisms

Apache Hadoop is an open source framework that provides distributed storage and processing of large data sets on clusters of commodity computers. Originally designed without security in mind, Hadoop did not authenticate users nor enforce access control. With the increased adoption of Hadoop for big data processing and analytics, security vulnerabilities and privacy risks have been pointed out and corresponding enhancements made [2],[3].
The Hadoop Distributed File System (HDFS) [4] and HBase [5], the basic data stores for MapReduce [6] batch processing, have employed authorization mechanisms to protect data privacy.

2.1 HDFS

For access control, HDFS uses UNIX-like permission bits for read, write and execute operations. Each file and directory has three sets of such permission bits, associated with the owner, the group and others, respectively. Before an application can open a file for reading or writing, the Name Node checks the relevant permission bit to see whether the application (or, more precisely, the user on whose behalf the application is running) is allowed to do so.

2.2 HBase

In HBase, the Access Controller co-processor running inside each Region Server intercepts every access request and consults an in-memory ACL table to allow or deny the requested operation. The ACL table regulates read/write/create/admin permissions at table, column family (CF) and column (qualifier) level, but not at cell level. For cell-level permission checking, HBase has recently introduced per-cell ACLs that are stored as cell tags. The Access Controller evaluates the union of the table/CF ACL and the cell-level ACL to decide whether the cell in question can be accessed.
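As a point of reference, the short sketch below (with a hypothetical path) shows the granularity HDFS natively offers through its FileSystem API: static owner/group/other permission bits on a path. Note that no condition, such as "only after anonymization", can be attached to such a grant, which is exactly the limitation the rest of this paper addresses.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

// Minimal sketch: restricting an HDFS file with plain permission bits.
// The path /data/shopping/records.csv is hypothetical.
public class SetFilePermission {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path records = new Path("/data/shopping/records.csv");
        // rw- for the owner, r-- for the group, --- for others (i.e. mode 640);
        // nothing here can express a precondition on how the data is used.
        fs.setPermission(records,
                new FsPermission(FsAction.READ_WRITE, FsAction.READ, FsAction.NONE));
    }
}
```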
3. Privacy Policies

As seen in Section 2, HDFS file permissions and HBase ACLs specify who can do what on the associated data. They can be used to define simple privacy policies, but they do not provide good support for more complex and flexible policies. One important feature that is lacking is the ability to specify the condition under which access can be granted. For example, it is not possible to allow a statistical application to access online shopping records under the condition that user identifiers have been removed.

As users are becoming more and more privacy conscious, they should be given the possibility to specify their own privacy policies. Continuing with the previous example, some users may choose not to participate in the statistics at all. The statistical application should therefore not be allowed to access the online shopping records of those users, even if their identities are not disclosed. Assuming all users' online shopping records are stored together in one big table in which each row contains information about one transaction of one user, we cannot give the statistical application access to all data items in a column, for instance the goods name column. Thus access permissions at file, table or column level are not enough in this case.

HBase's cell-level ACLs may be used to reflect individual users' policy settings, but they have some drawbacks. In a big table each user may have a large number of records. Since a user's policy regarding a specific type of data is unlikely to change from record to record, the same policy will be stored as a per-cell ACL repeatedly, wasting storage space. Moreover, as mentioned earlier, a cell-level ACL is stored as a cell tag together with the data; it has to be read from storage before the Access Controller co-processor can make an authorization decision, which is much slower than reading the ACL table from local memory. Reading the same policy again and again from different cells further wastes system resources.

To solve the above problems we add two types of privacy policies to Hadoop: application policies and user policies.

3.1 Application Policies

As the name implies, application policies are per-application. Depending on the application's properties (purpose, core or non-core business, internal or external application, etc.), the relevant legal requirements, organizational rules and business rules should be used to define application policies. For example, one legal requirement could be that private data collected for core business (e.g. customer care) is not allowed to be used for non-core business (e.g. marketing) without explicit user consent. Assuming user consent has been obtained, a business rule could impose the further restriction that the data set used for customer care must be anonymized before it is used for marketing purposes. Accordingly, anonymization, aggregation, encryption or other data modifications should be put as preconditions in application policies for analytics applications to access sensitive user data.

For each application that will be run on the Hadoop cluster, a policy is defined. It includes the following information (a hypothetical in-memory representation is sketched at the end of this subsection):

- Application id.
- The types of data the application is allowed to access (or is not allowed to access).
- If access is allowed only under certain conditions:
  - If the condition is anonymization: the anonymization algorithm and the necessary parameters.
  - If the condition is aggregation: the aggregation method and the data types to be aggregated.
- A user policy flag indicating whether users can have individual privacy policies for this application.
- A flag indicating whether the new data version created after pre-processing should be deleted after usage.

Other information, e.g. the encryption algorithm, can also be specified in the application policy if needed. Application policies are securely stored in a policy database and loaded into the memory of the policy-based pre-processor when it starts up.
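The paper does not prescribe a concrete policy format. Purely as an illustration of the fields listed above, one possible in-memory representation could look like the following sketch; all names are hypothetical and are not part of Hadoop or of the authors' implementation.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical application policy record mirroring the fields described above.
public class ApplicationPolicy {
    String applicationId;                  // e.g. the id "S" used later in the paper
    Set<String> allowedDataTypes;          // data types the application may access
    Set<String> deniedDataTypes;           // data types it must never access
    String anonymizationAlgorithm;         // e.g. "hash-pseudonym"; null if not required
    Map<String, String> anonymizationParams;
    String aggregationMethod;              // e.g. "count"; null if not required
    Set<String> aggregatedDataTypes;
    boolean userPoliciesApply;             // may individual user policies deviate?
    boolean deleteDerivedDataAfterUse;     // drop the pre-processed copy after the job ends
}
```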
3.2 User Policies

User policies are defined according to the privacy preferences of individual users and may express deviations from application policies. For instance, consider an application in which all users in the system are enrolled by default. The policy for this application allows it to access users' location data, but a few users decide to opt out of this application. These users' policies will then forbid that specific application from accessing their location data. In normal cases user policies take precedence over an application's policy, unless the application is so critical that it must have access to all users' data; but in that circumstance users are most likely not given the possibility to opt out in the first place.

Unlike cell-level ACLs, user policies are cached in the memory of each MapReduce Task Tracker for fast policy retrieval. We store the master copy of user policies in ZooKeeper [7]. Every Task Tracker obtains a copy of the policies from ZooKeeper and loads it into memory at startup time. Whenever a change is made to the user policies, ZooKeeper notifies the Task Trackers and ensures their local copies get updated.
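A minimal sketch of such a Task-Tracker-side cache is shown below, assuming the master copy lives under a single znode. The znode path, the serialization format and the error handling are all assumptions; the paper does not specify them, and connection handshake details are glossed over.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative in-memory user-policy cache refreshed via a ZooKeeper watch.
public class UserPolicyCache implements Watcher {
    private static final String POLICY_ZNODE = "/policies/user"; // assumed path
    private final ZooKeeper zk;
    private volatile byte[] cachedPolicies;          // kept in memory for fast retrieval

    public UserPolicyCache(String zkConnect) throws Exception {
        this.zk = new ZooKeeper(zkConnect, 30000, this);
        reload();                                    // load the master copy at startup
    }

    private void reload() throws Exception {
        // Re-registering this object as the watcher means the next change
        // on the znode triggers process() again.
        cachedPolicies = zk.getData(POLICY_ZNODE, this, null);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                reload();                            // refresh the local copy on change
            } catch (Exception e) {
                // A real Task Tracker would log and retry here.
            }
        }
    }

    public byte[] currentPolicies() {
        return cachedPolicies;
    }
}
```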
4. Per-application Data Pre-Processing

Different applications usually have different purposes, use different input data and produce different outputs, and different legal requirements and/or business rules apply to them, as discussed in Section 3.1. If the same data set is used by multiple applications, it has to be, for instance, anonymized in different ways for different applications. Even if the same anonymization algorithm can be used, the relevant parameters should be tuned to the specifics of each application. So data pre-processing, if required by privacy policies, needs to be done on a per-application basis. For applications that are run periodically, as the input data has presumably changed since the last run, pre-processing has to be carried out again at each new run.

To perform data pre-processing on a per-application basis, the following steps need to be taken for each application:

1. Based on legal requirements, organizational rules and business rules, determine whether data pre-processing is required before the application is allowed to access the requested data. If so, determine the necessary algorithm and parameters.
2. Perform the determined pre-processing on the original data set and create a new version of the data set.
3. Give this new (e.g. anonymized) version of the data set, instead of the original one, as input to the application.
4. When the application is finished, delete the new version if it is no longer needed.

Manually carrying out the above process for each application would obviously incur a lot of administrative overhead. Below we introduce a policy-based pre-processor to the Hadoop framework to make per-application data pre-processing an integral part of MapReduce job execution.

4.1 Policy-based Pre-Processor (PPP)

As described in Section 3.1, application policies regulate which application can do what on which data, and under which conditions. When an application is submitted as a job to the MapReduce Job Tracker, PPP examines the application policy associated with this application and, if required by the policy, initiates anonymization and/or other types of pre-processing to create a new version of the input data set. As this new version is created specifically for the application in question, the application should have full access to its content and no one else should have any access. The file permissions or ACLs for the new version are set accordingly (the settings on the original data set may not be valid after pre-processing and are therefore not applicable to the new data set). PPP re-submits the job with this new version of the data set as input. If necessary, PPP deletes the new data set after the job is done.

PPP-initiated pre-processing is itself carried out in a MapReduce job, taking advantage of the parallel processing capability of Hadoop. We call such a job a PPP job. A PPP job is given high priority to ensure that it will be picked up by the job scheduler for execution as early as possible. Running under a privileged account (e.g. the Hadoop user account), PPP jobs have unrestricted access to data stored in the Hadoop cluster.

4.2 Types of Pre-Processing

A PPP job can perform any kind of data pre-processing, including anonymization, filtering, aggregation, encryption and other data modifications. If two or more types of pre-processing are to be done, the order between them should be carefully considered. Taking the online shopping statistics application from Section 3 as an example again, filtering of the goods name column based on user privacy policies (i.e. removing the goods name on those rows corresponding to users who do not want their data to be included in the statistics) must be performed before the user identifiers are removed or pseudo-anonymized.

Pseudo-anonymization provides a simple means of privacy protection by replacing user identifiers in a data set with some artificial value. It has been widely used to pre-process data sets that contain sensitive information. However, with the fast development of data mining and data analytics techniques, pseudo-anonymization has been shown not to be strong enough to prevent sophisticated re-identification attacks [8],[9]. More advanced anonymization techniques, for example k-anonymity [10], l-diversity [11] and t-closeness [12], may be needed. We assume that PPP has a library of MapReduce programs that implement different (pseudo-)anonymization, aggregation and encryption algorithms; they can also do filtering on given fields in a data set.
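The paper does not show the library programs themselves. Purely as an illustration, the following map task sketches what one of them could look like, assuming comma-separated rows of the form userId,goodsName,price and an opt-out list passed through the job configuration under a hypothetical key. It also illustrates the ordering constraint discussed above: rows of opted-out users are dropped before the identifier is pseudo-anonymized.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative PPP-style pre-processing mapper: filter first, then pseudonymize.
public class FilterThenPseudonymizeMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Set<String> optedOut = new HashSet<>();

    @Override
    protected void setup(Context context) {
        // Assumption: opted-out user ids are distributed via the job configuration.
        String list = context.getConfiguration().get("ppp.opted.out.users", "");
        for (String u : list.split(",")) {
            if (!u.isEmpty()) optedOut.add(u);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1);
        String userId = fields[0];
        if (optedOut.contains(userId)) {
            return;                       // filtering happens before pseudonymization
        }
        fields[0] = sha256Hex(userId);    // replace the identifier with a hash value
        context.write(NullWritable.get(), new Text(String.join(",", fields)));
    }

    private static String sha256Hex(String s) throws IOException {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}
```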
4.3 Integration into MapReduce

In Hadoop, applications can access data in an HDFS file or in a database table (HBase, or Hive, which is built on top of HDFS). For ease of description, we assume that data is stored in tables with rows and columns. To enable per-application data pre-processing, we propose to modify the MapReduce job submission procedure as shown in Figure 1. Before a client-submitted job is placed into the job queue, the Job Tracker interacts with PPP to perform the necessary pre-processing on the input data, if required by the application policy. Below we describe the modified procedure step by step (as numbered in Figure 1).

Figure 1. Modified MapReduce job submission procedure with PPP

Step (1): The client (on whose behalf the application will run) sends a request to the Job Tracker for creating a new MapReduce job, including the job JAR file, the configuration file, and the input and output specifications. Table T is specified as the data input. In addition to this standard information, the application id S is also provided.

Step (2): The application job request is forwarded to PPP, which is a new component introduced to the Hadoop framework.

Step (3): PPP retrieves the application policy for S from memory and, based on that, identifies the data types S is allowed or not allowed to access and under which conditions. Using the table metadata of T, PPP maps the identified data types to the columns in T.

Step (4): PPP creates a MapReduce job to pre-process the data in T, using an existing MapReduce program in the form of a JAR file that can perform anonymization (pseudo- or more advanced anonymization), filtering, aggregation, encryption and other data modifications. PPP also specifies the necessary parameters, e.g. the anonymization algorithm to use, the column(s) to be pseudo-anonymized, the column(s) to be filtered out, etc., based on the policy for S. If the user policy flag is set to TRUE, the MapReduce program will also do pre-processing based on user policies.

Step (5): The PPP job is given high priority and put into execution. The PPP job outputs the modified data into a new table T′.

Step (6): When the PPP job is finished, PPP replaces T with T′ in the original job's input specification and sends the job to the job queue.

Step (7): When the application job is finished, PPP deletes table T′ if required by the policy for S.
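As an illustration of step (6) only (the paper gives no code), a PPP implementation reading from HBase might swap the input table name in the job configuration and resubmit the job, along the lines of the following sketch. The method, its parameters and the use of the Hadoop 1-era Job constructor are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical sketch: point the client's job at the pre-processed table T'
// (produced by the PPP job) and submit it to the job queue.
public class ResubmitWithPreprocessedInput {
    public static Job resubmit(Configuration originalJobConf, String preprocessedTable,
                               String jobName) throws Exception {
        Configuration conf = new Configuration(originalJobConf);
        conf.set(TableInputFormat.INPUT_TABLE, preprocessedTable); // replace T with T'
        Job job = new Job(conf, jobName);
        job.submit();   // the application job now reads only the anonymized/filtered copy
        return job;
    }
}
```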
As a simple example, assume the application policy for a marketing service S states that S is allowed to access location data under the condition that user identifiers have been pseudo-anonymized. The policy also specifies the pseudo-anonymization algorithm to be a hash of the user identifier. We further assume there are three users, Alice, Bob and Dan. Alice and Dan do not have any specific policy for their location data, while Bob has a policy forbidding his location data to be used by service S. Figure 2 shows the original input table to service S, with the subscriber names and location coordinates of all three users.

Figure 2. Original input table to service S

After policy-based pre-processing, a new table is created and used as input to S instead. As shown in Figure 3, in this new table the subscriber name column contains hash values, and the location columns contain only two users' latitude and longitude coordinates (Bob's data has been removed).

Figure 3. New input table to service S

4.4 Advantages and Disadvantages of Policy-based Pre-Processing

Policy-based pre-processing allows anonymization, filtering, aggregation, encryption and other privacy/security mechanisms to be applied to sensitive data before it is accessed by applications, supporting conditional access. Compared to manually arranged pre-processing, the introduction of PPP makes policy-based pre-processing an integral part of Hadoop MapReduce, reducing administrative overhead. Because the pre-processing is per-application, it is more dynamic and flexible in addressing application specifics. By filtering out data that is not needed in advance, the amount of data the application needs to deal with is reduced and runtime access control is simplified. Furthermore, applications will not even have a chance to try to snoop on data they are not supposed to access, whether through unintentional programming errors or malicious code.

An obvious disadvantage of the proposed policy-based pre-processing in MapReduce is that the execution of the analytics application gets delayed. This is a price we have to pay to meet preconditions such as anonymization, encryption and aggregation, which usually cannot be satisfied at the time the data is being accessed. If only data filtering is required by the application and user policies, it could probably be achieved by runtime access control without any pre-processing. On the other hand, since the pre-processing is itself carried out in a MapReduce job, it takes advantage of the parallel processing capability of Hadoop and therefore does not introduce very long latency. Another disadvantage is that the introduction of PPP affects the existing MapReduce entities. In order to support policy-based pre-processing, the Job Tracker needs to interact with PPP and keep it informed about the execution status of PPP jobs. The Task Trackers also need additional capability to handle user privacy policies.

5. Conclusions

As a new source of energy for our modern society, big data could facilitate business innovation and economic growth. But there is a risk that this new energy pollutes user privacy. To allow valuable data usage while preserving user privacy, data pre-processing such as anonymization and aggregation is often needed to hide sensitive information from analytics applications.
In this paper we have introduced application and user specific privacy policies to the Apache Hadoop big data platform. We have also introduced a dedicated pre-processor (PPP) that evaluates those privacy policies and arranges the necessary data pre-processing before releasing the data to analytics applications. By integrating policy-based pre-processing into the MapReduce job submission procedure, privacy and security mechanisms can be applied on a per-application basis to provide higher flexibility and efficiency.

We have described the proposed policy-based pre-processing in the context of Hadoop version 1. In Hadoop version 2, a different resource management model is used and the MapReduce entities have changed. Our future work will focus on the integration of PPP into the new MapReduce application execution path, but we do not expect big changes to our proposal.
References

[1] Apache Hadoop.
[2] K. T. Smith, "Big Data Security: The Evolution of Hadoop's Security Model," Aug.
[3] D. Das, O. O'Malley, S. Radia and K. Zhang, "Adding Security to Apache Hadoop," Hortonworks Technical Report, Oct.
[4] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," Proc. of the IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, pages 1-10.
[5] Apache HBase.
[6] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI).
[7] Apache ZooKeeper.
[8] A. Narayanan and V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets," Proc. of the 2008 IEEE Symposium on Security and Privacy, May 2008.
[9] L. Sweeney, "Uniqueness of Simple Demographics in the U.S. Population," Carnegie Mellon University Technical Report LIDAP-WP4.
[10] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," International Journal of Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5).
[11] A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," Proc. of the 22nd International Conference on Data Engineering (ICDE).
[12] N. Li, T. Li and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," Proc. of the IEEE 23rd International Conference on Data Engineering (ICDE).
