Policy-based Pre-Processing in Hadoop

Yi Cheng, Christian Schaefer
Ericsson Research, Stockholm, Sweden

Abstract

While big data analytics provides useful insights that benefit business and society, it also brings big privacy concerns. To allow valuable data usage while preserving user privacy, data anonymization is often a precondition for analytics applications to access sensitive user data. As one of the most popular platforms for big data storage and processing, Apache Hadoop provides access control based on file permissions or access control lists, but currently does not support conditional access. In this paper we introduce application and user privacy policies and a policy-based pre-processor to Hadoop. This pre-processor can perform anonymization, filtering, aggregation, encryption and other modifications to sensitive data before they are given to an application. By integrating policy-based pre-processing into the MapReduce framework, privacy and security mechanisms can be applied on a per-application basis to provide higher flexibility and efficiency.

Keywords: privacy; policy; authorization; conditional access; anonymization; pre-processing; Hadoop.

1. Introduction

In the era of big data, massive amounts of data are collected, processed and analyzed to provide insights that could benefit different aspects of our society, including communication, business, science, and healthcare. However, big data also brings big privacy concerns. User privacy is at risk if their data, or data about them, are not properly protected. To allow valuable data usage while preserving user privacy, data anonymization is often a precondition for analytics applications to access sensitive user data such as location, medical records, and financial transactions. As one of the most popular platforms for big data storage and processing, Apache Hadoop [1] needs to protect data from unauthorized access.
So far, Hadoop provides access control based on file permissions or access control lists (ACLs), which specify who can do what on the associated data. It is currently not possible to specify the conditions under which access can be granted. Nor does Hadoop provide native support for data pre-processing (anonymization, filtering, aggregation, encryption, etc.) in order to satisfy access conditions. In this paper we introduce application- and user-specific privacy policies and a policy-based pre-processor to Hadoop to support data pre-processing and conditional access.

The rest of the paper is organized as follows. In Section 2 we review the authorization mechanisms currently employed by Hadoop. In Section 3 we introduce application- and user-specific privacy policies to enable more flexible access control. Per-application pre-processing based on these policies is described in detail in Section 4; advantages and disadvantages of the proposal are also discussed in that section. Finally, in Section 5 we draw conclusions and indicate directions for future work.

2. Overview of Hadoop Authorization Mechanisms

Apache Hadoop is an open source framework that provides distributed storage and processing of large data sets on clusters of commodity computers. Originally designed without security in mind, Hadoop neither authenticated users nor enforced access control. With the increased adoption of Hadoop for big data processing and analytics, security vulnerabilities and privacy risks have been pointed out and corresponding enhancements made [2],[3]. The Hadoop Distributed File System (HDFS) [4] and HBase [5], the basic data stores for MapReduce [6] batch processing, employ authorization mechanisms to protect data privacy.

2.1 HDFS

For access control, HDFS uses UNIX-like permission bits for read, write and execute operations. Each file and directory has three sets of such permission bits, associated with the owner, the group and others, respectively.
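As an illustration, this owner/group/others permission check can be sketched in plain Python. This is a simplified model of the UNIX-style scheme described above, not the actual NameNode implementation; all names are illustrative.

```python
# Simplified model of HDFS/UNIX permission checking (illustrative only).
READ, WRITE, EXECUTE = 4, 2, 1

def may_access(mode, user, group, owner, owner_group, op):
    """Check op (READ/WRITE/EXECUTE) against a 3-digit octal mode
    such as 0o640, selecting the owner, group, or others bit set."""
    if user == owner:
        bits = (mode >> 6) & 7   # owner bits
    elif group == owner_group:
        bits = (mode >> 3) & 7   # group bits
    else:
        bits = mode & 7          # others bits
    return bool(bits & op)

# A file with mode 640 (rw-r-----) owned by alice in group analysts:
print(may_access(0o640, "alice", "analysts", "alice", "analysts", WRITE))  # True
print(may_access(0o640, "bob", "analysts", "alice", "analysts", READ))     # True
print(may_access(0o640, "eve", "guests", "alice", "analysts", READ))       # False
```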
Before an application can open a file for reading or writing, the NameNode checks the relevant permission bit to see whether the application (or, more precisely, the user on whose behalf the application is running) is allowed to do so.

2.2 HBase

In HBase, the Access Controller co-processor running inside each Region Server intercepts every access request and consults an in-memory ACL table to allow or deny the requested operation. The ACL table regulates read/write/create/admin permissions at the table, column family (CF) and column (qualifier) level, but not at the cell level. For cell-level permission checking, HBase has recently introduced per-cell ACLs that are stored as cell tags. The Access Controller evaluates the union of the table/CF ACL and the cell-level ACL to decide whether the cell in question can be accessed.

3. Privacy Policies

As seen in Section 2, HDFS file permissions and HBase ACLs specify who can do what on the associated data. They can be used to define simple privacy policies but do not provide good support for more complex and flexible policies. One important feature that is lacking is the ability to specify the conditions under which access can be granted. For example, it is not possible to allow a statistical application to access online shopping records under the condition that user identifiers have been removed. As users are becoming more and more privacy concerned, they should be given the possibility to specify their own privacy policies. Continuing with the previous example, some users may choose not to participate in the statistics at all. The statistical application should therefore not be allowed to access the online shopping records of those users even if their identities are not disclosed. Assuming all users' online shopping records are stored together in one big table in which each row contains information about one transaction of one user, we cannot give the statistical application access to all data items in a column, for instance the goods name column. Thus access permissions at the file, table, or column level are not enough in this case.

HBase's cell-level ACLs may be used to reflect individual users' policy settings, but they have some drawbacks. In a big table each user may have a large number of records. Since a user's policy regarding a specific type of data is unlikely to change from record to record, the same policy will be stored as a per-cell ACL repeatedly, wasting storage space. As mentioned earlier, a cell-level ACL is stored as a cell tag together with the data. It has to be read from storage before the Access Controller co-processor can make an authorization decision. This is much slower than reading the ACL table from local memory. Reading the same policy again and again from different cells further wastes system resources. To solve the above problems we add two types of privacy policies to Hadoop: application policies and user policies.

3.1 Application Policies

As the name implies, application policies are per-application. They should be defined according to the relevant legal requirements and organization and business rules, which depend on application properties (purpose, core or non-core business, internal or external application, etc.). For example, one legal requirement could be that private data collected for core business (e.g. customer care) is not allowed to be used for non-core business (e.g. marketing) without explicit user consent.
Assuming user consent has been obtained, a business rule could impose the further restriction that the same data set used for customer care must be anonymized before it is used for marketing purposes. Accordingly, anonymization, aggregation, encryption or other data modifications should be put as preconditions in application policies for analytics applications to access sensitive user data. For each application that will be run on the Hadoop cluster, a policy is defined that includes the following information:

- Application id
- The types of data the application is allowed to access (or not allowed to access)
- If access is allowed only under certain conditions:
  o If the condition is anonymization: the anonymization algorithm and necessary parameters
  o If the condition is aggregation: the aggregation method and the data types to be aggregated
- A user policy flag indicating whether users can have individual privacy policies for this application
- A flag indicating whether the new data version created after pre-processing should be deleted after usage

Other information, e.g. the encryption algorithm, can also be specified in the application policy if needed. Application policies are securely stored in a policy database and loaded into the memory of the policy-based pre-processor when it starts up.

3.2 User Policies

User policies are defined according to the privacy preferences of individual users. User policies may express deviations from application policies. For instance, consider an application in which all users in the system are enrolled by default. The policy for this application allows it to access users' location data, but a few users decide to opt out of this application. These users' policies will then forbid that specific application to access their location data. In normal cases user policies take precedence over an application's policy, unless the application is so critical that it must have access to all users' data.
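A hypothetical encoding of these two policy types and the precedence rule might look as follows. The field names and the dictionary layout are our own illustrative choices, not an actual Hadoop or PPP schema:

```python
# Hypothetical application policy record with the fields listed above
# (field names are illustrative, not a real PPP schema).
app_policy = {
    "app_id": "S",
    "allowed_types": {"location"},
    "conditions": {"location": ("pseudo-anonymize", {"algorithm": "hash"})},
    "user_policy_flag": True,   # users may have individual policies
    "delete_after_use": True,
    "critical": False,          # a critical app would override user opt-outs
}

# Per-user deviations: Bob opts out of sharing location data with app S.
user_policies = {("bob", "S"): {"deny": {"location"}}}

def access_allowed(app, user, data_type):
    """Application policy first; user policy takes precedence unless
    the application is marked critical."""
    if data_type not in app["allowed_types"]:
        return False
    deviation = user_policies.get((user, app["app_id"]), {})
    if app["user_policy_flag"] and not app["critical"] \
            and data_type in deviation.get("deny", set()):
        return False
    return True

print(access_allowed(app_policy, "alice", "location"))  # True
print(access_allowed(app_policy, "bob", "location"))    # False
```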
But in that circumstance users are most likely not given the possibility to opt out in the first place.

Different from cell-level ACLs, user policies are cached in the memory of each MapReduce TaskTracker for fast policy retrieval. We store the master copy of user policies in ZooKeeper [7]. Every TaskTracker obtains a copy of the policies from ZooKeeper and loads it into memory at startup time. Whenever a change is made to the user policies, ZooKeeper notifies the TaskTrackers and ensures their local copies get updated.

4. Per-application Data Pre-Processing

Different applications usually have different purposes, use different input data and produce different outputs. Different legal requirements and/or business rules apply to them, as discussed in Section 3.1. In case the same data set is used by multiple applications, it has to be, for instance, anonymized in different ways for different applications. Even if the same anonymization algorithm can be used, the relevant parameters should be tuned according to the application specifics. So data pre-processing, if required by privacy policies, needs to be done on a per-application basis. For applications that are run periodically, as the input data has presumably changed since the last run, pre-processing has to be carried out again at each new run. To perform data pre-processing on a per-application basis, the following steps need to be taken for each application:

1. Based on legal requirements, organization and business rules, determine whether data pre-processing is required before the application is allowed to access the requested data. If so, determine the necessary algorithm and parameters.
2. Do the determined pre-processing on the original data set and create a new version of the data set.
3. Give this new (e.g. anonymized) version of the data set, instead of the original one, as input to the application.
4. When the application is finished, delete the new version if no longer needed.

Manually carrying out the above process for each application will obviously incur a lot of administrative overhead. Below we introduce a policy-based pre-processor to the Hadoop framework to make per-application data pre-processing an integral part of MapReduce job execution.

4.1 Policy-based Pre-Processor (PPP)

As described in Section 3.1, application policies regulate which application can do what on which data, under which conditions. When an application is submitted as a job to the MapReduce JobTracker, PPP examines the application policy associated with this application and, if required by the policy, initiates anonymization and/or other types of pre-processing to create a new version of the input data set. As this new version is created specifically for the application in question, the application should have full access to the content and no one else should have any access. The file permissions or ACLs for the new version are set accordingly (the settings on the original data set may not be valid after pre-processing and are therefore not applicable to the new data set). PPP re-submits the job with this new version of the data set as input. If necessary, PPP deletes the new data set after the job is done.

PPP-initiated pre-processing is itself carried out in a MapReduce job, taking advantage of the parallel processing capability of Hadoop. We call such a job a PPP job. A PPP job is given high priority to ensure that it will be picked up by the job scheduler for execution as early as possible. Running under a privileged account (e.g. the Hadoop user account), PPP jobs have unrestricted access to data stored in the Hadoop cluster.

4.2 Types of Pre-Processing

A PPP job can perform any kind of data pre-processing, including anonymization, filtering, aggregation, encryption and other data modifications.
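As a minimal illustration of two such transformations, the following plain-Python sketch filters out opted-out users and then pseudo-anonymizes the remaining identifiers by hashing. It operates on an in-memory table, not a MapReduce job from the PPP library, and all names and data are made up for the example:

```python
import hashlib

# Illustrative in-memory table; in Hadoop this would be an HDFS/HBase table.
rows = [
    {"user": "Alice", "goods": "book"},
    {"user": "Bob",   "goods": "phone"},   # Bob has opted out
    {"user": "Dan",   "goods": "laptop"},
]
opted_out = {"Bob"}

def filter_rows(table, opted_out):
    """Filtering: drop rows of users whose policy excludes them."""
    return [r for r in table if r["user"] not in opted_out]

def pseudo_anonymize(table):
    """Pseudo-anonymization: replace identifiers with an artificial value
    (here a truncated SHA-256 hash)."""
    return [dict(r, user=hashlib.sha256(r["user"].encode()).hexdigest()[:12])
            for r in table]

# Order matters: filter on the real identifiers first, then hash them.
new_version = pseudo_anonymize(filter_rows(rows, opted_out))
print([r["goods"] for r in new_version])  # ['book', 'laptop']
```

Note that running the two steps in the opposite order would fail: once identifiers are hashed, the opt-out list can no longer be matched against them, which is exactly the ordering concern raised below.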
If two or more types of pre-processing are to be done, the order between them should be carefully considered. Taking the online shopping statistics application from Section 3 as an example again, filtering of the goods name column based on user privacy policies (i.e. removing the goods name on those rows corresponding to users who do not want their data to be included in the statistics) must be performed before the user identifiers are removed or pseudo-anonymized.

Pseudo-anonymization provides a simple means of privacy protection by replacing user identifiers in a data set with some artificial value. It has been widely used to pre-process data sets that contain sensitive information. However, with the fast development of data mining and data analytics techniques, pseudo-anonymization has been proven not strong enough to prevent sophisticated re-identification attacks [8],[9]. More advanced anonymization techniques, for example k-anonymity [10], l-diversity [11] and t-closeness [12], may be needed. We assume that PPP has a library of MapReduce programs that implement different (pseudo-)anonymization, aggregation and encryption algorithms. They can also do filtering on given fields in a data set.

4.3 Integration into MapReduce

In Hadoop, applications can access data in an HDFS file or a database table (HBase, or Hive which is built on top of HDFS). For ease of description, we assume that data is stored in tables with rows and columns. To enable per-application data pre-processing, we propose to modify the MapReduce job submission procedure as shown in Figure 1. Before a client-submitted job is placed into the job queue, the JobTracker interacts with PPP to perform the necessary pre-processing on the input data, if required by the application policy. Below we describe the modified procedure step by step (as numbered in Figure 1).

Figure 1. Modified MapReduce job submission procedure with PPP

Step (1): The client (on whose behalf the application will run) sends a request to the JobTracker for creating a new MapReduce job, including the job JAR file, the configuration file, and input and output specifications. Table T is specified as the data input. In addition to this standard information, the application id S is also provided.

Step (2): The application job request is forwarded to PPP, which is a new component introduced to the Hadoop framework.

Step (3): PPP retrieves the application policy for S from memory and, based on that, identifies the data types S is allowed or not allowed to access and under which conditions. Using the table metadata of T, PPP maps the identified data types to the columns in T.

Step (4): PPP creates a MapReduce job to pre-process the data in T, using an existing MapReduce program in the form of a JAR file that can perform anonymization (pseudo or more advanced anonymization), filtering, aggregation, encryption and other data modifications. PPP also specifies the necessary parameters, e.g. the anonymization algorithm to use, the column(s) to be pseudo-anonymized, the column(s) to be filtered out, etc., based on the policy for S. If the user policy flag is set to TRUE, the MapReduce program will also do pre-processing based on user policies.

Step (5): The PPP job is given high priority and put into execution. The PPP job outputs the modified data into a new table T'.

Step (6): When the PPP job is finished, PPP replaces T with T' in the original job's input specification and sends it to the job queue.

Step (7): When the application job is finished, PPP deletes table T' if required by the policy for S.

As a simple example, assume the application policy for a marketing service S states that S is allowed to access location data under the condition that user identifiers have been pseudo-anonymized. The policy also specifies the pseudo-anonymization algorithm to be a hash of the user identifier. We further assume there are three users: Alice, Bob and Dan. Alice and Dan do not have any specific policy for their location data, while Bob has a policy forbidding his location data to be used by service S. Figure 2 shows the original input table to service S with the subscriber names and location coordinates of all three users.

Figure 2. Original input table to service S

After policy-based pre-processing, a new table is created and used as input to S instead. As shown in Figure 3, in this new table the subscriber name column contains hash values and the location columns contain only two users' latitude and longitude coordinates (Bob's data has been removed).

Figure 3. New input table to service S

4.4 Advantages and Disadvantages of Policy-based Pre-Processing

Policy-based pre-processing allows anonymization, filtering, aggregation, encryption and other privacy/security mechanisms to be applied to sensitive data before it is accessed by applications, supporting conditional access. Compared to manually arranged pre-processing, the introduction of PPP makes policy-based pre-processing an integral part of Hadoop MapReduce, reducing administrative overhead. Because the pre-processing is per application, it is more dynamic and flexible in addressing application specifics. By filtering out data that is not needed in advance, the amount of data the application needs to deal with is reduced and runtime access control is simplified.
Furthermore, applications will not even have a chance to try to snoop on data they are not supposed to access, whether due to unintentional programming errors or malicious code.

An obvious disadvantage of the proposed policy-based pre-processing in MapReduce is that the execution of analytics applications gets delayed. This is a price we have to pay to meet preconditions such as anonymization, encryption and aggregation that usually cannot be satisfied at the time the data is accessed. If only data filtering is required by the application and user policies, it could probably be achieved by runtime access control without any pre-processing. On the other hand, since the pre-processing is itself carried out in a MapReduce job, it takes advantage of the parallel processing capability of Hadoop and therefore does not introduce very long latency. Another disadvantage is that the introduction of PPP affects the existing MapReduce entities. In order to support policy-based pre-processing, the JobTracker needs to interact with the PPP and keep it informed about the execution status of PPP jobs. The TaskTrackers also need additional capability to handle user privacy policies.

5. Conclusions

As a new source of energy for our modern society, big data could facilitate business innovation and economic growth. But there is a risk that this new energy pollutes user privacy. To allow valuable data usage while preserving user privacy, data pre-processing such as anonymization and aggregation is often needed to hide sensitive information from analytics applications. In this paper we have introduced application- and user-specific privacy policies to the Apache Hadoop big data platform. We have also introduced a dedicated pre-processor (PPP) that evaluates those privacy policies and arranges the necessary data pre-processing before releasing the data to analytics applications.
By integrating policy-based pre-processing into the MapReduce job submission procedure, privacy and security mechanisms can be applied on a per-application basis to provide higher flexibility and efficiency.

We have described the proposed policy-based pre-processing in the context of Hadoop version 1. In Hadoop version 2, a different resource management model is used and the MapReduce entities have changed. Our future work will focus on the integration of PPP into the new MapReduce application execution path, but we do not expect big changes to our proposal.

References

[1] Apache Hadoop,
[2] K. T. Smith, Big Data Security: The evolution of Hadoop's Security Model, Aug.
[3] D. Das, O. O'Malley, S. Radia and K. Zhang, Adding Security to Apache Hadoop, Hortonworks Technical Report, Oct.
[4] K. Shvachko, H. Kuang, S. Radia and R. Chansler, The Hadoop Distributed File System, Proc. of IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, 0:1-10.
[5] Apache HBase,
[6] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Proc. of the 6th Symposium on Operating Systems Design & Implementation (OSDI).
[7] Apache ZooKeeper,
[8] A. Narayanan and V. Shmatikov, Robust De-anonymization of Large Sparse Datasets, Proc. of 2008 IEEE Symposium on Security and Privacy, May 2008.
[9] L. Sweeney, Uniqueness of Simple Demographics in the U.S. Population, Carnegie Mellon University Technical Report LIDAP-WP4.
[10] L. Sweeney, k-anonymity: A Model for Protecting Privacy, International Journal of Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5).
[11] A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, l-diversity: Privacy Beyond k-anonymity, Proc. of the 22nd International Conference on Data Engineering.
[12] N. Li, T. Li and S. Venkatasubramanian, t-closeness: Privacy Beyond k-anonymity and l-diversity, Proc. of IEEE 23rd International Conference on Data Engineering.

ASE 2014 ISBN:


More information

Fast Data in the Era of Big Data: Twitter s Real-

Fast Data in the Era of Big Data: Twitter s Real- Fast Data in the Era of Big Data: Twitter s Real- Time Related Query Suggestion Architecture Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin Presented by: Rania Ibrahim 1 AGENDA Motivation

More information

WHAT S NEW IN SAS 9.4

WHAT S NEW IN SAS 9.4 WHAT S NEW IN SAS 9.4 PLATFORM, HPA & SAS GRID COMPUTING MICHAEL GODDARD CHIEF ARCHITECT SAS INSTITUTE, NEW ZEALAND SAS 9.4 WHAT S NEW IN THE PLATFORM Platform update SAS Grid Computing update Hadoop support

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems Proactively address regulatory compliance requirements and protect sensitive data in real time Highlights Monitor and audit data activity

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Privacy Policy. Introduction. Scope of Privacy Policy. 1. Definitions

Privacy Policy. Introduction. Scope of Privacy Policy. 1. Definitions Privacy Policy Introduction This Privacy Policy explains what information TORO Limited and its related entities ("TORO") collect about you and why, what we do with that information, how we share it, and

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

The Hadoop Distributed File System

The Hadoop Distributed File System The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS

More information

Cloud Storage Solution for WSN Based on Internet Innovation Union

Cloud Storage Solution for WSN Based on Internet Innovation Union Cloud Storage Solution for WSN Based on Internet Innovation Union Tongrang Fan 1, Xuan Zhang 1, Feng Gao 1 1 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang,

More information

Unless otherwise stated, our SaaS Products and our Downloadable Products are treated the same for the purposes of this document.

Unless otherwise stated, our SaaS Products and our Downloadable Products are treated the same for the purposes of this document. Privacy Policy This Privacy Policy explains what information Fundwave Pte Ltd and its related entities ("Fundwave") collect about you and why, what we do with that information, how we share it, and how

More information

CDH AND BUSINESS CONTINUITY:

CDH AND BUSINESS CONTINUITY: WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Comprehensive Analytics on the Hortonworks Data Platform

Comprehensive Analytics on the Hortonworks Data Platform Comprehensive Analytics on the Hortonworks Data Platform We do Hadoop. Page 1 Page 2 Back to 2005 Page 3 Vertical Scaling Page 4 Vertical Scaling Page 5 Vertical Scaling Page 6 Horizontal Scaling Page

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Performance Analysis of Book Recommendation System on Hadoop Platform

Performance Analysis of Book Recommendation System on Hadoop Platform Performance Analysis of Book Recommendation System on Hadoop Platform Sugandha Bhatia #1, Surbhi Sehgal #2, Seema Sharma #3 Department of Computer Science & Engineering, Amity School of Engineering & Technology,

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Arnab Roy Fujitsu Laboratories of America and CSA Big Data WG

Arnab Roy Fujitsu Laboratories of America and CSA Big Data WG Arnab Roy Fujitsu Laboratories of America and CSA Big Data WG 1 The Big Data Working Group (BDWG) will be identifying scalable techniques for data-centric security and privacy problems. BDWG s investigation

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

The Improved Job Scheduling Algorithm of Hadoop Platform

The Improved Job Scheduling Algorithm of Hadoop Platform The Improved Job Scheduling Algorithm of Hadoop Platform Yingjie Guo a, Linzhi Wu b, Wei Yu c, Bin Wu d, Xiaotian Wang e a,b,c,d,e University of Chinese Academy of Sciences 100408, China b Email: [email protected]

More information

Data Security in Hadoop

Data Security in Hadoop Data Security in Hadoop Eric Mizell Director, Solution Engineering Page 1 What is Data Security? Data Security for Hadoop allows you to administer a singular policy for authentication of users, authorize

More information

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM 2013 HareDB HBase Client Web Version USER MANUAL HAREDB TEAM Connect to HBase... 2 Connection... 3 Connection Manager... 3 Add a new Connection... 4 Alter Connection... 6 Delete Connection... 6 Clone Connection...

More information

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Sector vs. Hadoop. A Brief Comparison Between the Two Systems Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector

More information

The Hadoop Distributed File System

The Hadoop Distributed File System The Hadoop Distributed File System The Hadoop Distributed File System, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, Yahoo, 2010 Agenda Topic 1: Introduction Topic 2: Architecture

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Database security. André Zúquete Security 1. Advantages of using databases. Shared access Many users use one common, centralized data set

Database security. André Zúquete Security 1. Advantages of using databases. Shared access Many users use one common, centralized data set Database security André Zúquete Security 1 Advantages of using databases Shared access Many users use one common, centralized data set Minimal redundancy Individual users do not have to collect and maintain

More information

A Study of Data Management Technology for Handling Big Data

A Study of Data Management Technology for Handling Big Data Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

APACHE HADOOP JERRIN JOSEPH CSU ID#2578741

APACHE HADOOP JERRIN JOSEPH CSU ID#2578741 APACHE HADOOP JERRIN JOSEPH CSU ID#2578741 CONTENTS Hadoop Hadoop Distributed File System (HDFS) Hadoop MapReduce Introduction Architecture Operations Conclusion References ABSTRACT Hadoop is an efficient

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]

More information

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding

More information

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE Hadoop Storage-as-a-Service ABSTRACT This White Paper illustrates how EMC Elastic Cloud Storage (ECS ) can be used to streamline the Hadoop data analytics

More information

HDFS Federation. Sanjay Radia Founder and Architect @ Hortonworks. Page 1

HDFS Federation. Sanjay Radia Founder and Architect @ Hortonworks. Page 1 HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching 1 Project Proposal Data Storage / Retrieval with Access Control, Security and Pre- Presented By: Shashank Newadkar Aditya Dev Sarvesh Sharma Advisor: Prof. Ming-Hwa Wang COEN 241 - Cloud Computing Page

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

Hadoop Distributed FileSystem on Cloud

Hadoop Distributed FileSystem on Cloud Hadoop Distributed FileSystem on Cloud Giannis Kitsos, Antonis Papaioannou and Nikos Tsikoudis Department of Computer Science University of Crete {kitsos, papaioan, tsikudis}@csd.uoc.gr Abstract. The growing

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

Trafodion Operational SQL-on-Hadoop

Trafodion Operational SQL-on-Hadoop Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Research on Job Scheduling Algorithm in Hadoop

Research on Job Scheduling Algorithm in Hadoop Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of

More information

Introduction to Apache YARN Schedulers & Queues

Introduction to Apache YARN Schedulers & Queues Introduction to Apache YARN Schedulers & Queues In a nutshell, YARN was designed to address the many limitations (performance/scalability) embedded into Hadoop version 1 (MapReduce & HDFS). Some of the

More information

Big Data Security. Kevvie Fowler. kpmg.ca

Big Data Security. Kevvie Fowler. kpmg.ca Big Data Security Kevvie Fowler kpmg.ca About myself Kevvie Fowler, CISSP, GCFA Partner, Advisory Services KPMG Canada Industry contributions Big data security definitions Definitions Big data Datasets

More information

Privacy Techniques for Big Data

Privacy Techniques for Big Data Privacy Techniques for Big Data The Pros and Cons of Syntatic and Differential Privacy Approaches Dr#Roksana#Boreli# SMU,#Singapore,#May#2015# Introductions NICTA Australia s National Centre of Excellence

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Fair Scheduler. Table of contents

Fair Scheduler. Table of contents Table of contents 1 Purpose... 2 2 Introduction... 2 3 Installation... 3 4 Configuration...3 4.1 Scheduler Parameters in mapred-site.xml...4 4.2 Allocation File (fair-scheduler.xml)... 6 4.3 Access Control

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information