
UT Dallas, Erik Jonsson School of Engineering & Computer Science
Secure Data Storage and Retrieval in the Cloud

Agenda
- Motivating example
- Current work in related areas
- Our approach
- Contributions of this paper
- System architecture
- Experimental results
- Conclusions and future work

Motivating Example
- Current trend: large volumes of data are generated by Twitter, Amazon.com, and Facebook
- This data would be more useful if it could be correlated to form business partnerships and research collaborations
- Two obstacles stand in the way of this kind of data sharing:
  - Arranging a large common storage area
  - Providing secure access to the shared data

Motivating Example
Addressing these challenges:
- Cloud computing technologies such as Hadoop HDFS provide a good platform for creating a large, common storage area
- A data warehouse infrastructure such as Hive provides a mechanism to structure the data in HDFS files, and allows ad hoc querying and analysis of this data
- Policy languages such as XACML allow us to specify access controls over data
- This paper proposes an architecture that combines Hadoop HDFS, Hive, and XACML to provide fine-grained access control over shared data

Current Work
- Work has been done on security issues with cloud computing technologies
- Hadoop v0.20 proposes solutions to current security problems in Hadoop; this work is in its inception stage and proposes a simple access control list (ACL) based security mechanism
- Our system adds another layer of security on top of this mechanism; as the proposed Hadoop security becomes more robust, it will only strengthen our system

Current Work
- Amazon Web Services (AWS) provides a web services infrastructure platform in the cloud
- To use AWS we would need to store data in an encrypted format, since the AWS infrastructure is in the public domain
- Our system is trusted because the entire infrastructure is in the private domain

Current Work
- The Windows Azure platform is an Internet-scale cloud computing services platform
- This platform is suitable for building new applications, but not for migrating existing ones
- We did not use this platform because we wanted to port our existing application to an open source environment
- We also did not want to be tied to the Windows framework, but rather to allow this system to be used on any platform

Contributions of this paper
- Create an open source application that combines existing open source technologies such as Hadoop and Hive with a policy language such as XACML to provide fine-grained access control over data
- Ensure that the new system does not incur a significant performance penalty compared to using Hadoop and Hive directly

System Architecture

System Architecture - Web Application Layer
- This layer is the only interface our system provides to the user
- It offers different functions based on a user's permissions:
  - users who can query the existing tables/views
  - users who can create tables/views and define policies on them, in addition to being able to query
  - an admin user who, in addition to the above, can also assign new users to either of the above categories
- We use the salted hash technique to store usernames/passwords in a secure location
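The salted hash technique mentioned above can be sketched as follows; the class and method names are illustrative and not taken from the actual system:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Illustrative sketch of salted password hashing; not the paper's actual code.
class SaltedHash {
    // Generate a random 16-byte salt for a new user.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // Hash the salt concatenated with the password using SHA-256,
    // returning the digest as a hex string.
    static String hash(byte[] salt, String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(salt);
            md.update(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // At login, recompute the hash with the stored salt and compare.
    static boolean verify(byte[] storedSalt, String storedHash, String attempt) {
        return hash(storedSalt, attempt).equals(storedHash);
    }
}
```

Because each user gets a fresh random salt, identical passwords produce different stored hashes, which is the point of salting.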

System Architecture - ZQL Parser Layer
- ZQL is a Java-based SQL parser
- The parser layer takes a user query as input; it continues to the policy layer if the query parses successfully, or returns an error message otherwise
- The variables in the SELECT clause are returned to the web application layer to be used in the results
- The tables/views in the FROM clause are passed to the policy evaluator
- The parser currently supports the SQL DELETE, INSERT, SELECT, and UPDATE statements
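The FROM-clause extraction step can be illustrated with a much-simplified stand-in. The sketch below is not the ZQL API; it only shows, with a regular expression, how table/view names might be pulled out of a simple query's FROM clause:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration only; the real system uses the ZQL parser.
class FromExtractor {
    // Pull the table/view names out of a simple SELECT ... FROM ... query.
    static List<String> tablesIn(String query) {
        List<String> tables = new ArrayList<>();
        Matcher m = Pattern
                .compile("(?i)FROM\\s+([\\w,\\s]+?)(?:\\s+WHERE|\\s+LIMIT|;|$)")
                .matcher(query);
        if (m.find()) {
            // A comma-separated FROM list yields one name per entry.
            for (String t : m.group(1).split(",")) {
                tables.add(t.trim());
            }
        }
        return tables;
    }
}
```

In the actual system this list of tables/views is what gets handed to the XACML policy evaluator.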

System Architecture - XACML Policy Layer
XACML Policy Builder
- Tables/views are treated as resources for building policies
- We use a table/view to query-type mapping, e.g.:

    table1: SELECT, INSERT
    view1:  SELECT

  to create policies using Sun's XACML implementation
- Since a view is constructed from one or more tables, this allows us to define fine-grained access controls over the data
- A user can upload their own pre-defined policies, or have the system build the policy for them at the time of table/view creation

System Architecture - XACML Policy Layer
XACML Policy Evaluator
- Use the query-type to user mapping, e.g.:

    SELECT: user1, user2
    INSERT: user1, user3

  to extract the kinds of queries that a user can execute
- Use Sun's implementation to verify whether a given query type can be executed on all tables/views named in a user query
- If permission is granted for all tables/views, the query is processed further; otherwise an error is returned
- The policy evaluator is used during query execution as well as during table/view creation
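Conceptually, the evaluator's query-type to user mapping behaves like a lookup table. In this hedged sketch, plain Java maps stand in for the actual evaluation that Sun's XACML implementation performs; the class name and mapping contents are illustrative:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the XACML policy evaluator; the real system
// delegates this decision to Sun's XACML implementation.
class PolicyCheck {
    // query type -> users allowed to run that query type
    static final Map<String, Set<String>> allowed = new HashMap<>();
    static {
        allowed.put("SELECT", new HashSet<>(Arrays.asList("user1", "user2")));
        allowed.put("INSERT", new HashSet<>(Arrays.asList("user1", "user3")));
    }

    // The query proceeds only if the user may run this query type;
    // the real evaluator repeats this check for every table/view in the query.
    static boolean permitted(String user, String queryType) {
        Set<String> users = allowed.get(queryType);
        return users != null && users.contains(user);
    }
}
```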

System Architecture - Basic Query Rewriting Layer
- Adds another layer of abstraction between the user and HiveQL
- Allows a user to enter SQL queries that are rewritten according to HiveQL's syntax
- Our system currently has two simple rewriting rules:

    SELECT a.id, b.age FROM a, b;   =>  SELECT a.id, b.age FROM a JOIN b;
    INSERT INTO a SELECT * FROM b;  =>  INSERT OVERWRITE TABLE a SELECT * FROM b;
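The two rules above can be sketched as simple pattern-based rewrites; this is an illustration only, as the slides do not describe how the rewriting engine is implemented:

```java
// Illustrative sketch of the two rewriting rules on the slide.
class QueryRewriter {
    static String rewrite(String sql) {
        // Rule 1: a two-table comma-separated FROM list becomes an explicit JOIN.
        sql = sql.replaceAll("(?i)FROM\\s+(\\w+)\\s*,\\s*(\\w+)", "FROM $1 JOIN $2");
        // Rule 2: SQL's INSERT INTO becomes HiveQL's INSERT OVERWRITE TABLE.
        sql = sql.replaceAll("(?i)INSERT\\s+INTO\\s+(\\w+)", "INSERT OVERWRITE TABLE $1");
        return sql;
    }
}
```

A fuller engine would handle FROM lists of more than two tables and leave quoted strings untouched, which is one reason the slides call this layer "basic".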

System Architecture - Hive Layer
- Hive is a data warehouse infrastructure built on top of Hadoop
- Hive allows us to impose structure, as tables/views, on files stored in the underlying HDFS
- Tables in Hive are defined over data in HDFS files, while a view is only a logical concept in Hive
- HiveQL is used to query the data in these tables/views

System Architecture - HDFS Layer
- HDFS is a distributed file system designed to run on commodity hardware
- In our framework, the HDFS layer stores the data files corresponding to tables created in Hive
- Security assumption: files in HDFS can be accessed only through our system, and neither through Hadoop's web interface nor through Hadoop's command-line interface

Experiments and Results
- Two datasets:
  - The Freebase system, an open repository of structured data with approximately 12 million topics
  - The TPC-H benchmark, a decision support benchmark built around a typical business organization schema
- For Freebase we constructed our own queries, while for TPC-H we used Q1, Q3, Q6, and Q13 from the 22 benchmark queries
- We tested table loading times and query times for both datasets

Experiments and Results
- Our system currently allows a user to upload files that are at most 1 GB in size; all loading times are therefore restricted by this condition
- For query times with larger datasets, we added the data to HDFS manually
- For all experiments, XACML policies were created such that the querying user was able to access all the necessary tables and views

Experiments and Results - Freebase
- Loading times for our system and for Hive are similar for small tables
- As the number of tuples increases, our system gets slower
- This time difference is attributed to transferring the data through a Hive JDBC connection to Hadoop

Experiments and Results - Freebase
- Our query times are slightly faster than Hive's
- This is because of the time Hive takes to display results on the screen
- Both running times are fast because Hive does not need a MapReduce job for this query, but simply returns the entire table

Experiments and Results - Freebase

Query                                                       System Time (sec)  Hive Time (sec)
SELECT name, id FROM Person LIMIT 100;                      27.1               28.4
SELECT id FROM Person WHERE name = 'Frank Mann' LIMIT 100;  30.2               30.5
CREATE VIEW Person_View AS SELECT name, id FROM Person;     0.19               0.11

Experiments and Results - TPC-H
- Similar to the Freebase results, our system gets slower as the number of tuples increases
- The trend is linear, since the table sizes increase linearly with the scale factor

Experiments and Results - TPC-H

Query  Scale Factor (SF)  System Time (sec)  Hive Time (sec)
Q6     100                605.24             590.66
Q6     300                1815.45            1806.4
Q6     1000               6240.33            6249.68
Q3     100                1675.19            1670.77
Q3     300                7532.23            7511.52
Q3     1000               61411.21           61390.71

Experiments and Results - TPC-H

Query  Scale Factor (SF)  System Time (sec)  Hive Time (sec)
Q13    100                870.70             847.52
Q13    300                1936.35            1910.19
Q13    1000               7322.54            7304.39
Q1     100                1210.04            1209.79
Q1     300                5407.14            5411.62
Q1     1000               42780.67           42768.83

Conclusions
- We presented a system that allows secure sharing of large amounts of information
- The system was designed using Hadoop and Hive to allow scalability
- XACML was used to provide fine-grained access control over the underlying tables/views
- We combined existing open source technologies in a unique way to provide fine-grained access control over data
- We ensured that our system does not incur a significant performance penalty compared to using Hadoop and Hive directly

Future Work
- Extend the ZQL parser with support for more SQL keywords
- Extend the basic query rewriting engine into a more sophisticated engine
- Implement materialized views in Hive and extend HiveQL with support for these views
- Extend the simple security mechanism with more query types, such as CREATE and DELETE
- Extend this work to include public clouds such as Amazon Simple Storage Service (S3)