Massive Cloud Auditing using Data Mining on Hadoop



Similar documents
How To Handle Big Data With A Data Scientist

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop. Sunday, November 25, 12

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Improving Data Processing Speed in Big Data Analytics Using. HDFS Method

Data Refinery with Big Data Aspects

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

Comprehensive Analytics on the Hortonworks Data Platform

Big Data Analytics - Accelerated. stream-horizon.com

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

CDH AND BUSINESS CONTINUITY:

How To Scale Out Of A Nosql Database

Packet Flow Analysis and Congestion Control of Big Data by Hadoop

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Hadoop implementation of MapReduce computational model. Ján Vaňo

The big data revolution

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

UPS battery remote monitoring system in cloud computing

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Large scale processing using Hadoop. Ján Vaňo

Luncheon Webinar Series May 13, 2013

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Professional Big Data Master s Program to train Computational Specialists

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Open source Google-style large scale data analysis with Hadoop

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

BIG DATA TRENDS AND TECHNOLOGIES

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Hadoop IST 734 SS CHUNG

Manifest for Big Data Pig, Hive & Jaql

2015 The MathWorks, Inc. 1

Transforming the Telecoms Business using Big Data and Analytics

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence

The Future of Data Management

BIG DATA What it is and how to use?

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Assignment # 1 (Cloud Computing Security)

BiDAl: Big Data Analyzer for Cluster Traces

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Big Data? Definition # 1: Big Data Definition Forrester Research

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Big Data and Industrial Internet

Big Data and Apache Hadoop s MapReduce

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Chase Wu New Jersey Ins0tute of Technology

Oracle Big Data SQL Technical Update

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

HDP Hadoop From concept to deployment.

How Companies are! Using Spark

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Big Data and Transactional Databases Exploding Data Volume is Creating New Stresses on Traditional Transactional Databases

So What s the Big Deal?

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Advanced Big Data Analytics with R and Hadoop

How To Make Data Streaming A Real Time Intelligence

Building your Big Data Architecture on Amazon Web Services

Enabling High performance Big Data platform with RDMA

Big Data Analytics for Net Flow Analysis in Distributed Environment using Hadoop

From GWS to MapReduce: Google s Cloud Technology in the Early Days

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

BIG DATA TOOLS. Top 10 open source technologies for Big Data

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop and Map-Reduce. Swati Gore

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Play with Big Data on the Shoulders of Open Source

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

I/O Considerations in Big Data Analytics

The Potential of Big Data in the Cloud. Juan Madera Technology Consultant

A Brief Introduction to Apache Tez

Hadoop Big Data for Processing Data and Performing Workload

White Paper. Version 1.2 May 2015 RAID Incorporated

Reference Architecture, Requirements, Gaps, Roles

Big Data Use Case: Business Analytics

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

TRAINING PROGRAM ON BIGDATA/HADOOP

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

The 3 questions to ask yourself about BIG DATA

CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION

Modern Data Architecture for Predictive Analytics

Architectures for Big Data Analytics A database perspective

Transcription:

Massive Cloud Auditing using Data Mining on Hadoop Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University

Outline Massive Cloud Auditing Traffic Characterization Distributed Data Storage and Online Querying Online Data Mining System Architecture Prototype Work Accomplished Future Work

Massive Cloud Auditing Cloud Auditing of massive logs requires analyzing data volumes which routinely cross the peta-scale threshold. Computational and storage requirements of any data analysis methodology will be significantly increased. Distributed data mining algorithms and implementation techniques needed to meet scalability and performance requirements entailed in such massive data analyses. Current distributed data mining approaches pose serious issues in performance and effectiveness in information extraction of cloud auditing logs. Reasons include scalability, dynamic and hybrid workload, high sensitivity, and stringent time constraints..

Cloud Auditing Challenges Data Mining Approaches

Traffic Characterization Cloud Traffic logs accumulated from diverse and geographically disparate sources. Sources include stored and live traffic from popular web applications: Web and Email Live Packet Capture from packet sniffing tools (Wireshark) Honeypot traffic from UTSA comprising of malicious traffic. Augment traffic information with IP,DNS, geolocation analysis procured from publicly available datasets and high level network and flow statistics.

Traffic Characterization IP geolocation from public databases retrieve the name and street address of the organization which registered the address block. For large ISPs the registered street address usually differs from the real location of its hosts. Measurement based IP geolocation utilize active packet delay measurements to approximate the geographical location of network hosts. Secure IP geolocation to defend against adversaries manipulating packet delay measurements to forge locations Traffic characterization will generate massive amount of data. Need for distributed data storage.

Distributed Data Storage and Online Querying Hadoop based distributed data storage and online querying solution. Hadoop: Software library which supports distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage Hbase and Hive Distributed Data Storage Chukwa: Distriubted data collection system based on Hadoop system.

Distributed Data Storage and Online Querying Relatively high latency in Hadoop system which is not desirable for online query processing. Design middleware to support structured and unstructured data and multiple data storage(hive and HBASE) Design appropriate data structure for the auditing data to achieve quick retrieve and access of large data. Develop efficient data collection system to reduce the amount of network disk access. Develop algorithms and implementation techniques for both batch processing and online query processing of cloud audit data based on Hadoop distributed system.

Online Data Mining Develop data mining algorithms that work in a massively parallel and yet online fashion for mining of large data streams Reducing time between query submission and obtaining results. Overall speed of query processing depends critically on the query response time Map-Reduce programming model used for fault-tolerant and massively parallel data crunching. But Map-Reduce implementations work only in batch mode and do not allow stream processing or exploiting of preliminary results.

Online Data Mining Online Aggregation: Reduce query processing time by monitoring of preliminary results of an aggregation query along with estimates on the result's error intervals. Proposed Solution: Develop game theoretic model to combine Map-Reduce parallelization and online aggregation to reduce query processing time. Lack of memory to store preliminary data mining results when processing streams, and expensive computation for growing size of preliminary result grows and frequent input updates Both problems can be solved by limiting the number of historical" results to be stored.

Online Data Mining Balance the risks of using a preliminary result against the benefits of an accelerated computation, two questions need to be answered: What is the likelihood that the query result is likely to change if all data is processed?" and How much time can be saved if the current result is used?". Game theoretic approach to answer these questions by providing mechanisms for monitoring the convergence of the results and the processing progress. Benefits of Accelerated Query Processing Query Processing Time % of preliminary results Convergence point? Risk of Using Preliminary Results

System Architecture App App App App Data Capture Agent Data Capture Agent Data Capture Agent Data Capture Agent Collector Collector User Hbase,Hive Database Hadoop MapReduce Query Interface Hadoop Distributed File System (HDFS) Data Mining Hadoop Cluster

Prototype

Work Accomplished Malicious traffic available from UTSA s cyber security monitoring system. Developed prototype of IP and DNS analysis of malicious and normal traffic Design of secure IP geolocation algorithm to defend against attacks on delay measurements. Implemented Hadoop based prototype for collecting and storage of network flow trace. Developed prototype of web user interface to retrieve data stored in Hadoop based system. Enhanced the portability of the Chukwa s data collection process by supporting Windows based cloud users.

Future Works Develop secure IP geolocation algorithm to defend against attacks on measurement based IP geolocation Develop Hadoop based algorithms and implementation techniques for distributed data storage and online query processing Design and evaluate middleware to support structured and unstructured data and multiple databases Design efficient data collection system to reduce the amount of network disk access. Develop game theoretic model for refining convergence monitoring, and develop tools to handle concept drift in data.