A Distributed Network Security Analysis System Based on Apache Hadoop-Related Technologies. Jeff Springer, Mehmet Gunes, George Bebis



Similar documents
2. BACKGROUND In this section, we provide a brief background on utilized mechanisms.

Qsoft Inc

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

How To Scale Out Of A Nosql Database

Chase Wu New Jersey Ins0tute of Technology

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Complete Java Classes Hadoop Syllabus Contact No:

Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to am to 5 pm HOTEL DUBAI GRAND DUBAI

Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee

Big Data Course Highlights

Hadoop and Map-Reduce. Swati Gore

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Big Data on Microsoft Platform

Fast Data in the Era of Big Data: Twitter s Real-

Hadoop. Sunday, November 25, 12

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Hadoop Ecosystem B Y R A H I M A.

Hadoop Job Oriented Training Agenda

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Peers Techno log ies Pv t. L td. HADOOP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

So What s the Big Deal?

Big Data Management and Security

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

BIG DATA What it is and how to use?

Cloud Scale Distributed Data Storage. Jürmo Mehine

Large scale processing using Hadoop. Ján Vaňo

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

Hadoop Usage At Yahoo! Milind Bhandarkar

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Deploying Hadoop with Manager

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Cassandra in Action ApacheCon NA 2013

Big Data Too Big To Ignore

Big Data and Data Science: Behind the Buzz Words

HDB++: HIGH AVAILABILITY WITH. l TANGO Meeting l 20 May 2015 l Reynald Bourtembourg

Transforming the Telecoms Business using Big Data and Analytics

StratioDeep. An integration layer between Cassandra and Spark. Álvaro Agea Herradón Antonio Alcocer Falcón

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Introduction to Big Data Training

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

Unified Big Data Processing with Apache Spark. Matei

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Introduction to NOSQL

Enabling SOX Compliance on DataStax Enterprise

BIG DATA TRENDS AND TECHNOLOGIES

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Practical Cassandra. Vitalii

Certified Big Data and Apache Hadoop Developer VS-1221

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

NoSQL Roadshow Berlin Kai Spichale

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Big Data. Lyle Ungar, University of Pennsylvania

The? Data: Introduction and Future

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

The Hadoop Eco System Shanghai Data Science Meetup

Data Services Advisory

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Apache Cassandra for Big Data Applications

Distributed Systems. Tutorial 12 Cassandra

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

HDFS. Hadoop Distributed File System

The Cloud Trade Off IBM Haifa Research Storage Systems

Ankush Cluster Manager - Hadoop2 Technology User Guide

Big Data and Scripting Systems build on top of Hadoop

Constructing a Data Lake: Hadoop and Oracle Database United!

NoSQL and Hadoop Technologies On Oracle Cloud

Data Management in SAP Environments

COURSE CONTENT Big Data and Hadoop Training

HADOOP AND MAINFRAMES CRAZY OR CRAZY LIKE A FOX? Mike Combs, VP of Marketing mcombs@veristorm.com

MapReduce with Apache Hadoop Analysing Big Data

Massive Cloud Auditing using Data Mining on Hadoop

BIG DATA HADOOP TRAINING

An Approach to Implement Map Reduce with NoSQL Databases

Workshop on Hadoop with Big Data

Open source large scale distributed data management with Google s MapReduce and Bigtable

Moving From Hadoop to Spark

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Open Source Technologies on Microsoft Azure

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

Dominik Wagenknecht Accenture

Big Data Training - Hackveda

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Talend Big Data. Delivering instant value from all your data. Talend

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

Comparing the Hadoop Distributed File System (HDFS) with the Cassandra File System (CFS)

Implement Hadoop jobs to extract business value from large and varied data sets

Transcription:

A Distributed Network Security Analysis System Based on Apache Hadoop-Related Technologies Bingdong Li, Jeff Springer, Mehmet Gunes, George Bebis University of Nevada Reno FloCon 2013 January 7-10, Albuquerque, New Mexico

Agenda Review Challenges Apache Hadoop Related Technologies System Design Demonstration Thoughts and Pitfalls Summary

Publications By Years Bingdong Li, Jeff Spinger, George Bebis, Mehmet Hadi Gunes, A Survey of Network Flow Applications, Journal of Networks and Computer Applications (accepted).

Research Perspectives By Years Bingdong Li, Jeff Spinger, George Bebis, Mehmet Hadi Gunes, A Survey of Network Flow Applications, Journal of Networks and Computer Applications (accepted).

Methods By Years Bingdong Li, Jeff Spinger, George Bebis, Mehmet Hadi Gunes, A Survey of Network Flow Applications, Journal of Networks and Computer Applications (accepted).

Challenges Too much data (volume) Real Time and On Demand (velocity) Various types/sources of data (variety) Changing requirements(variability) Big Data Volume, Velocity, Variety (Gartner s Doug Laney), Variability (Forrester s James Kobielus G. etc.) http://blogs.sas.com/content/datamanagement/2011/11/05/big-data-defined-its-more-than-hadoop/.

Apache Hadoop Related Technologies What is Apache Hadoop? Open source, storing and processing Big Data Main Systems: Hadoop Distributed File System (HDFS) MapReduce

Apache Hadoop Related Technologies Data collection: Flume, Chukwa, Storage: HDFS, Cassandra, CouchDB, Processing: MapReduce, Pig, Hive, Mahout

Design Goals Philosophy Components Data Collecting Data Storage Data Schema Data Process User Interfaces

Design Goals Real time network query, near real time measurement and analysis Distributed system for data collecting, storing, accessing, measuring and analyzing NetFlow and other log data Models of detection and classification based on profiling and behavior

Design Philosophy Leverage existing technologies Modeling known objects rather than unknown objects or use white list rather than black list

Design: Components

Design: Components Flume: open source collecting, aggregating, and moving data from many different sources to data store Masters: keep track all the nodes and inform them Agents: Sources accept data, Sinks aggregate and send data, Decorator filter, sample and modify data flow.

Design: Components C A P Conjecture A web service can only satisfy any two of Consistency Availability Partition Tolerance Cassandra is AP, arguably CAP with specifying consistency level Any, one, quorum, local_quorum, each_quorum, ALL Gilbert, Seth and Lynch, Nancy, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, SIGAACT News, 2002

Design: Components Cassandra Data Scehma Keyspace Column family Rows and Columns

Design: Components Cassandra Index Primary Index (row key) Secondary Index (column values) DIY with wide row or inverted index Composite Column Third party indexing such as ElasticSearch, Solandra, DataStax Enterprise Counter

Design: Components Data Processing Query network by CQL, or Web UI (Nodejs) Network measurement by Pig scripting, R Advanced data mining and network modeling by programming written by C++ and Java Scheduling tasks

Design: Components User Interface Web User: through a secure internal web page to see reports, schedule advanced analysis tasks Advanced System User: use cassandra-cli, CQL, Pig, and R to do advanced measurement and analysis 18

Design: Features Query Network Status Network Measurement Advanced Network Modeling Host Role s Behavior Roles of Subnet Behavior User Behaviors of Hosts

Demonstration Flume

Demonstration Cassandra Cluster

Demonstration Query by Key

Demonstration Measuring anonymity network usage on campus by using Pig scripting It takes less than 10 minutes to process 205 million packets, about 1.44TB data, writing less than 200 lines of Pig scripting code. Bingdong Li, Esra Edrin, Mehmet Hadi Gunes, George Bebis, Todd Shipley, A Study of Anonymity Technology Usage on the Internet, submitted to Computer Communication

Demonstration Analyzed Anonymity Networks Network Servers Service Tor 61,798 General I2P 2,267 P2P JAP 11 General Remailers 15 Email Proxies 7,246 General Commercial Anomymizer,Gotrusted General Bingdong Li, Esra Edrin, Mehmet Hadi Gunes, George Bebis, Todd Shipley, A Study of Anonymity Technology Usage on the Internet, submitted to Computer Communicatio

Anonymity Network Usage Geolocation

Anonymity Network Usage Distribution

Demonstration Example of Advanced Network Modeling Model Host Role s Behaviors Algorithms: On-line SVM based on Bordes Methods Ground Truth: Host Information in Active Directory and vulnerability scanner Nessus database. Antoine Bordes, etc. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579 1619, September 2005.

Demonstration Client vs Server Classification Accuracy

Thoughts and Pitfalls Low Cost Open Source, Distributed Be patient and careful for Incompatibility between different versions of components Be willing to learn, it is a new era of big data Cassandra Replica Factor = 1? Do not even try What do you do for Exception error? Handle, Ignore or throw it

Summary A design of distrusted real time network security system based on Apache Hadoop related technologies Demonstration Thoughts and pitfalls

Questions and Discussions Contact: Bingdong Li bingdongli@unr.edu