Forensic Clusters: Advanced Processing with Open Source Software. Jon Stewart Geoff Black




Who We Are Jon Stewart and Geoff Black, both Lightbox Technologies, both Guidance Software alumni. Backgrounds between them span EnScript, C++, Java, Python, JavaScript, C#, Linux, forensic examination, Fortune 100 financial work, NCIS (DDK/ManTech), and the Louisiana DOJ. One pessimist, one optimist. Like many of you, we've dealt with a lot of data.

CPU Clock Speed

Hard Drive Capacity

The Problem CPU clock speed stopped doubling, but hard drive capacity kept doubling. Multicore CPUs to the rescue!... but they're wasted on single-threaded apps... and hard drive transfer speeds might be too slow for 24-core machines (depending).

Solution: Distributed Processing Split the data up, process it in parallel, scale out as needed. Sounds great!

I know, let's use Oracle!...

Storage quickly becomes the bottleneck $$$$$$$

These guys have a lot of data......how do they do their processing?

What about these guys? http://wiki.apache.org/incubator/accumuloproposal

Apache Hadoop Apache Hadoop is an open source software suite for distributed processing and storage, primarily inspired by Google's in-house systems. http://hadoop.apache.org

The Foundation Apache Hadoop was created at Yahoo!, based on Google's MapReduce and Google File System papers. Apache HBase is based on Google's BigTable.

Apache Hadoop: Scalable, Fault-tolerant, Open Source Hadoop Distributed File System (HDFS) MapReduce Framework Source: http://www.cloudera.com/what-is-hadoop/hadoop-overview/


HDFS Just like any other file system, but better. A metadata master server (the NameNode) acts as the FAT/MFT: it knows where files and blocks are on the cluster. Blocks live on separate machines (DataNodes), with automatic replication of blocks (default 3x); all blocks have checksums, and placement is rack-aware. DataNodes form a p2p network, and clients contact them directly for reading/writing file data. HDFS is not general purpose: files are read-only once written, and it is optimized for streaming reads/writes of large files.
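To make the rack-aware replication concrete, here is a toy Python sketch (not Hadoop's actual code) of the default HDFS placement strategy: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. The rack/node names are illustrative.

```python
def place_replicas(writer_node, racks, n_replicas=3):
    """Toy rack-aware placement mimicking the HDFS default policy.
    racks: {rack_name: [node, ...]}"""
    # Find the rack the writer lives in; replica 1 goes on the writer's node.
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    placement = [writer_node]
    # Replicas 2 and 3 go to a remote rack (deterministic toy choice here).
    remote_rack = min(r for r in racks if r != local_rack)
    for node in racks[remote_rack]:
        if len(placement) < n_replicas:
            placement.append(node)
    return placement

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
replicas = place_replicas("n2", racks)
```

Losing a whole rack under this policy still leaves at least one live copy of every block, which is why the NameNode needs the rack topology at all.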


MapReduce Bring the application to the data: code is sent to every node in the cluster, and local data is then processed as "tasks". There is support for custom file formats. Failed tasks are restarted elsewhere, or skipped. One node can process several tasks at once (idle CPUs are bad). Output is automatically collated, and the parallelism is transparent to the programmer.
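The classic illustration is word count. Here is a single-process Python sketch of the map, shuffle, and reduce steps; in a real Hadoop job each phase runs in parallel across the cluster, and the shuffle is the "automatically collated" step mentioned above.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input split
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group values by key -- the framework's collation step
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
```

The programmer writes only the map and reduce functions; partitioning, distribution, and fault handling belong to the framework.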

MapReduce Example Yahoo! has over 500M unique users per month and billions of transactions per day. Search Assist (type-ahead) is based on years of log data. Before Hadoop: 26 days of processing time. After Hadoop: 20 minutes.

HBase Random access, updates, versioning. A conventional RDBMS (MySQL): limited throughput, not distributed, fields always allocated (e.g., varchar[255]), strict schemas, ACID integrity/transactions, support for joins, indices speed things up. HBase: high write throughput, distributed automatically, null values not stored, a row is a Name:Value list, schemas not imposed, rows sorted by key, no joins, no transactions, good at table scans, fields are versioned.
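A minimal Python sketch of that data model (rows kept sorted by key, each row a list of name/value cells, every write versioned by timestamp). This illustrates the model only; it is nothing like HBase's actual implementation.

```python
import bisect

class ToyTable:
    """Rows sorted by key; each cell keeps every version with a timestamp."""
    def __init__(self):
        self.keys = []   # sorted row keys
        self.rows = {}   # key -> {column: [(ts, value), ...], newest first}

    def put(self, key, column, value, ts):
        if key not in self.rows:
            bisect.insort(self.keys, key)   # keep row keys sorted
            self.rows[key] = {}
        # Newest version first (assumes increasing timestamps in this toy);
        # null values are simply never stored.
        self.rows[key].setdefault(column, []).insert(0, (ts, value))

    def get(self, key, column):
        versions = self.rows.get(key, {}).get(column, [])
        return versions[0][1] if versions else None

    def scan(self, start, stop):
        # Range scan over sorted row keys -- what this model is good at
        i = bisect.bisect_left(self.keys, start)
        j = bisect.bisect_left(self.keys, stop)
        return [(k, self.rows[k]) for k in self.keys[i:j]]
```

Because rows are sorted and sparse, a scan over a key range touches only the rows and cells that actually exist, with no fixed-width allocation.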

HBase Auto-Sharding A logical table, sorted by row key, is split into regions, and regions are distributed across region servers; the client routes each request to the region server hosting the relevant key range. Source: http://www.slideshare.net/cloudera/hadoop-world-2011-advanced-hbase-schema-design-lars-george-cloudera
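Sharding by key range means routing a row key to its region is just a range lookup over sorted region start keys. A sketch with hypothetical split points and server names:

```python
import bisect

# Each region covers [start_key, next_start_key); starts are sorted.
# Split points and server assignments below are made up for illustration.
region_starts = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs1", "rs3"]   # region index -> server

def route(row_key):
    # Find the last region whose start key is <= row_key
    idx = bisect.bisect_right(region_starts, row_key) - 1
    return region_servers[idx]
```

When a region grows too large it splits, which in this picture just means inserting a new start key and reassigning the halves; clients pick up the new boundaries and keep routing the same way.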

So, let's do forensics on Hadoop! We'd done some work on proof-of-concept and design. The Army Intelligence Center of Excellence funded an open source prototype effort, in collaboration with 42six Solutions, Basis Technologies, and Dapper Vision.

Open Source Stack for Distributed Forensic Processing Processing: fsrip, plugins, MRCoffee. Document analysis: Dapper Vision Picarus. Storage & job management: Hadoop and HBase. Libraries: The Sleuth Kit.

Pipeline Ingest Processing Review

Ingest: File Extraction / Hash Calculation Evidence images (dd, e01) are read by fsrip, built on The Sleuth Kit, and file hashes are computed (md5sum, sha1sum).

Pipeline Ingest Processing Review

Processing Tasks Hash set analysis, keyword searching, text extraction, document clustering, face detection, graphics clustering, video analysis. Not limited to these: parse *anything* with the plugin interface.

Processing Plugins, MRCoffee, and document analysis (Dapper Vision Picarus).

Processing Plugins A plugin interface lets community members extend functionality, in Python + <other languages>. Plugins can return data to the system: a PST file -> emails -> HBase rows; Internet history records -> HBase rows; extracting zip files -> HBase rows.
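As a hypothetical sketch of what such a plugin contract could look like: the plugin receives a file's key and bytes and returns rows (key plus column/value pairs) for the system to write to HBase. The function name and row shape here are illustrative assumptions, not the project's actual API.

```python
import io
import zipfile

def zip_listing_plugin(file_key, data):
    """Hypothetical plugin: expand a zip file into one row per entry."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            # Child rows are keyed under the parent file's key
            row_key = f"{file_key}/{info.filename}"
            rows.append((row_key, {"name": info.filename,
                                   "size": str(info.file_size)}))
    return rows
```

Keying child rows under the parent's key means a range scan on the parent key retrieves a file and everything extracted from it in one pass.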

Processing Flow A plugin consumes file data and emits parsed records.

Processing with MRCoffee Sandboxed virtual processing: it creates a supermin VM on the fly with kvm, only 100s of KB in size. No network stack, no gcc, no nothin'. The VM uses the host kernel, glibc, etc., and includes whatever support files you give it. A character device is used for data transfer: inside the VM, the MRCoffee daemon pumps data to your own utility on stdin and pipes data from its stdout back to the host and HBase.
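The host side of that pattern, feeding evidence bytes to an arbitrary utility's stdin and capturing its stdout, can be sketched with subprocess. Here a trivial Python child stands in for a real forensic tool running inside the VM; this illustrates the pump pattern only, not MRCoffee's actual transport.

```python
import subprocess
import sys

def pump(command, data):
    """Send bytes to a child process on stdin; return its stdout bytes."""
    result = subprocess.run(command, input=data,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout

# Example: an uppercasing child stands in for a real processing utility
out = pump([sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"],
           b"pst bytes")
```

Because the contract is just stdin-in, stdout-out, any existing command-line tool can be dropped into the sandbox without modification.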

Pipeline Ingest Processing Review

Review: Reporting

Review: Image Clustering

Review: Video Keyframe Analysis

Wrap Up Yes, we can scale forensics. We can do it without dongles, and without fancy hardware. We're CPU rich, finally, and we can add machines to do more and better analysis. Not quite ready for primetime, though... some lab will need to be first.

https://github.com/jonstewart/tp Jon Stewart Geoff Black jon@lightboxtechnologies.com geoff@lightboxtechnologies.com