Big Data Analytics for Cyber




Big Data Analytics for Cyber
AFCEA International Cyber Symposium, June 24, 2014
Jon Lau, Vice President and CTO, UMBC Training Centers
6/26/2014 umbctraining.com 443-692-6600

Agenda
- About UMBC & UMBC Training Centers
- The Interest in Data Science / Analytics
- Some History
- Big Data / Internet Scale
- Hadoop 1.0
- Hadoop 2.0 / YARN
- Streaming Analytics
- Cyber Applications
- Data Analytics Project Considerations

About UMBC: Quick Facts
- UMBC: University of Maryland, Baltimore County
- Founded in 1966
- Member of the University System of Maryland
- Over 12,000 undergraduate, graduate and Ph.D. students
- Strengths in Science, Technology, Engineering, Math, and Education
- Research intensive: UMBC receives approximately $100M in annual federal funding for research
- http://www.umbc.edu

About UMBC Training Centers
- Founded by UMBC in 2000
- Applied, non-credit training programs for working professionals and organizations
- Core program areas include:
  - Information Technology
  - Cyber Security
  - Engineering (Electrical, Mechanical, Environmental)
  - Systems Engineering
  - Management, Leadership and Innovation Inst
- http://www.umbctraining.com
6/26/2014 2014, UMBC Training Centers, LLC

Data Analytics & Big Data
"The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every aspect of business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behavior, marketing campaign performance, workflow procedures, and so on. At the same time, information is now widely available on external events such as market trends, industry news, and competitors' movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data, the realm of data science."
--from Data Science for Business by Foster Provost and Tom Fawcett; Publisher: O'Reilly Media, Inc., 2013.

Real World Examples
- Fraud Detection (Credit Cards, Insurance Claims)
- Tax Compliance (IRS)
- Advertising (Google, Yahoo)
- Customer Recommendations (Amazon, Netflix)
- Customer Retention (Verizon, Comcast)
- Intelligence Analysis / Cybersecurity
- Law Enforcement: Criminal Activity Tracking & Prediction
- Manufacturing: Quality and Process Improvement
- Finance / Investment Management

History: Is this New?
- FOCUS on the Mainframe
- SQL
- Ad Hoc Query Tools
- SAS
- OLAP
- Business Intelligence
- Data Warehouse
- Sybase IQ
- Object Oriented Databases for Time Series Analysis
- Unix / Open Systems / Distributed Computing

History: Is this New?
- Operations Research
- Simulation and Modeling
- Optimization
- Statistics
- Machine Learning
- Expert Systems
- Data Science
- Data Mining

What is New?
- Everything is a Computer
- With Ubiquitous Internet Connectivity
- That Tells you Everything about Itself, Constantly
- Including People (Social Media, Geolocation, Privacy becoming a quaint anachronism)
- Big Data:
  - Volume: GB, TB, PB, EB, ZB, YB
  - Velocity: Mbps, Gbps, Tbps?
  - Variety: Unstructured, Multimedia, Time Series

Key Terms and Buzzwords
[Image: 2011, Intel Free Press]

Big Data: The Problem
- Twitter: 58 million tweets per day
- Facebook: 1.3 billion users; 300 PB data + 600 TB daily
- LinkedIn: Over 225 million members
- YouTube: 100 hours of video uploaded every minute
- Large Hadron Collider: 10s of PB per year
- Large Synoptic Survey Telescope: 30 TB per night
- Boeing jet generates 10 TB of information per engine every 30 minutes of flight
- Internet of Things: sensors everywhere
- Existing Tech Fails: SQL, CPU, RAM, proprietary software

New Tech: Capitalism + Socialism?
- New Internet Business Models + Billions of $$ in Venture Capital = Massive Incentive to Innovate
- Free and Open Source Software (FOSS) Model fuels a Philosophy, Culture, and Ecosystem of Innovation
  - Red Hat, Ubuntu: Linux
  - Google, Yahoo: Hadoop
  - Facebook: Cassandra, and many others
  - LinkedIn: Kafka, Voldemort, ...
  - Amazon: AWS
  - Netflix: many...

Big Data 1.0: MapReduce / Hadoop
- Based on ideas published by Google (Google File System, MapReduce) and first implemented in open source in the Nutch project
- Evolved to Hadoop / MapReduce at Yahoo
- Designed to solve problems at Internet scale (e.g. Search)
- Very Simple Idea:
  - Map a large data set into discrete chunks or blocks
  - Merge / aggregate ("Reduce") the data to produce the final result
  - Move the computing to the data (reversing traditional models)
  - Spread the computing work across 10s, 100s, 1000s of nodes
  - Build on top of a robust distributed file system (HDFS)
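The "very simple idea" above can be sketched in a few lines of plain Python. This is only an illustration of the map / shuffle / reduce flow on a single machine, not Hadoop code: the function names (`split_into_blocks`, `map_block`, `shuffle`, `reduce_counts`, `word_count`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Divide the input into fixed-size blocks, as a distributed file system would."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key reaches a single reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: aggregate the values for each key into the final result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    # On a real cluster each block would be mapped on the node that stores it
    # ("move the computing to the data"); here the blocks are mapped in a loop.
    mapped = chain.from_iterable(map_block(block) for block in blocks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data big cluster", "data moves to compute", "compute moves to data"]
    print(word_count(sample))
```

Because each call to `map_block` depends only on its own block, the map step is the part that can be spread across many nodes; only the shuffle needs to bring values for the same key together.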

MapReduce in Hadoop

Scaling Up vs. Scaling Out
- Scaling up: increase processing on one machine (high-end server processor, RAM, storage)
- Scaling out: many machines running in parallel on the same local network

Hadoop 1.x Success
- Became the open source Apache Hadoop project
- Other synergistic tools were developed to leverage or extend MapReduce and HDFS (e.g. Pig, Hive)
- Leverages the inherent openness and scalability of Linux
- Many real world problems fit the MapReduce model
- Extremely successful in Internet scale applications:
  - Yahoo: 30,000+ node Hadoop cluster
  - Facebook: 100 PB cluster
  - Many others (Netflix, New York Times, Government)

Hadoop 1.x Limitations
- MapReduce not ideal for many applications
- Limited to batch-oriented processing; not well suited for interactive or real time / streaming applications
- Resource management constrained by JobTracker architecture
- Scalability bottlenecks in Mapper & Reducer utilization

Hadoop 2.x Improvements
- Hadoop 2.0 released in Fall 2013
- Apache YARN (Yet Another Resource Negotiator) becomes a sub-project of Apache Hadoop and a core component of Hadoop 2.x
- De-couples MapReduce resource management & scheduling from data processing
- Enables Hadoop to serve as a general data processing platform that is not constrained to MapReduce and batch-oriented applications
- Resources are managed similarly to how an operating system handles jobs
- Hadoop now supports a broader array of applications including real-time streaming data (Apache Storm) and interactive querying (Apache Tez)

Hadoop 2.x Improvements
- HDFS improvements (HDFS2)
- Java API changes
- Ambari: graphical tool for cluster creation, configuration management, administration & monitoring
- Additional open source projects built on top of YARN (graph processing, search, in-memory computing, stream processing)
- Hadoop now looks much more like an OS for Big Data
- Currently very early in the cycle of adoption & conversion

Hadoop 1.0 to 2.0
Source: http://hortonworks.com/labs/yarn/

Hadoop 1.0
Source: http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

Hadoop 2.0
Source: http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

YARN Applications
- Hadoop 1.0: JobTracker responsible for resource management, job scheduling, and job monitoring within a MapReduce compute model
- Hadoop 2.0: a central ResourceManager and the per-node NodeManager form a generic system for managing distributed applications of any type
- ResourceManager is the scheduler/arbiter of cluster resources

YARN Applications
- ApplicationMaster per application (e.g. MapReduce, Graph Processing, Message Passing); responsible for negotiating resources, executing and monitoring compute jobs ("Containers"), and tracking status
- Generalized resource model (hostname, rackname, CPU, memory)
- Containers can be tuned to the application needs
- Applications can launch any type of process, not just Java or MapReduce
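To make the generalized resource model concrete, here is a toy sketch of an arbiter that grants containers against per-node memory and CPU, preferring the requested host, then the requested rack, then any node. It does not use or reproduce the YARN API: `Node`, `Container`, `ToyResourceManager`, and `allocate` are hypothetical names invented for this illustration, and a real scheduler has queues, priorities, and fairness policies that are left out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Resources a (hypothetical) per-node manager reports as still free."""
    hostname: str
    rack: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class Container:
    """A granted slice of one node's memory and CPU."""
    hostname: str
    memory_mb: int
    vcores: int


class ToyResourceManager:
    """Hypothetical arbiter of cluster resources; not the YARN API."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, memory_mb, vcores, preferred_host=None, preferred_rack=None):
        """Grant a container, trying the preferred host, then rack, then any node."""

        def fits(node):
            return node.free_memory_mb >= memory_mb and node.free_vcores >= vcores

        def locality(node):
            if node.hostname == preferred_host:
                return 0  # node-local
            if node.rack == preferred_rack:
                return 1  # rack-local
            return 2      # anywhere in the cluster

        candidates = sorted((n for n in self.nodes if fits(n)), key=locality)
        if not candidates:
            return None  # request cannot be satisfied right now
        node = candidates[0]
        node.free_memory_mb -= memory_mb
        node.free_vcores -= vcores
        return Container(node.hostname, memory_mb, vcores)


if __name__ == "__main__":
    rm = ToyResourceManager([Node("n1", "r1", 4096, 4), Node("n2", "r2", 2048, 2)])
    print(rm.allocate(1024, 1, preferred_host="n2"))
```

In this picture an ApplicationMaster would be the caller of `allocate`: it asks for containers sized to its own needs and then decides what process to run in each one.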

Hadoop 2.0 / YARN
Source: http://hortonworks.com/get-started/yarn/

Hadoop: OS for Big Data?
- Linux Kernel / Apache Hadoop: core foundation
- Need other open tools that run on Linux / Hadoop:
  - Linux: GNU utilities; GUI / X-Windows; Programming Languages (Perl, Python, PHP); Servers (Web, Samba, Database); Applications
  - Hadoop: Pig, Hive, HBase, Storm, Kafka, Ambari, YARN, Tez
- Distribution ("Distro"):
  - Linux: Red Hat, Fedora, CentOS, SUSE, Debian, Ubuntu, Mint
  - Hadoop: Hortonworks Data Platform (HDP 2.4), Cloudera Distribution for Hadoop (CDH 5.0), MapR Enterprise, Pivotal HD

Streaming / Realtime Analytics
- Advertising: analyzing 100K data points every second and making a decision on each in 10 ms
- Finance: processing continuous market ticks, transactions, and news events for trading decisions (beating your rival to the trigger)
- Cyber / IT: monitoring network packets and log data for breaches
- Social Media: analyzing tweets on a keyword from the Twitter API
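The advertising figures above imply a concurrency requirement that can be worked out with Little's law (average work in progress = arrival rate x time in system). The sketch below is a back-of-the-envelope calculation only; it assumes each decision takes the full 10 ms and that a worker handles one decision at a time, and the function names are made up for the example.

```python
import math


def in_flight_decisions(events_per_second, latency_ms):
    """Little's law: average number of decisions being worked on at once."""
    return events_per_second * latency_ms / 1000


def min_workers(events_per_second, latency_ms, decisions_per_worker=1):
    """Lower bound on workers if each handles `decisions_per_worker` at a time."""
    return math.ceil(in_flight_decisions(events_per_second, latency_ms) / decisions_per_worker)


if __name__ == "__main__":
    # 100K data points per second, 10 ms per decision (figures from the slide).
    print(in_flight_decisions(100_000, 10))
    print(min_workers(100_000, 10))
```

Under these assumptions roughly a thousand decisions are in progress at any instant, which is why a single sequential process cannot keep up and the work has to be spread across many parallel workers.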

Streaming Tech Challenge
- SQL systems can't maintain those data rates
- Hadoop MapReduce: the parallelism that works well for huge data sets has too much latency
- Other potential approaches:
  - In-memory database
  - CEP systems, e.g. Tibco
  - Size and/or scale-out limited
- Out-of-sequence data poses an additional challenge
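One common way to cope with out-of-sequence data is a small reorder buffer: hold each event until something sufficiently newer has arrived, then release events in timestamp order. The sketch below is a generic illustration rather than a technique named in the slides; `reorder` and `max_delay` are hypothetical names, and events that arrive later than the allowed delay are simply dropped here.

```python
import heapq


def reorder(events, max_delay):
    """Yield (timestamp, payload) pairs in non-decreasing timestamp order.

    An event is released once an event at least `max_delay` newer has been
    seen. Events that arrive after their slot has passed are dropped.
    """
    heap = []
    newest = float("-inf")
    watermark = float("-inf")  # everything at or below this has been released
    seq = 0                    # tie-breaker so payloads are never compared
    for ts, payload in events:
        if ts < watermark:
            continue  # too late: newer events have already been released
        heapq.heappush(heap, (ts, seq, payload))
        seq += 1
        newest = max(newest, ts)
        watermark = newest - max_delay
        while heap and heap[0][0] <= watermark:
            released_ts, _, released_payload = heapq.heappop(heap)
            yield released_ts, released_payload
    # End of input: flush whatever is still buffered, in order.
    while heap:
        released_ts, _, released_payload = heapq.heappop(heap)
        yield released_ts, released_payload


if __name__ == "__main__":
    arrivals = [(1, "a"), (3, "c"), (2, "b"), (6, "f"), (4, "d"), (5, "e"), (0, "late")]
    print(list(reorder(arrivals, max_delay=2)))
```

The trade-off is visible in `max_delay`: a larger value tolerates more disorder but adds latency and memory, which matters when decisions are expected within milliseconds.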

Possible Solution
Ingestion: Apache Kafka
- Kafka is a distributed, partitioned, replicated commit log service that uses a producer / consumer model
- Designed for real time activity streams (initially at LinkedIn)
- Maintains feeds of messages in topics
- Can provide stronger guarantees of message arrival & order
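The commit log idea can be illustrated with a toy in-memory model: a topic is a set of append-only partitions, a producer appends by key, and each consumer tracks its own read offset per partition. This is not the Kafka client API and it leaves out distribution and replication entirely; `ToyTopic`, `ToyConsumer`, `produce`, and `poll` are hypothetical names for this sketch.

```python
import zlib


class ToyTopic:
    """A named feed of messages split into append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append to the partition chosen by key; return (partition, offset)."""
        # A stable hash keeps every message for one key in one partition,
        # which is what preserves per-key ordering.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


class ToyConsumer:
    """Reads a topic from its own offsets, independently of other consumers."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return all records appended since the last poll."""
        records = []
        for partition, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[partition]:])
            self.offsets[partition] = len(log)
        return records


if __name__ == "__main__":
    topic = ToyTopic("auth-events", num_partitions=3)
    topic.produce("host-17", "login failed")
    topic.produce("host-17", "login ok")
    consumer = ToyConsumer(topic)
    print(consumer.poll())
```

Because the log is never mutated by reads, a second consumer (or a restarted one) can replay the same messages from an earlier offset, which is what makes the log useful as an ingestion buffer in front of slower processing stages.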

Possible Solution
Realtime Computation: Apache Storm
- Storm is a distributed realtime computation system
- Storm makes it easy to reliably process unbounded streams of data
- It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate (Storm came from Twitter)
- A benchmark clocked it at over a million tuples processed per second per node
- A Storm cluster is similar to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm you run topologies
- A topology is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes

Possible Solution
Realtime Computation: Apache Storm
- The basic primitives Storm provides for doing stream transformations are spouts and bolts
- A spout is a source of streams, e.g. a spout may connect to the Twitter API and emit a stream of tweets
- A bolt consumes any number of input streams, does some processing, and possibly emits new streams
- Networks of spouts and bolts are packaged into a topology, which is the top-level abstraction that you submit to Storm clusters for execution
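The spout / bolt / topology vocabulary can be mimicked on one machine with Python generators. This is only a conceptual sketch and does not use the Storm API: the names `sentence_spout`, `split_bolt`, `keyword_filter_bolt`, `count_bolt`, and `build_topology` are invented for the example, the topology is a simple chain rather than a general graph, and the input is a finite list standing in for an unbounded stream.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: a source of tuples. A real spout would read an unbounded stream."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: consume sentences and emit one lower-cased word per tuple."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def keyword_filter_bolt(stream, keywords):
    """Bolt: pass through only the words being monitored."""
    for word in stream:
        if word in keywords:
            yield word


def count_bolt(stream):
    """Bolt: keep a running count per word and emit (word, count) updates."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def build_topology(sentences, keywords):
    """Wire the spout and bolts together; the links define how data flows."""
    words = split_bolt(sentence_spout(sentences))
    return count_bolt(keyword_filter_bolt(words, keywords))


if __name__ == "__main__":
    feed = ["Breach detected on host", "no breach here", "malware and breach again"]
    for update in build_topology(feed, {"breach", "malware"}):
        print(update)
```

Each stage processes one tuple at a time and holds only its own small state, which is the property that lets a real system run many copies of each bolt in parallel across a cluster.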

Possible Solution
- Persistence:
  - In-memory database: MemSQL, Spark (with Shark?)
  - NoSQL: Cassandra, MongoDB, CouchDB, Redis
  - Archival / Batch: Hadoop / HDFS
- Integration / Resource Management: YARN
- Configuration: ZooKeeper
- Administration / Monitoring: Ambari
- A variant of this architecture has been utilized in AWS to process enormous data rates / volumes on click streams, Twitter messages, and other Internet streams

Problems of Interest in Cyber
- Log Aggregation and Analysis
- System Monitoring / Management
- Real Time Threat Detection / Mitigation
- Privacy / Confidentiality
- Data Leakage / Exfiltration
- Espionage / Counter Intelligence
- Insider Threat
- Compliance

Problems of Interest in Cyber
- Malware / APT
- Battlefield Logistics
- Tracking Targets of Interest
- Forensics
- Criminal Behavior
- Movement of Money
- Strategy Testing
- Relationship (Social) Graphing

Analytics Project Considerations
- With the Hadoop 2 Framework and Ecosystem, you can do analytics at any scale and time frame
- For complex analytics, data science provides a wide range of algorithms for various categories of problems
- Public (AWS) and private (OpenStack, VMware) cloud computing frameworks have evolved to big data scale
- You don't need to create anything
- But you do need to choose the right tools for the job
- And architecting & scaling for production will still be a challenge

Analytics Project Considerations
- What is the problem to be solved?
- What data exists / what is missing?
- What is the cost / benefit of data analytics?
- Does the analytic make sense? Is it valid?
- How will I utilize the results (decisions)?
- How do I represent the results meaningfully (reports, dashboards, visualizations)?
- Real data is always messy!

Data Analytics Project Teams

Analytics Project Considerations
- System Administration
- Ongoing Operations
- Security
- Legal & Policy
- Governance
- Ethics
- Social / Political Ramifications

Contact Information
UMBC Training Centers
Phone: (443) 692-6600
Web: www.umbctraining.com
Data Analytics Training Programs: http://umbc.tc/data
Jon Lau, Vice President: (443) 692-6597, jlau@umbctraining.com
UMBC Training Centers, 6996 Columbia Gateway Drive, Columbia, MD 21046