Big Data Analytics for Cyber AFCEA International Cyber Symposium June 24, 2014 Jon Lau, Vice President and CTO UMBC Training Centers 6/26/2014 umbctraining.com 443-692-6600 1
Agenda: About UMBC & UMBC Training Centers; The Interest in Data Science / Analytics; Some History; Big Data / Internet Scale; Hadoop 1.0; Hadoop 2.0 / YARN; Streaming Analytics; Cyber Applications; Data Analytics Project Considerations
About UMBC (Quick Facts): UMBC is the University of Maryland, Baltimore County; Founded in 1966; Member of the University System of Maryland; Over 12,000 undergraduate, graduate and Ph.D. students; Strengths in Science, Technology, Engineering, Math, and Education; Research intensive: UMBC receives approximately $100M in annual federal funding for research; http://www.umbc.edu
About UMBC Training Centers: Founded by UMBC in 2000; Applied, non-credit training programs for working professionals and organizations; Core program areas include: Information Technology; Cyber Security; Engineering (Electrical, Mechanical, Environmental); Systems Engineering; Management, Leadership and Innovation Inst; http://www.umbctraining.com 6/26/2014 2014, UMBC Training Centers, LLC 4
Data Analytics & Big Data: "The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every aspect of business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behavior, marketing campaign performance, workflow procedures, and so on. At the same time, information is now widely available on external events such as market trends, industry news, and competitors' movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data: the realm of data science." --from Data Science for Business by Foster Provost and Tom Fawcett; Publisher: O'Reilly Media, Inc., 2013.
Real World Examples: Fraud Detection (Credit Cards, Insurance Claims); Tax Compliance (IRS); Advertising (Google, Yahoo); Customer Recommendations (Amazon, Netflix); Customer Retention (Verizon, Comcast); Intelligence Analysis / Cybersecurity; Law Enforcement: Criminal Activity Tracking & Prediction; Manufacturing: Quality and Process Improvement; Finance / Investment Management
History: Is this New? FOCUS on the Mainframe, SQL, Ad Hoc Query Tools, SAS, OLAP, Business Intelligence, Data Warehouse, Sybase IQ, Object Oriented Databases for Time Series Analysis, Unix / Open Systems / Distributed Computing
History: Is this New? Operations Research, Simulation and Modeling, Optimization, Statistics, Machine Learning, Expert Systems, Data Science, Data Mining
What is New? Everything is a Computer, With Ubiquitous Internet Connectivity, That Tells you Everything about Itself, Constantly; Including People (Social Media, Geolocation, Privacy becoming a quaint anachronism). Big Data: Volume (GB, TB, PB, EB, ZB, YB); Velocity (Mbps, Gbps, Tbps?); Variety (Unstructured, Multimedia, Time Series)
Key Terms and Buzzwords 2011, Intel Free Press
Big Data, The Problem: Twitter: 58 million tweets per day; Facebook: 1.3 billion users, 300 PB data + 600 TB daily; LinkedIn: Over 225 million members; YouTube: 100 hours of video uploaded every minute; Large Hadron Collider: 10s of PB per year; Large Synoptic Survey Telescope: 30 TB per night; Boeing jet generates 10 TB of information per engine every 30 minutes of flight; Internet of Things: sensors everywhere; Existing Tech Fails: SQL, CPU, RAM, proprietary software
New Tech: Capitalism + Socialism? New Internet Business Models + Billions of $$ in Venture Capital = Massive Incentive to Innovate; Free and Open Source Software (FOSS) Model fuels a Philosophy, Culture, and Ecosystem of Innovation; Red Hat, Ubuntu: Linux; Google, Yahoo: Hadoop; Facebook: Cassandra, and many others; LinkedIn: Kafka, Voldemort,...; Amazon AWS; Netflix: many...
Big Data 1.0, MapReduce / Hadoop: Based on ideas developed at Google (Google File System, MapReduce) and first implemented as open source in Nutch; Evolved to Hadoop / MapReduce at Yahoo; Designed to solve problems at Internet scale (i.e. Search); Very Simple Idea: Map a large data set into discrete chunks or blocks; Merge / aggregate ("Reduce") the data to produce the final result; Move the computing to the data (reversing traditional models); Spread the computing work across 10s, 100s, 1000s of nodes; Build on top of a robust distributed file system (HDFS)
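The map / reduce idea on the slide above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's API: the function names (`split_into_blocks`, `map_block`, `shuffle`, `reduce_counts`, `word_count`) and the word-count task are assumptions chosen for the example, and the "blocks" are plain list slices rather than HDFS blocks.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Cut the input into fixed-size blocks, standing in for file system blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key so that each key is reduced in one place."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: aggregate each key's values into a final count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    # On a cluster, each block would be mapped on the node that stores it.
    mapped = chain.from_iterable(map_block(block) for block in blocks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data big cluster", "data moves to compute", "compute moves to data"]
    print(word_count(sample))
```

Because each block is mapped independently, the map calls could run on separate machines; only the grouped intermediate pairs need to move across the network.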
MapReduce in Hadoop
Scaling Up vs. Scaling Out: Scaling up: increase processing on one machine (high-end server processor, RAM, storage); Scaling out: many machines running in parallel on the same local network
Hadoop 1.x Success: Became the open source Apache Hadoop project; Other synergistic tools were developed to leverage or extend MapReduce and HDFS (e.g. Pig, Hive); Leverages the inherent openness and scalability of Linux; Many real world problems fit the MapReduce model; Extremely successful in Internet scale applications: Yahoo (30,000+ node Hadoop cluster), Facebook (100 PB cluster), many others (Netflix, New York Times, Government)
Hadoop 1.x Limitations: MapReduce is not ideal for many applications; Limited to batch-oriented processing, so not well suited for interactive or real-time / streaming applications; Resource management constrained by JobTracker architecture; Scalability bottlenecks in Mapper & Reducer utilization
Hadoop 2.x Improvements: Hadoop 2.0 released in Fall 2013; Apache YARN (Yet Another Resource Negotiator) becomes a sub-project of Apache Hadoop and a core component of Hadoop 2.x; De-couples MapReduce resource management & scheduling from data processing; Enables Hadoop to serve as a general data processing platform that is not constrained to MapReduce and batch-oriented applications; Resources are managed similarly to how an operating system handles jobs; Hadoop now supports a broader array of applications including real-time streaming data (Apache Storm) and interactive querying (Apache Tez)
Hadoop 2.x Improvements: HDFS improvements (HDFS2); Java API changes; Ambari: graphical tool for cluster creation, configuration management, administration & monitoring; Additional open source projects built on top of YARN (graph processing, search, in-memory computing, stream processing); Hadoop now looks much more like an OS for Big Data; Currently very early in the cycle of adoption & conversion
Hadoop 1.0 to 2.0 Source: http://hortonworks.com/labs/yarn/
Hadoop 1.0 http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
Hadoop 2.0 http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
YARN Applications: Hadoop 1.0: JobTracker responsible for resource management, job scheduling, and job monitoring within a MapReduce compute model; Hadoop 2.0: a central ResourceManager and the per-node NodeManager form a generic system for managing distributed applications of any type; ResourceManager is the scheduler/arbiter of cluster resources
YARN Applications: One ApplicationMaster per application (e.g. MapReduce, Graph Processing, Message Passing), responsible for negotiating resources, executing and monitoring compute jobs ("Containers"), and tracking status; Generalized resource model (hostname, rackname, CPU, memory); Containers can be tuned to the application's needs; Applications can launch any type of process, not just Java or MapReduce
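The division of labor described on the two YARN slides above can be illustrated with a toy allocator. This is a hedged sketch, not the YARN API: the classes `Node`, `Container`, and `ResourceManager`, the first-fit placement policy, and the `preferred_host` parameter are all assumptions made for the example. It shows only the core idea that an application asks a central arbiter for containers sized by memory and CPU, optionally with a host preference.

```python
from dataclasses import dataclass, replace


@dataclass
class Node:
    """Free capacity reported for one worker host (hypothetical model)."""
    hostname: str
    memory_mb: int
    vcores: int


@dataclass
class Container:
    """A slice of one node's resources granted to an application."""
    app_id: str
    hostname: str
    memory_mb: int
    vcores: int


class ResourceManager:
    """Toy central arbiter: tracks free capacity and grants containers first-fit."""

    def __init__(self, nodes):
        # Copy the nodes so the caller's objects are not mutated.
        self.free = {node.hostname: replace(node) for node in nodes}

    def allocate(self, app_id, memory_mb, vcores, preferred_host=None):
        hosts = list(self.free)
        if preferred_host in self.free:
            # Try the preferred host first (data locality), then the rest.
            hosts.remove(preferred_host)
            hosts.insert(0, preferred_host)
        for host in hosts:
            node = self.free[host]
            if node.memory_mb >= memory_mb and node.vcores >= vcores:
                node.memory_mb -= memory_mb
                node.vcores -= vcores
                return Container(app_id, host, memory_mb, vcores)
        return None  # No node can satisfy the request right now.

    def release(self, container):
        node = self.free[container.hostname]
        node.memory_mb += container.memory_mb
        node.vcores += container.vcores
```

In this model an application's master process would call `allocate` repeatedly and `release` containers as its tasks finish; nothing in the allocator depends on what runs inside a container, which is the point of the generalized resource model.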
Hadoop 2.0 / YARN Source: http://hortonworks.com/get-started/yarn/
Hadoop: OS for Big Data? Linux Kernel / Apache Hadoop: core foundation; Need other open tools that run on Linux / Hadoop: Linux: GNU utilities, GUI / X-Windows, Programming Languages (Perl, Python, PHP), Servers (Web, Samba, Database), Applications; Hadoop: Pig, Hive, HBase, Storm, Kafka, Ambari, YARN, Tez. Distribution ("Distro"): Linux: Red Hat, Fedora, CentOS, SUSE, Debian, Ubuntu, Mint; Hadoop: Hortonworks Data Platform (HDP 2.4), Cloudera Distribution for Hadoop (CDH 5.0), MapR Enterprise, Pivotal HD
Streaming / Realtime Analytics: Advertising: analyzing 100K data points every second and making a decision on each in 10 ms; Finance: processing continuous market ticks, transactions, and news events for trading decisions (beating your rival to the trigger); Cyber / IT: monitoring network packets and log data for breaches; Social Media: analyzing tweets on a keyword from the Twitter API
Streaming Tech Challenge: SQL systems can't maintain those data rates; Hadoop MapReduce: the parallelism that works well for huge data sets has too much latency; Other potential approaches: In-memory database; CEP systems, e.g. Tibco; Size and/or scale-out limited; Out-of-sequence data poses an additional challenge
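To make the out-of-sequence problem on the slide above concrete, here is one common way to handle it: buffer events briefly and release them in timestamp order once they are older than a fixed lateness allowance. This is a minimal sketch under stated assumptions, not a technique the slides prescribe; the function name `reorder`, the `allowed_lateness` parameter, and the policy of dropping events that arrive too late are all choices made for the example.

```python
import heapq


def reorder(events, allowed_lateness):
    """Yield (timestamp, payload) events in timestamp order.

    An event is held back until the newest timestamp seen so far is at least
    `allowed_lateness` ahead of it. An event that arrives after later events
    have already been emitted is dropped.
    """
    heap = []
    max_seen = None
    last_emitted = None
    for seq, (ts, payload) in enumerate(events):
        if last_emitted is not None and ts < last_emitted:
            continue  # Too late: emitting it now would break the ordering.
        # seq breaks ties so payloads never need to be comparable.
        heapq.heappush(heap, (ts, seq, payload))
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - allowed_lateness
        while heap and heap[0][0] <= watermark:
            ts_out, _, payload_out = heapq.heappop(heap)
            last_emitted = ts_out
            yield ts_out, payload_out
    # End of input: flush whatever is still buffered, in order.
    while heap:
        ts_out, _, payload_out = heapq.heappop(heap)
        yield ts_out, payload_out
```

The trade-off is explicit in the parameter: a larger `allowed_lateness` tolerates more disorder but adds latency and memory, which is why out-of-sequence data is a real cost in low-latency systems.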
Possible Solution, Ingestion: Apache Kafka. Kafka is a distributed, partitioned, replicated commit log service that uses a producer-consumer model; Designed for real time activity streams (initially at LinkedIn); Maintains feeds of messages in topics; Can provide stronger guarantees of message arrival & order
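The commit-log model on the slide above can be sketched as an in-memory toy. This does not use or imitate the Kafka client API: the classes `Topic` and `Consumer`, the CRC32-based partition choice, and the single-process storage are assumptions for illustration, and replication is left out entirely. What it does show is why ordering holds per key: every message with the same key is appended to the same partition, and each consumer tracks its own read offset.

```python
import zlib


class Topic:
    """Toy partitioned, append-only log (no replication, single process)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition_for(self, key):
        # A stable hash sends every message with the same key to the same
        # partition. CRC32 is an arbitrary choice made for this sketch.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def produce(self, key, value):
        """Append a message and return its (partition, offset)."""
        partition = self._partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Reads a topic sequentially, remembering one offset per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self, partition, max_records=10):
        log = self.topic.partitions[partition]
        start = self.offsets[partition]
        records = log[start:start + max_records]
        self.offsets[partition] = start + len(records)
        return records
```

Because the log is never modified in place, a second `Consumer` on the same topic starts from offset 0 and sees the same messages in the same order, independent of the first.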
Possible Solution, Realtime Computation: Apache Storm. Storm is a distributed realtime computation system; Storm makes it easy to reliably process unbounded streams of data; It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate (Storm came from Twitter); A benchmark clocked it at over a million tuples processed per second per node; A Storm cluster is similar to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm you run topologies; A topology is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes
Possible Solution, Realtime Computation: Apache Storm. The basic primitives Storm provides for doing stream transformations are spouts and bolts. A spout is a source of streams, e.g. a spout may connect to the Twitter API and emit a stream of tweets. A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Networks of spouts and bolts are packaged into a topology, which is the top-level abstraction that you submit to Storm clusters for execution
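The spout and bolt primitives described above map naturally onto Python generators. The sketch below is an analogy, not Storm's API: `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical names, the "topology" is a simple linear chain rather than a general graph, and everything runs in one process with a finite list standing in for an unbounded stream.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: a source of tuples. A finite list stands in for a live feed."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: consume sentences and emit one lowercase word per tuple."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count so far) per word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(spout, bolts):
    """Wire the spout through each bolt in order and collect the output."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)


if __name__ == "__main__":
    spout = sentence_spout(["Storm processes streams", "streams never end"])
    for word, count in run_topology(spout, [split_bolt, count_bolt]):
        print(word, count)
```

Each bolt emits results as tuples arrive instead of waiting for the whole input, which is the key contrast with a batch MapReduce job.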
Possible Solution: Persistence: In-memory database: MemSQL, Spark (with Shark?); NoSQL: Cassandra, MongoDB, CouchDB, Redis; Archival / Batch: Hadoop / HDFS. Integration / Resource Management: YARN. Configuration: ZooKeeper. Administration / Monitoring: Ambari. A variant of this architecture has been utilized in AWS to process enormous data rates / volumes on click streams, Twitter messages, and other Internet streams
Problems of Interest in Cyber: Log Aggregation and Analysis; System Monitoring / Management; Real Time Threat Detection / Mitigation; Privacy / Confidentiality; Data Leakage / Exfiltration; Espionage / Counter Intelligence; Insider Threat; Compliance
Problems of Interest in Cyber: Malware / APT; Battlefield Logistics; Tracking Targets of Interest; Forensics; Criminal Behavior; Movement of Money; Strategy Testing; Relationship (Social) Graphing
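As one small illustration of the log analysis and real-time threat detection items listed above, the sketch below flags a source address that produces too many failed logins within a sliding time window. Everything specific here is an assumption for the example: the `(timestamp, source_ip, outcome)` event shape, the `"FAIL"` marker, the function name `detect_bruteforce`, and the threshold values. A production detector would need far more than this.

```python
from collections import defaultdict, deque


def detect_bruteforce(events, threshold, window_seconds):
    """Return (timestamp, source_ip, failures_in_window) for each alert.

    `events` is an iterable of (timestamp_seconds, source_ip, outcome)
    tuples in time order; only outcomes equal to "FAIL" are counted.
    """
    recent = defaultdict(deque)  # source_ip -> timestamps of recent failures
    alerts = []
    for ts, ip, outcome in events:
        if outcome != "FAIL":
            continue
        window = recent[ip]
        window.append(ts)
        # Discard failures that have aged out of the sliding window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) >= threshold:
            alerts.append((ts, ip, len(window)))
    return alerts


if __name__ == "__main__":
    # Addresses come from the 192.0.2.0/24 documentation range.
    log = [
        (0, "192.0.2.5", "FAIL"),
        (10, "192.0.2.5", "FAIL"),
        (15, "192.0.2.9", "FAIL"),
        (20, "192.0.2.5", "FAIL"),
        (25, "192.0.2.5", "OK"),
    ]
    print(detect_bruteforce(log, threshold=3, window_seconds=60))
```

The per-address state is small and updated one event at a time, so the same logic could sit inside a streaming bolt instead of a batch job.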
Analytics Project Considerations: With the Hadoop 2 Framework and Ecosystem, you can do analytics at any scale and time frame; For complex analytics, data science provides a wide range of algorithms for various categories of problems; Public (AWS) and private (OpenStack, VMware) cloud computing frameworks have evolved to big data scale; You don't need to create anything; But you do need to choose the right tools for the job; And architecting & scaling for production will still be a challenge at scale
Analytics Project Considerations: What is the problem to be solved? What data exists / what is missing? Cost / benefit of data analytics; Does the analytic make sense / is it valid? How will I utilize the results (decisions)? How do I represent them meaningfully (reports, dashboards, visualizations)? Real data is always messy!
Data Analytics Project Teams
Analytics Project Considerations: System Administration; Ongoing Operations; Security; Legal & Policy; Governance; Ethics; Social / Political Ramifications
Contact Information: UMBC Training Centers; Phone: (443) 692-6600; Web: www.umbctraining.com; Data Analytics Training Programs: http://umbc.tc/data; Jon Lau, Vice President: (443) 692-6597, jlau@umbctraining.com; UMBC Training Centers, 6996 Columbia Gateway Drive, Columbia, MD 21046