Big Data in Test and Evaluation by Udaya Ranawake (HPCMP PETTT/Engility Corporation) Approved for Public Release. Distribution Unlimited.
Data Intensive Applications in T&E
- Win-T at ATC
- Automotive Data Analyzer at ATC
- Blue Force Tracker at WSMR
Big Data Pipeline
Computer Resources for Data Analysis
DoD Supercomputing Resource Centers (DSRCs):
- AFRL DSRC: US Air Force Research Lab, Wright-Patterson AFB, OH
- ARL DSRC: US Army Research Lab, Aberdeen Proving Ground, MD
- ERDC DSRC: US Army Eng. Research and Dev. Ctr., Vicksburg, MS
- NAVY DSRC: Navy DoD Supercomputing Resource Ctr., Stennis Space Center, MS
- MHPCC DSRC: Maui High Performance Computing Ctr., Maui, HI
In addition to the five supercomputing centers, HPCMP also supports several affiliated resource centers.
Pershing Supercomputer at ARL:
- Total Nodes: 1092
- Cores/Node: 16
- Core Type: Intel 8-core Sandy Bridge
- Core Speed: 2.6 GHz
- Memory/Node: 32 GB
- Interconnect: FDR-10 InfiniBand
- OS: Red Hat Linux
Data Analysis Techniques - MPI
Data analysis can be sped up using MPI-based parallel programming techniques.
- Programming languages: C/C++, MATLAB, and Python
- Data arrays are distributed across multiple processors.
- Processors communicate with each other using message passing.
- The MPI message passing library provides routines for send, receive, gather, scatter, broadcast, barrier, and more.
Example: computing the sum of two arrays using multiple processors (a sketch follows below).
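A minimal Python sketch of the array-sum example using mpi4py; the array length, chunk sizes, and use of NumPy are illustrative assumptions rather than part of the original material, and the script assumes mpi4py and NumPy are installed (run with, e.g., "mpirun -n 4 python sum_arrays.py").

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1_000_000  # total array length (assumed divisible by the number of ranks)

if rank == 0:
    # The root rank holds the two full input arrays.
    a = np.arange(n, dtype="d")
    b = np.arange(n, dtype="d")
else:
    a = b = None

# Scatter: distribute an equal-sized chunk of each array to every rank.
local_a = np.empty(n // size, dtype="d")
local_b = np.empty(n // size, dtype="d")
comm.Scatter(a, local_a, root=0)
comm.Scatter(b, local_b, root=0)

# Each rank computes its portion of the element-wise sum in parallel.
local_c = local_a + local_b

# Gather: collect the partial results back onto the root rank.
c = np.empty(n, dtype="d") if rank == 0 else None
comm.Gather(local_c, c, root=0)

if rank == 0:
    print("first 5 elements of a + b:", c[:5])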
Data Analysis Techniques - MapReduce
- A technique introduced by Google to process large data sets in parallel.
- Consists of a map step, a shuffle step, and a reduce step (see the sketch below).
- MapReduce is not guaranteed to be fast; its advantages are scalability and fault tolerance.
- Works best on clusters configured for it, for example with data stored locally on the compute nodes.
- Apache Hadoop is a popular open-source implementation of a complete distributed framework; it includes Hadoop Common, HDFS, Hadoop YARN, and Hadoop MapReduce.
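A minimal word-count sketch of the three steps in plain Python; the in-memory grouping stands in for the shuffle step, and the sample input lines are invented for illustration. On Hadoop, the same mapper/reducer logic would typically run across the cluster (for example via Hadoop Streaming).

from collections import defaultdict

def map_step(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_step(word, counts):
    # Reduce: combine all counts emitted for the same word.
    return word, sum(counts)

def run_local_mapreduce(lines):
    # Shuffle: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_step(line):
            groups[key].append(value)
    return dict(reduce_step(k, v) for k, v in groups.items())

if __name__ == "__main__":
    sample = ["big data in test and evaluation",
              "big data pipeline"]
    print(run_local_mapreduce(sample))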
Database Techniques
- SQL-based relational databases have traditionally been used for organizing and managing data.
- A growing number of organizations are switching to non-relational (NoSQL) databases.
- NoSQL databases are distributed and scale to hundreds of processors.
- Used for applications that do not require ACID compliance.
- Startup cost may be low, but operational cost may be high.
- Several types of NoSQL database technologies are currently available (a small document-store example follows below):
Key-Value Store
  Names: DynamoDB, LevelDB, Berkeley DB
  Key Features: Primary-key access only. Simple API. Scalable and reliable.
Column-Oriented
  Names: Hadoop/HBase, Cassandra, Amazon SimpleDB
  Key Features: Stores contents by column. Suitable for aggregating data. Uses MapReduce. Scalable and reliable.
Dictionary Based
  Names: MongoDB, CouchDB
  Key Features: Used with loosely structured data. Reduced complexity. Adapts to changes. Scalable and reliable.
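As a small illustration of the dictionary-based (document store) category, a sketch using MongoDB through the pymongo driver; the database name, collection name, and sample document are assumptions, and a MongoDB server is assumed to be running locally on the default port.

from pymongo import MongoClient

# Connect to a locally running MongoDB instance (assumed, default port 27017).
client = MongoClient("localhost", 27017)
collection = client["test_data"]["sensor_runs"]  # hypothetical database/collection names

# Documents are loosely structured: fields can vary from record to record.
collection.insert_one({"run_id": 42, "vehicle": "test_truck_1", "max_speed_kph": 87.5})

# Simple key-based lookup on a field of the stored document.
doc = collection.find_one({"run_id": 42})
print(doc)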
Real Time Computing with Big Data
- T&E may involve real-time processing of data streams (such as sensor data).
- Storm is a software framework for distributed real-time computation.
- A Storm cluster consists of spouts (sources of streams) and bolts (which perform intermediate processing and emit new streams); a conceptual sketch follows below.
- Scalable and fault tolerant.
- Can integrate with a database such as HBase.
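A conceptual plain-Python sketch of the spout/bolt idea; this does not use the actual Storm API (which is Java-based), and the simulated sensor readings and alert threshold are invented for illustration.

import random
import time

def sensor_spout(n_readings=5):
    # Spout: the source of the stream; here it emits simulated sensor tuples.
    for _ in range(n_readings):
        yield {"sensor_id": "temp_1",
               "reading": random.uniform(20.0, 120.0),
               "t": time.time()}

def threshold_bolt(stream, limit=100.0):
    # Bolt: consumes tuples, performs intermediate processing, emits a new stream.
    for tup in stream:
        if tup["reading"] > limit:
            yield {"alert": "over_limit", **tup}

# Wire the spout to the bolt, as a Storm topology would, and consume the output.
for alert in threshold_bolt(sensor_spout()):
    print(alert)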
User Productivity Enhancement, Technology Transfer and Training (PETTT)
The HPCMP PETTT Program can support DoD T&E users with their big data computing needs.
PETTT Mission:
- Enhances the productivity of the DoD HPC user community.
- Transfers computational and computing technology into DoD from other government, industrial, and academic communities.
- Delivers training and supports DoD users through education, knowledge, access, and HPC tools to maximize productivity.
- Complements existing laboratory and test center expertise with 34 onsite computational specialists.
Acknowledgement This work was wholly supported by the High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement, Technology Transfer and Training (PETTT) Program, executed under contract GS04T09DBC0017 by the Engility Corporation.