Big Data in Test and Evaluation by Udaya Ranawake (HPCMP PETTT/Engility Corporation)



Big Data in Test and Evaluation by Udaya Ranawake (HPCMP PETTT/Engility Corporation) Approved for Public Release. Distribution Unlimited.

Data Intensive Applications in T&E
- WIN-T at ATC
- Automotive Data Analyzer at ATC
- Blue Force Tracker at WSMR

Big Data Pipeline

Computer Resources for Data Analysis

DOD Supercomputing Resource Centers (DSRC)

Name          Location
AFRL DSRC     US Air Force Research Lab, Wright-Patterson AFB, OH
ARL DSRC      US Army Research Lab, Aberdeen Proving Ground, MD
ERDC DSRC     US Army Eng. Research and Dev. Ctr., Vicksburg, MS
NAVY DSRC     Navy DOD Supercomputing Resource Ctr., Stennis Space Center, MS
MHPCC DSRC    Maui High Performance Computing Ctr., Maui, HI

In addition to the five supercomputing centers, HPCMP also supports several affiliated resource centers.

Pershing Supercomputer at ARL

Total Nodes     1092
Cores/Node      16
Core Type       Intel 8 Core Sandy Bridge
Core Speed      2.6 GHz
Memory/Node     32 GB
Interconnect    FDR-10 InfiniBand
OS              Red Hat Linux

Data Analysis Techniques - MPI
- Data analysis can be sped up using MPI-based parallel programming techniques.
- Programming languages: C/C++, Matlab and Python.
- Data arrays are distributed across multiple processors.
- Processors communicate with each other using message passing.
- Message passing library: MPI.
- MPI has library routines for send, receive, gather, scatter, broadcast, barrier, etc.
- Example: computing the sum of two arrays using multiple processors.
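The array-sum example named on this slide can be sketched without an MPI installation. The following is a minimal single-process simulation in Python, not the presenter's code. It assumes the sum is element-wise, and the helper names `scatter`, `gather`, and `parallel_array_sum` are hypothetical. Each chunk stands in for the portion of the arrays that one MPI rank would hold; in a real MPI program the distribution and collection would be done by the library's scatter and gather routines.

```python
def scatter(data, nprocs):
    """Split data into nprocs contiguous chunks, one per simulated rank.

    Chunk sizes differ by at most one element, so the work is balanced
    even when len(data) is not a multiple of nprocs.
    """
    base, extra = divmod(len(data), nprocs)
    chunks, start = [], 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def gather(chunks):
    """Concatenate per-rank chunks back into one list, in rank order."""
    return [value for chunk in chunks for value in chunk]


def parallel_array_sum(a, b, nprocs):
    """Element-wise sum of a and b using the scatter / local add / gather pattern."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    a_chunks = scatter(a, nprocs)
    b_chunks = scatter(b, nprocs)
    # Each simulated rank adds only its own chunk; no rank sees the full arrays.
    local_sums = [
        [x + y for x, y in zip(a_chunk, b_chunk)]
        for a_chunk, b_chunk in zip(a_chunks, b_chunks)
    ]
    return gather(local_sums)
```

Because each rank touches only its own chunk, the local additions are independent of one another, which is what lets the work spread across processors.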

Data Analysis Techniques - MapReduce
- A technique introduced by Google to process large data sets in parallel.
- Consists of a map step, a shuffle step, and a reduce step.
- MapReduce is not guaranteed to be fast. Its advantages are scalability and fault tolerance.
- Works best with clusters configured in a special way.
- Apache Hadoop is a popular open source implementation of a complete distributed framework.
- Hadoop includes Hadoop Common, HDFS, Hadoop YARN and Hadoop MapReduce.
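The three steps can be illustrated with a small in-memory word count. This is a sketch in plain Python under stated assumptions, not Hadoop code: the function names `map_step`, `shuffle_step`, `reduce_step`, and `map_reduce` are hypothetical, and everything runs in one process, whereas a real framework would run the map and reduce calls on different nodes.

```python
from collections import defaultdict


def map_step(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_step(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_step(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    mapped = [pair for record in records for pair in map_step(record)]
    grouped = shuffle_step(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())
```

Each `map_step` call depends only on its own record and each `reduce_step` call only on its own key, so both can be distributed; the shuffle is the one step that moves data between workers.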

Database Techniques
- SQL-based relational databases have traditionally been used for organizing and managing data.
- A growing number of organizations are switching to non-relational (or NoSQL) databases.
- NoSQL databases are distributed and scale to hundreds of processors.
- Used for applications that do not require ACID compliance.
- Startup cost may be low but operational cost may be high.
- Several types of NoSQL database technologies are currently available.

Technology          Names                                       Key Features
Key-Value Store     DynamoDB, LevelDB, Berkeley DB              Primary-key access only. Simple API. Scalable and reliable.
Column-Oriented     Hadoop/HBase, Cassandra, Amazon SimpleDB    Stores contents by column. Suitable for aggregating data. Uses MapReduce. Scalable and reliable.
Dictionary-based    MongoDB, CouchDB                            Used with loosely structured data. Reduced complexity. Adapts to changes. Scalable and reliable.
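The remark that column-oriented stores suit aggregation can be made concrete with a toy layout change. The sketch below is plain Python, not the API of any of the listed databases. The field names (`test_id`, `speed_kph`, `temp_c`) and the helpers `to_columns` and `column_mean` are hypothetical, and every row is assumed to carry the same fields.

```python
def to_columns(rows):
    """Convert a list of row dictionaries into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_mean(columns, name):
    """Aggregate a single column without reading any other column."""
    values = columns[name]
    return sum(values) / len(values)


# Hypothetical test records stored row by row.
rows = [
    {"test_id": 1, "speed_kph": 40.0, "temp_c": 21.5},
    {"test_id": 2, "speed_kph": 50.0, "temp_c": 22.0},
    {"test_id": 3, "speed_kph": 60.0, "temp_c": 22.5},
]

columns = to_columns(rows)
mean_speed = column_mean(columns, "speed_kph")
```

With the row layout, computing an average means visiting every field of every record; with the column layout, only the one list being aggregated is read.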

Real Time Computing with Big Data
- T&E may involve real-time processing of data streams (such as sensor data).
- Storm is a software framework for distributed real-time computation.
- A Storm topology consists of spouts (sources of streams) and bolts (which do intermediate processing and emit new streams).
- Scalable and fault tolerant.
- Can integrate with a database like HBase.
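The spout-and-bolt structure can be mimicked with Python generators. This is an illustrative single-process sketch only and does not use the Storm API. The names `sensor_spout`, `threshold_bolt`, `count_bolt`, and `run_topology`, the `(sensor_id, value)` tuple format, and the threshold are all hypothetical.

```python
def sensor_spout(readings):
    """Spout: the source of the stream, emitting one (sensor_id, value) tuple at a time."""
    for reading in readings:
        yield reading


def threshold_bolt(stream, limit):
    """Bolt: pass along only the tuples whose value exceeds the limit."""
    for sensor_id, value in stream:
        if value > limit:
            yield sensor_id, value


def count_bolt(stream):
    """Bolt: emit a running count of over-limit readings per sensor."""
    counts = {}
    for sensor_id, _value in stream:
        counts[sensor_id] = counts.get(sensor_id, 0) + 1
        yield sensor_id, counts[sensor_id]


def run_topology(readings, limit):
    """Wire spout -> threshold bolt -> count bolt and collect the final stream."""
    stream = sensor_spout(readings)
    stream = threshold_bolt(stream, limit)
    stream = count_bolt(stream)
    return list(stream)
```

Each stage consumes tuples as they arrive and emits a new stream, so results are available before the input ends. A real deployment would add what this sketch lacks: parallel instances of each bolt and recovery from failures.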

User Productivity Enhancement, Technology Transfer and Training (PETTT)

The HPCMP PETTT Program can provide support to DOD T&E users with their big data computing needs.

PETTT Mission
- Enhances the productivity of the DoD HPC user community.
- Transfers computational and computing technology into DoD from other government, industrial, and academic communities.
- Delivers training and supports DoD users through education, knowledge, access, and HPC tools to maximize productivity.
- Complements existing laboratory and test center expertise with 34 onsite computational specialists.

Acknowledgement

This work was wholly supported by the High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement, Technology Transfer and Training (PETTT) Program, executed under contract GS04T09DBC0017 by the Engility Corporation.