Cloud Computing Where ISR Data Will Go for Exploitation




Cloud Computing Where ISR Data Will Go for Exploitation
22 September 2009
Albert Reuther, Jeremy Kepner, Peter Michaleas, William Smith

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Outline
- Introduction
  - Persistent surveillance requirements
  - Data intensive cloud computing
- Cloud Supercomputing
- Integration with Supercomputing System
- Preliminary Results
- Summary

Persistent Surveillance: The New Knowledge Pyramid
[Figure: knowledge pyramid rising from Raw Data through Images, Detects, Tracks, Risk, and Tip/Queue to Act, fed by Global Hawk, Predator, and Shadow-200 platforms]
DoD missions must exploit:
- High resolution sensors
- Integrated multi-modal data
- Short reaction times
- Many net-centric users

Bluegrass Dataset (detection/tracking)
[Figure: GMTI, wide-area EO, and high-res EO (Twin Otter) collections with vehicle ground truth and cues]
- Terabytes of data; multiple classification levels; multiple teams
- Enormous computation to test new detection and tracking algorithms

Persistent Surveillance Data Rates
- Persistent surveillance requires watching large areas to be most effective
- Surveilling large areas produces enormous data streams
- Must use distributed storage and exploitation

Cloud Computing Concepts
Data Intensive Computing
- Compute architecture for large scale data analysis
  - Billions of records/day, trillions of stored records, petabytes of storage
  - Google File System 2003, Google MapReduce 2004, Google BigTable 2006
- Design parameters
  - Performance and scale
  - Optimized for ingest, query and analysis
  - Co-mingled data
  - Relaxed data model
  - Simplified programming
Utility Computing
- Compute services for outsourcing IT
  - Concurrent, independent users operating across millions of records and terabytes of data
  - IT as a Service: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS)
- Design parameters
  - Isolation of user data and computation
  - Portability of data with applications
  - Hosting traditional applications
  - Lower cost of ownership
  - Capacity on demand

Advantages of Data Intensive Cloud: Disk Bandwidth
[Figure: traditional architecture moves data from a central store to scheduled compute nodes; cloud architecture replicates data on the nodes and sends the computation to them]
- Cloud computing moves computation to the data
- Good for applications where time is dominated by reading from disk
- Replaces expensive shared-memory hardware and proprietary database software with cheap clusters and open source
- Scalable to hundreds of nodes

Outline
- Introduction
- Cloud Supercomputing
  - Cloud stack
  - Distributed file systems
  - Distributed database
  - Distributed execution
- Integration with Supercomputing System
- Preliminary Results
- Summary

Cloud Software: Hybrid Software Stacks
- Cloud implementations can be developed from a large variety of software components
  - Many packages provide overlapping functionality
- Effective migration of DoD to a cloud architecture will require mapping core functions to the cloud software stack
  - Most likely a hybrid stack with many component packages
- MIT-LL has developed a dynamic cloud deployment architecture on its computing infrastructure
  - Examining performance trades across software components
[Figure: layered stack from hardware and Linux OS up through HDFS, job control, MapReduce, HBase, application frameworks, and cloud/app services]
Components under study:
- Distributed file systems: file-based (Sector) and block-based (Hadoop DFS)
- Distributed database: HBase
- Compute environment: Hadoop MapReduce

P2P File System (e.g., Sector)
[Figure: client connects over SSL to the Manager and Security Server; data flows between the client and the Workers, which hold data files from sensors]
- Low-cost, file-based, read-only, replicating, distributed file system
- Manager maintains metadata of the distributed file system
- Security Server maintains permissions of the file system
- Good for mid-sized files (megabytes)

Parallel File System (e.g., Hadoop DFS)
[Figure: client exchanges metadata with the Namenode and data with the Datanodes]
- Low-cost, block-based, read-only, replicating, distributed file system
- Namenode maintains metadata of the distributed file system
- Good for very large files (gigabytes)
  - Tarballs of lots of small files (e.g., HTML)
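
As a point of reference, the sketch below reads a file through the standard Hadoop FileSystem client API; the Namenode address and file path are illustrative placeholders, not part of the system described here.

```java
// Minimal sketch: reading a file from HDFS through the standard client API.
// The fs.defaultFS address and the file path are illustrative placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example:8020"); // hypothetical Namenode address

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/ingest/sensor-tarball-0001.tar"))))) {
            // The Namenode resolves the path to block locations; the client then
            // streams each block directly from the Datanodes that hold replicas.
            System.out.println(in.readLine());
        }
    }
}
```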

Distributed Database (e.g., HBase)
[Figure: client exchanges metadata with the Namenode and data with the Datanodes]
- Database tablet components spread over the distributed block-based file system
- Optimized for insertions and queries
- Stores metadata harvested from sensor data (e.g., keywords, locations, file handle, ...)
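
To make the insert/query pattern concrete, here is a hedged sketch of storing and fetching harvested metadata through the HBase client API (shown with the modern Connection/Table interface, which differs in older releases). The table name, column family, and row-key scheme are invented for illustration.

```java
// Minimal sketch: inserting and querying sensor metadata in HBase.
// Table name, column family, and row-key scheme are illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MetadataStoreExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("sensor_metadata"))) {

            // Insert: row key = sensor id + timestamp, columns hold harvested metadata.
            Put put = new Put(Bytes.toBytes("uav42#20090922T120000"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("location"), Bytes.toBytes("42.36,-71.09"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("file_handle"), Bytes.toBytes("/sector/uav42/img_0001.dat"));
            table.put(put);

            // Query: fetch the metadata row back by key.
            Result r = table.get(new Get(Bytes.toBytes("uav42#20090922T120000")));
            System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("meta"), Bytes.toBytes("location"))));
        }
    }
}
```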

Distributed Execution (e.g., Hadoop MapReduce, Sphere)
[Figure: client submits a job via the Namenode; Map and Reduce instances run on the Datanodes]
- Each Map instance executes locally on a block of the specified files
- Each Reduce instance collects and combines results from Map instances
- No communication between Map instances
- All intermediate results are passed through Hadoop DFS
- Used to process ingested data (metadata extraction, etc.)
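
The generic job below illustrates the execution model just described: Map tasks run locally against blocks of the input files and Reduce tasks combine their results, with intermediate data flowing through the DFS. It is a plain keyword count, not the metadata ingester used in this work.

```java
// Minimal sketch of the MapReduce model: each Map task runs locally on one
// block of the input, Reduce tasks combine the partial results.
// Generic keyword count for illustration only.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeywordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    ctx.write(new Text(token), ONE); // emit (keyword, 1) from the local block
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum)); // combine partial counts from all Map tasks
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "keyword count");
        job.setJarByClass(KeywordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```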

Hadoop Cloud Computing Architecture
[Figure: LLGrid cluster running Sector/Sphere workers and Hadoop Datanodes alongside a combined Hadoop Namenode / Sector Manager / Sphere JobMaster node, with numbered arrows corresponding to the sequence below]
Sequence of actions:
1. Active folders register intent to write data to Sector. The Manager replies with the Sector worker addresses to which data should be written.
2. Active folders write data to Sector workers.
3. The Manager launches the Sphere MapReduce-coded metadata ingester onto the Sector data files.
4. MapReduce-coded ingesters insert metadata into the Hadoop HBase database.
5. The client submits queries on HBase metadata entries.
6. The client fetches data products from Sector workers.

Examples
[Figure: schedulers dispatching C/C++ jobs in the central-store and data-replicated configurations]
- Compare accessing data from:
  - Central parallel file system (500 MB/s effective bandwidth)
  - Local RAID file system (100 MB/s effective bandwidth)
- In the data intensive case, each data file is stored on local disk in its entirety
- Only considering disk access time
  - Assume no network bottlenecks
  - Assume simple file system accesses

E/O Photo Processing App Model
- Two stages:
  1. Determine features in each photo
  2. Correlate features between the current photo and every other photo
- Photo size: 4.0 MB each
- Feature results file size: 4.0 MB each
- Total photos: 30,000

Persistent Surveillance Tracking App Model
- Each processor tracks a region of ground in a series of images
- Results are saved in the distributed file system
- Image size: 16 MB
- Track results: 100 kB
- Number of images: 12,000
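
A back-of-envelope sketch of the first-stage read times for these two models, using the bandwidth figures from the Examples slide; the 100-node cluster size is an assumption added for illustration, and only disk reads are counted.

```java
// Back-of-envelope sketch of stage-one read time for the two app models,
// using the bandwidth figures from the Examples slide. The 100-node cluster
// size is an assumption (not from the slides); only disk reads are counted.
public class ReadTimeSketch {
    public static void main(String[] args) {
        double centralMBps = 500.0;   // central parallel file system, aggregate
        double localMBps   = 100.0;   // local RAID, per node
        int nodes          = 100;     // assumed cluster size

        // E/O photo model: 30,000 photos x 4 MB each read in stage one.
        double photoMB = 30_000 * 4.0;
        // Tracking model: 12,000 images x 16 MB each.
        double trackMB = 12_000 * 16.0;

        for (double totalMB : new double[] {photoMB, trackMB}) {
            double centralSec = totalMB / centralMBps;          // all nodes share one store
            double localSec   = totalMB / (localMBps * nodes);  // each node reads its own replica
            System.out.printf("%.0f MB: central %.0f s, local (x%d nodes) %.1f s%n",
                    totalMB, centralSec, nodes, localSec);
        }
    }
}
```

Under these assumptions the replicated local-disk case scales with the node count, which is the point of the disk-bandwidth comparison above.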

Outline
- Introduction
- Cloud Supercomputing
- Integration with Supercomputing System
  - Cloud scheduling environment
  - Dynamic Distributed Dimensional Data Model (D4M)
- Preliminary Results
- Summary

Cloud Scheduling
Two layers of Cloud scheduling:
- Scheduling the entire Cloud environment onto compute nodes
  - Cloud environment on a single node as a single process
  - Cloud environment on a single node as multiple processes
  - Cloud environment on multiple nodes (static node list)
  - Cloud environment instantiated through a scheduler, including Torque/PBS/Maui, SGE, LSF (dynamic node list)
- Scheduling MapReduce jobs onto nodes in the Cloud environment
  - First come, first served
  - Priority scheduling
  - No scheduling for non-MapReduce clients
  - No scheduling of parallel jobs

Cloud vs Parallel Computing
- Parallel computing APIs assume all compute nodes are aware of each other (e.g., MPI, PGAS, ...)
- Cloud computing APIs assume a distributed computing programming model (compute nodes only know about the manager)
- However, cloud infrastructure assumes parallel computing hardware (e.g., Hadoop DFS allows direct communication between nodes for file block replication)
- Challenge: how to get the best of both worlds?

D4M: Parallel Computing on the Cloud
[Figure: client connects over SSL to the Manager and Security Server; data flows between the client and the Workers]
- D4M launches traditional parallel jobs (e.g., pMatlab) onto the Cloud environment
- Each process of the parallel job is launched to process one or more documents in the DFS
- Launches jobs through a scheduler such as LSF, PBS/Maui, SGE
- Enables more tightly-coupled analytics
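
The sketch below is not the D4M implementation; it only illustrates the pattern described above, in which each process of a traditional parallel job claims a disjoint share of the files in the DFS based on a rank and process count supplied by the scheduler. The environment variable names and input path are assumptions.

```java
// Illustrative sketch (not D4M itself): each process of a parallel job is
// given a rank and a process count by the launcher (LSF, PBS/Maui, SGE),
// lists a DFS directory, and processes only its own share of the files.
// Environment variable names and the input path are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerProcessFileShare {
    public static void main(String[] args) throws Exception {
        int rank = Integer.parseInt(System.getenv().getOrDefault("PROC_RANK", "0"));
        int size = Integer.parseInt(System.getenv().getOrDefault("PROC_COUNT", "1"));

        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] files = fs.listStatus(new Path("/ingest/images")); // hypothetical directory

        // Round-robin assignment: process i handles files i, i+size, i+2*size, ...
        for (int i = rank; i < files.length; i += size) {
            System.out.println("rank " + rank + " processing " + files[i].getPath());
            // ... per-file analytic would run here ...
        }
        fs.close();
    }
}
```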

Outline
- Introduction
- Cloud Supercomputing
- Integration with Supercomputing System
- Preliminary Results
  - Distributed file systems
  - D4M progress
- Summary

Distributed Cloud File Systems on TX-2500 Cluster
[Figure: service nodes (shared network storage, LSF-HPC resource manager/scheduler, Rocks management, 411, web server, Ganglia) plus distributed file system metadata and data nodes, connected to the LLAN]
Cluster hardware: 432 PowerEdge 2850 nodes, each with dual 3.2 GHz EM64-T Xeon (P4), 8 GB RAM, two Gig-E Intel interfaces, an InfiniBand interface, and six 300-GB disk drives
Totals: 432+5 nodes, 864+10 CPUs, 3.4 TB RAM, 0.78 PB of disk, 28 racks

MIT-LL Cloud:
                         Hadoop DFS   Sector
  Number of nodes used   350          350
  File system size       298.9 TB     452.7 TB
  Replication factor     3            2

D4M on LLGrid
[Figure: client connects over SSL to the Manager and Security Server; data flows between the client and the Workers]
- Demonstrated D4M on Hadoop DFS
- Demonstrated D4M on Sector DFS
- D4M on HBase (in progress)

Summary
- Persistent surveillance applications will over-burden our current computing architectures
  - Very high data rates
  - Highly parallel, disk-intensive analytics
  - Good candidate for Data Intensive Cloud Computing
- Components of Data Intensive Cloud Computing
  - File- and block-based distributed file systems
  - Distributed databases
  - Distributed execution
- Lincoln has Cloud experimentation infrastructure
  - Created >400 TB DFS
  - Developing D4M to launch traditional parallel jobs on the Cloud environment

Backups

Outline
- Introduction
  - Persistent surveillance requirements
  - Data intensive cloud computing
- Cloud Supercomputing
  - Cloud stack
  - Distributed file systems
  - Computational paradigms
  - Distributed database-like hash stores
- Integration with supercomputing system
  - Scheduling cloud environment
  - Dynamic Distributed Dimensional Data Model (D4M)
- Preliminary results
- Summary

What is LLGrid?
[Figure: service nodes, compute nodes, cluster switch, network storage, resource manager, configuration server, and web site (FAQs) connected through the LAN switch to the Lincoln LAN and users]
- LLGrid is a ~300 user, ~1700 processor system
- World's only desktop interactive supercomputer
  - Dramatically easier to use than any other supercomputer
- Highest fraction of staff (20%) using supercomputing of any organization on the planet
- Foundation of the Lincoln and MIT Campus joint vision for Engaging Supercomputing

Diverse Computing Requirements
[Figure: mission flows (SIGINT; SAR and GMTI detection & tracking; EO, IR, hyperspectral, ladar exploitation) feeding algorithm prototyping (front end, back end, exploitation) and processor prototyping (embedded, cloud/grid, graph) toward decision support]

  Stage:         Signal & image processing /   Detection & tracking        Exploitation
                 calibration & registration
  Algorithms:    Front end signal & image      Back end signal & image     Graph analysis / data mining /
                 processing                    processing                  knowledge extraction
  Data:          Sensor inputs                 Dense arrays                Graphs
  Kernels:       FFT, FIR, SVD, ...            Kalman, MHT, ...            BFS, DFS, SSSP, ...
  Architecture:  Embedded                      Cloud/Grid                  Cloud/Grid/Graph
  Efficiency:    25% - 100%                    10% - 25%                   < 0.1%

Elements of Data Intensive Computing
- Distributed file system
  - Hadoop HDFS: block-based data storage
  - Sector FS: file-based data storage
- Distributed execution
  - Hadoop MapReduce: independently parallel compute model
  - Sphere: MapReduce for Sector FS
  - D4M: Dynamic Distributed Dimensional Data Model
- Lightly-structured data store
  - Hadoop HBase: distributed (hashed) data tables