Real-time Process Network Sonar Beamformer



Similar documents
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

Architecture of Hitachi SR-8000

RevoScaleR Speed and Scalability

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Virtuoso and Database Scalability

Why Threads Are A Bad Idea (for most purposes)

Real Time Programming: Concepts

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Microkernels, virtualization, exokernels. Tutorial 1 CSC469

Scalability and Classifications

Multi-GPU Load Balancing for Simulation and Rendering

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)

An Oracle White Paper July Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide

Overview and History of Operating Systems

Azul Compute Appliances

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:

Computer Graphics Hardware An Overview

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

Chapter 6, The Operating System Machine Level

Operating System Tutorial

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

MAQAO Performance Analysis and Optimization Tool

Overlapping Data Transfer With Application Execution on Clusters

Matrix Multiplication

Operating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest

CHAPTER 15: Operating Systems: An Overview

A Comparison of Task Pools for Dynamic Load Balancing of Irregular Algorithms

<Insert Picture Here> Oracle In-Memory Database Cache Overview

JBoss Seam Performance and Scalability on Dell PowerEdge 1855 Blade Servers

Distribution One Server Requirements

Chapter 18: Database System Architectures. Centralized Systems

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

How To Understand The History Of An Operating System

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Benchmarking Large Scale Cloud Computing in Asia Pacific

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK

Parallel Programming Survey

Stream Processing on GPUs Using Distributed Multimedia Middleware

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Informatica Ultra Messaging SMX Shared-Memory Transport

LEOSTREAM. Case Study Remote Access to High-Performance Applications

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Kernel comparison of OpenSolaris, Windows Vista and. Linux 2.6

Tableau Server 7.0 scalability

Processor Capacity Reserves: An Abstraction for Managing Processor Usage

Performance and scalability of a large OLTP workload

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Capacity Estimation for Linux Workloads

Parametric Analysis of Mobile Cloud Computing using Simulation Modeling

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

Latency Considerations for 10GBase-T PHYs

Cluster Computing at HRI

LSKA 2010 Survey Report Job Scheduler

3 - Introduction to Operating Systems

End-user Tools for Application Performance Analysis Using Hardware Counters

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data

HP Storage Essentials Storage Resource Management Software end-to-end SAN Performance monitoring and analysis

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Web Server (Step 1) Processes request and sends query to SQL server via ADO/OLEDB. Web Server (Step 2) Creates HTML page dynamically from record set

WITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Technical Paper. Performance and Tuning Considerations for SAS on Pure Storage FA-420 Flash Array

Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors

Thread level parallelism

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Digital Imaging and Multimedia. Filters. Ahmed Elgammal Dept. of Computer Science Rutgers University


Trends in High-Performance Computing for Power Grid Applications

Contents Introduction... 5 Deployment Considerations... 9 Deployment Architectures... 11

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

A Comparison of Distributed Systems: ChorusOS and Amoeba

Lecture 2 Parallel Programming Platforms

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

GiantLoop Testing and Certification (GTAC) Lab

Simplest Scalable Architecture

Performance brief for IBM WebSphere Application Server 7.0 with VMware ESX 4.0 on HP ProLiant DL380 G6 server

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Web Server Architectures

Module: Software Instruction Scheduling Part I

NVIDIA GeForce GTX 580 GPU Datasheet

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database

Lecture 3 Theoretical Foundations of RTOS

Course Development of Programming for General-Purpose Multicore Processors

Design and Implementation of a Storage Repository Using Commonality Factoring. IEEE/NASA MSST2003 April 7-10, 2003 Eric W. Olsen

Performance Monitoring of Parallel Scientific Applications

LEVERAGING FPGA AND CPLD DIGITAL LOGIC TO IMPLEMENT ANALOG TO DIGITAL CONVERTERS

Scaling in a Hypervisor Environment

ultra fast SOM using CUDA

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

A Deduplication File System & Course Review

theguard! ApplicationManager System Windows Data Collector

Transcription:

Real-time Process Network Sonar Gregory E. Allen Applied Research Laboratories gallen@arlut.utexas.edu Brian L. Evans Dept. Electrical and Computer Engineering bevans@ece.utexas.edu The University of Texas at Austin, Austin, TX 78712-1084 http://signal.ece.utexas.edu/ Poster.fm Copyright 1998, The University of Texas. All rights reserved. Introduction Problem: Beamforming is computationally intensive Example: High-resolution sonar beamformers (GFLOPS) Generation 1: expensive custom digital hardware (large state machines) Generation 2: custom integration of programmable digital signal processors on commercial-off-the-shelf hardware (e.g. 120 DSPs in a VME rack) Objective: Unix workstation beamformers Analysis: evaluate performance of beamforming kernels and systems Modeling: capture parallelism, guarantee determinate bounded execution Implementation: use portable, scalable software to achieve real-time performance on commodity hardware and lower development costs Solution: Real-time beamforming on workstations Analysis: optimize kernels and profile beamfomers to measure scalability Modeling: Process Networks Implementation: Real-time POSIX threads using C++ on symmetric multiprocessor UltraSPARC-II workstation with native signal processing

Time-Domain Beamforming Delay and sum weighted outputs Geometrically project the elements onto a line to compute the time delays Projection for a beam pointing 20 off axis M Σ b(t) = α i x i (t τ i ) i=1 20 15 b(t) beam outputi xi(t) ith output τi ith delay y position, inches 10 5 20 αi ith weight 0-5 element projected element -20-15 -10-5 0 5 10 15 20 x position, inches Beamforming Quantized time delays perturb beam pattern Sample at just above the Nyquist rate Interpolate to obtain desired time-delay resolution Sample at interval Interpolate up to interval δ = /L α 1 Time delay at interval δ A/D Interpolate z-n 1 α M Σ b[n] A/D Interpolate z-n M Sensor Array Kernel implementation on UltraSPARC-II Highly optimized C++ (loop unrolling and SPARCompiler5.0EA) Currntly operating at 60% of peak, which is 2 FLOPs per cycle

Vertical Beamforming stave Multiple vertical transducers for every horizontal position Each vertical column is combined into a stave No time delay or interpolation is required Staves are calculated by a simple integer dot product Integer-to-float conversion must be performed Output must be interleaved Kernel implementation on UltraSPARC-II Native signal processing with Visual Instruction Set (VIS) Software prefetching to hide memory latency and keep the pipeline full Formal Design Methodology The Process Network model [Kahn, 1974] Superset of flow models of computation Captures concurrency and parallelism Provides correctness Guarantees determinate execution of the program A program is represented as a directed graph A P B Each node is an independent process Each edge is a one-way queue of Blocking reads, non-blocking writes, infinite queues Scheduling for bounded queues is possible [Parks, 1995] Blocking reads and writes Dynamically increase queue capacities to prevent artificial deadlock Fits the thread model of concurrent programming

Process Network Implementation Each node corresponds to a thread Implemented in C++ using POSIX Pthreads Low-overhead, high-performance, scalable Granularity larger than a thread context switch (~10 us) Symmetric multiprocessing operating system dynamically schedules threads Efficient utilization of multiple processors Optimize queues for high-throughput signal processing Nodes operate directly on queue memory, avoiding copying Queues use mirroring to keep contiguous Mirrored Queue region Mirror region Compensates for the lack of circular address buffers Queues trade-off memory usage for overhead Virtual memory manager maintains circularity System Implementation Vertical beamformer forms 3 sets of 80 staves from 10 vertical elements per stave Each horizontal beamformer forms 61 beams from the 80 staves, using a two-point interpolation filter 4 GFLOPS total computation Element Stave Fan 0 Three-fan Vertical Fan 1 500 MFLOPS Fan 2 40 MB/sec each 1200 MFLOPS each

Performance Results 100 trial mean execution time for 2.6 seconds of Sun Ultra Enterprise 4000 with eight 336-MHz UltraSPARC-IIs, 2 Gb RAM, running Solaris 2.6 seconds Implementation Time (s) MFLOPS Mbytes thread pool 3.607 3024.8 832 process network 3.354 3252.8 654 Execution time and MFLOPS vs CPUs 35 30 25 20 15 10 3500 3000 2500 2000 1500 1000 MFLOPS Process network is 7% faster than thread pool, overhead is small Process network uses 20% less memory with lower latency Process network scalability is nearly linear Will continue to scale with additional CPUs Real-time performance achievable with 12 CPUs 5 500 0 1 2 3 4 5 6 7 8 0 CPUs Conclusion Third generation beamformers: Workstation hardware Commodity hardware saves development/manufacturing costs Multiprocessor servers, native signal processing Upgradable hardware, Moore s Law Software model: Process Networks Captures parallelism, guarantees determinate bounded execution Portable, reusable, scalable C++ code High-performance, low overhead POSIX threads Symmetric multiprocessing operating system The example 4-GFLOPS 1-Gb 3-D sonar beamforming system does execute in real time using a Sun Ultra Enterprise 4000 server with twelve 336 MHz UltraSPARC- II CPUs with 14.5% to spare http://www.ece.utexas.edu/~allen/beamforming/