Program Coupling and Parallel Data Transfer Using PAWS




Program Coupling and Parallel Data Transfer Using PAWS
Sue Mniszewski, CCS-3
Pat Fasel, CCS-3
Craig Rasmussen, CCS-1
http://www.acl.lanl.gov/paws

2 Primary Goal of PAWS
Provide the ability to share data between two separate, parallel applications, using the maximum bandwidth available.

3 Challenges
- Coupling of parallel applications
- Avoid serialization
- Provide several synchronization strategies
- Connect parallel applications with unequal numbers of processes
- Be extensible to new, user-defined parallel distribution strategies

4 PAWS Design Goals
- Rewrite of PAWS 1, with the same general philosophy
- Uses the Cheetah/MPI run-time
- C++, C, and Fortran APIs
- Separation of layout and data (Ports & Views)
- Still single-threaded
- Channel abstraction
- Action caching (no barriers!)

5 PAWS Port Concept
- Separates layout from data: a port has a shape, but no data type
- Ports are connected, not data, so multiple data objects can share a single port
- Scalar ports for int, float, double, string
- View ports for rectilinear parallel data

6 Sample PAWS Environment
[Diagram: an application and a visualization client, coordinated by the PAWS Controller, exchange data over a parallel transfer.]

7 PAWS Controller
[Diagram: the PAWS Controller holds an application database and a port database.]
- Database with registered applications
- Database with registered ports
- Initializes connections between ports

8 PAWS Application
Must be told where to find the Controller. Uses the PAWS API to:
- Register the app and its ports with the Controller
- Query the Controller databases
- Make connections
- Send/receive data
- Disconnect ports or the app from the Controller

9 Running Applications
[Diagram: the PAWS launcher or a batch system (bsub) runs a script that starts App1, the PAWS Controller, and App2 via mpirun; the processes form Cheetah/MPI groups.]

10 PAWS Networks
The PAWS launcher/Controller accepts a script to start apps and connect ports. For a Controller with registered apps ta1, ta2, and ta3 (owning ports A, B1, B2, and C), the script below connects ta1's port A to ta2's port B1, connects ta2's port B2 to ta3's port C, and then starts all three apps:

connectpawsports ta1::a ta2::b1
connectpawsports ta2::b2 ta3::c
paws go ta1 ta2 ta3

11 Parallel Redistribution
Each application can partition its data differently:
- Different numbers of processes
- Different parallel layouts

12 PAWS Data Model for Rectilinear Data
Parallel data objects in PAWS have two parts:
- A global domain G and a list of subdomains { Di }
- Allocated blocks of memory, and mappings (known as views) from memory to the { Di } subdomains
[Diagram: a global domain G partitioned into subdomains D1, D2, D3.]
G and { Di } are all that is needed to compute a message schedule. Views and allocated memory locations can change.
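To make the two-part model concrete, here is a minimal C++ sketch of a global domain plus its subdomain list, in one dimension for brevity. The Domain and RectilinearData names are illustrative assumptions, not the PAWS API.

#include <vector>

// Illustrative sketch (not the PAWS API): a parallel data object as a
// global domain G plus the list of subdomains { Di }.
struct Domain { int lo, hi; };       // inclusive index range

struct RectilinearData {
    Domain g;                        // global domain G
    std::vector<Domain> d;           // subdomains { Di }
};

int main() {
    // G = [0, 299] split into three equal subdomains D1, D2, D3,
    // as in the figure above.
    RectilinearData a;
    a.g = {0, 299};
    a.d = { {0, 99}, {100, 199}, {200, 299} };
    return 0;
}

Because the model carries only G and { Di }, two coupled applications can exchange just these descriptions to derive a message schedule, without ever sharing views or memory addresses.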

13 Example of G and { Di } in PAWS
[Figure: a global domain G covered by three rectilinear subdomains D1, D2, D3.]

14 PAWS Data Model: Views
Views describe how data is mapped from user storage to the subdomain blocks.
[Diagram: views V1 and V2 map the user's allocated storage onto subdomains D1 and D2 of the PAWS data model.]

15 PAWS Data Model: Types of Views
Ways that we can specify the view domain Vi:
- Strided: a normal D-dimensional domain, zero-based. 1-D example: Vi = {0, Asize-1, 1} means all of Ai.
- Indirection list: a list of M indirection points, each a D-dimensional zero-based index into the Ai domain (future). Example: Vi = { {3,4}, {8,1}, {4,4} } is a 3-element list.
- Function view: generate the data from a user-provided function (or functor), instead of copying from memory (future).
The user could implement his/her own type of view; this will be the user-extensibility method in PAWS.
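As a concrete rendering of the strided case above, the sketch below encodes a view domain Vi as a {first, last, stride} triple and walks it over a 1-D array; the StridedView type is a hypothetical illustration, not the PAWS interface.

#include <cstdio>

// Hypothetical encoding of a strided, zero-based view domain.
struct StridedView { int first, last, stride; };

int main() {
    const int Asize = 10;
    int A[Asize] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

    StridedView all = {0, Asize - 1, 1};    // Vi = {0, Asize-1, 1}: all of A
    for (int i = all.first; i <= all.last; i += all.stride)
        std::printf("%d ", A[i]);           // prints 0 1 2 ... 9
    std::printf("\n");

    StridedView odd = {1, Asize - 1, 2};    // every second element
    for (int i = odd.first; i <= odd.last; i += odd.stride)
        std::printf("%d ", A[i]);           // prints 1 3 5 7 9
    std::printf("\n");
    return 0;
}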

16 PAWS Data: Row- or Column-Major
- Organization of data in memory (row- or column-major) is a characteristic of the data storage, not the data model.
- The data model always stores sizes in the same format: dimension information is listed left-to-right, first dimension to last.
- Row- or column-major info is only used to calculate the final location in memory during transfers.
- So row- or column-major is a characteristic of a view; specify it when you specify a view.
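A short sketch of the indexing rule this slide describes: dimension sizes are stored identically either way, and the layout flag only changes how an (i, j) element is turned into a memory offset during a transfer. The offset helper is hypothetical, not PAWS code.

#include <cassert>

// Offset of element (i, j) in an nrows x ncols block, under either layout.
long offset(long i, long j, long nrows, long ncols, bool rowMajor) {
    return rowMajor ? i * ncols + j      // C-style: rows contiguous
                    : j * nrows + i;     // Fortran-style: columns contiguous
}

int main() {
    // The same logical element (1, 2) of a 3 x 4 block lands at
    // different memory locations under the two layouts.
    assert(offset(1, 2, 3, 4, true)  == 6);   // row-major
    assert(offset(1, 2, 3, 4, false) == 7);   // column-major
    return 0;
}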

17 Registering a Port
Information specified when you register a port from the user's program:
- Initial data model info: G and { Di }
- Input or output
- Port type: scalar or distributed

18 Connecting Ports
What info do the two sides of a connection share?
- Data model info (G and { Di }) for both sides
- The dynamic/static nature of the layout on both sides
- Send/receive message schedules, if they can be computed (see the sketch below)
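One way such a schedule could be computed is sketched below, under the assumption that every non-empty intersection of a sender subdomain with a receiver subdomain becomes one message; the Block type and intersect helper are illustrative, not the PAWS internals.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Block { int lo[2], hi[2]; };   // 2-D rectilinear block, inclusive bounds

// Intersect two blocks; returns false if they do not overlap.
bool intersect(const Block& a, const Block& b, Block& out) {
    for (int d = 0; d < 2; ++d) {
        out.lo[d] = std::max(a.lo[d], b.lo[d]);
        out.hi[d] = std::min(a.hi[d], b.hi[d]);
        if (out.lo[d] > out.hi[d]) return false;
    }
    return true;
}

int main() {
    // Sender splits an 8 x 8 global domain into two column blocks;
    // the receiver splits the same domain into two row blocks.
    std::vector<Block> send = { {{0, 0}, {3, 7}}, {{4, 0}, {7, 7}} };
    std::vector<Block> recv = { {{0, 0}, {7, 3}}, {{0, 4}, {7, 7}} };

    for (size_t s = 0; s < send.size(); ++s)
        for (size_t r = 0; r < recv.size(); ++r) {
            Block m;
            if (intersect(send[s], recv[r], m))
                std::printf("sender %zu -> receiver %zu: [%d..%d] x [%d..%d]\n",
                            s, r, m.lo[0], m.hi[0], m.lo[1], m.hi[1]);
        }
    return 0;
}

Each line of output corresponds to one message in the schedule; since only G and { Di } are needed, the schedule can be computed once at connect time and reused until a resize event invalidates it.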

19 Synchronous Data Transfer
- Data is transferred when both sides indicate they are ready: one side does a send(), the other a receive().
- Before transfer, users must describe G, { Di }, and the assignment of each Di to the processors when registering a Port.
- The user provides the views, Vi, and pointers to the allocated blocks of memory, in the send() and receive() calls.

20 PAWS Code Example

Send (C++):

// REGISTER WITH PAWS
Paws::Application app(argc, argv);

// CREATE PORTS
Paws::Representation rep(wholedom, mydom, myrank);
Paws::Port A("A", rep, Paws::PAWS_OUT, Paws::PAWS_DISTRIBUTED);
Paws::Port I("I", Paws::PAWS_OUT, Paws::PAWS_SCALAR);
app.addport(&I);
app.addport(&A);

// ALLOCATE DATA AND CREATE VIEW
int* data = new int[mysize];
Paws::ViewArray<int> view(Paws::PAWS_COLUMN);
view.addviewblock(data, myallocdom, myviewdom);

// READY
app.ready();

// SEND DATA
int iter = 100;
I.send(&iter);
for (i = 0; i < iter; i++) {
  A.send(view);
}

// CLEANUP
app.finalize();

Receive (C):

/* REGISTER WITH PAWS */
paws_initialize(argc, argv);

/* CREATE PORTS */
int rep = paws_representation(wholedom, mydom, myrank);
int B = paws_view_port("B", rep, PAWS_IN);
int J = paws_scalar_port("J", PAWS_IN);

/* ALLOCATE DATA AND CREATE VIEW */
int* data = (int*) malloc(mysize * sizeof(int));
int view = paws_view(PAWS_INTEGER, PAWS_ROW);
paws_add_view_block(view, data, myallocdom, myviewdom);

/* READY */
paws_ready();

/* RECEIVE DATA */
paws_scalar_receive(J, &iter, PAWS_INTEGER, 1);
for (i = 0; i < iter; i++) {
  paws_view_receive(B, view);
}

/* CLEANUP */
paws_finalize();

21 Dynamically Resizing Data
What if one side decides to resize or repartition?
- The other side must find out about the new size or distribution
- A new message schedule must be calculated
- Memory may have to be reallocated
PAWS provides an interface that:
- Lets the user find out the new global domain G and subdomains { Di } for the other side of the connection
- Lets the user reallocate their data if necessary
- Supports resize events from either side of the connection

22 Resize Code Example

Send:

Application app(argc, argv);

// CREATE EMPTY REP AND VIEW
Representation rep(myrank);
Port A("A", rep, PAWS_OUT, PAWS_DISTRIBUTED);
ViewArray<int> view(PAWS_COLUMN);

app.ready();

// RESIZE DATA ON EVERY SEND
for (i = 0; i < iter; i++) {
  int* data = new int[newmysize];
  rep.update(newwholedom, newmydom);
  view.update(data, newmyallocdom, newmyviewdom);
  A.resize(rep);
  A.send(view);
}

app.finalize();

Receive:

Application app(argc, argv);

// CREATE EMPTY REP AND VIEW
Representation rep(myrank);
Port B("B", rep, PAWS_IN, PAWS_DISTRIBUTED);
ViewArray<int> view(PAWS_COLUMN);

app.ready();

// WAIT FOR SIZE ON EVERY RECEIVE
for (i = 0; i < iter; i++) {
  B.resizeWait();
  newwholedom = B.domain();
  newmydom = redistribute(newwholedom, rank);
  int* data = new int[newmysize];
  rep.update(newwholedom, newmydom);
  view.update(data, newmyallocdom, newmyviewdom);
  B.update(rep);
  B.receive(view);
}

app.finalize();

23 Preliminary Results
[Chart: PAWS timing comparison for 128 x 128 x 128 data over 100 iterations; synchronous send/receive time (sec) plotted against the number of processors (send:receive).]

24 Future Directions
- Asynchronous data transfer: get(), unlock(), lock(), wait(), check()
- MPI-2 communications
- More parallel data structures: particles, unstructured, trees, etc.
- Better scripting interface
- Common Component Architecture (CCA)-style framework