Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems




IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997

Jehoshua Bruck, Senior Member, IEEE, Ching-Tien Ho, Member, IEEE, Shlomo Kipnis, Member, IEEE, Eli Upfal, Senior Member, IEEE, and Derrick Weathersby

Abstract: We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has $k \ge 1$ ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth.

In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the ith block of processor j with the jth block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP-1 parallel system.

In the concatenation operation among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
Index Terms: All-to-all broadcast, all-to-all personalized communication, complete exchange, concatenation operation, distributed-memory system, index operation, message-passing system, multiscatter/gather, parallel system.

J. Bruck is with the California Institute of Technology, Mail Code 136-93, Pasadena, CA 91125. E-mail: bruck@paradise.caltech.edu.
C.-T. Ho and E. Upfal are with IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120. E-mail: {ho, upfal}@almaden.ibm.com.
S. Kipnis is with News Datacom Research Ltd., 4 Wedgewood St., Haifa 34635, Israel. E-mail: skipnis@ndc.co.il.
D. Weathersby is with the Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195. E-mail: derrick@cs.washington.edu.
Manuscript received 6 Apr. 1994; revised 7 Apr. 1997. For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 8.

1 INTRODUCTION

Collective communication operations [] are communication operations that generally involve more than two processors, as opposed to the point-to-point communication between two processors. Examples of collective communication operations include: (one-to-all) broadcast, scatter, gather, index (all-to-all personalized communication), and concatenation (all-to-all broadcast). See [3], [6] for a survey of collective communication algorithms on various networks with various communication models.

The need for collective communication arises frequently in parallel computation. Collective communication operations simplify the programming of applications for parallel computers, facilitate the implementation of efficient communication schemes on various machines, promote the portability of applications across different architectures, and reflect conceptual grouping of processes. In particular, collective communication is used extensively in many scientific applications for which the interleaving of stages of local computation with stages of global communication is possible (see []).

This paper studies the design of all-to-all communication algorithms, namely, collective operations in which every processor both sends data to and receives data from every other processor.
In particular, we focus on two widely used operations: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). The algorithms described here are incorporated into the Collective Communication Library (CCL) [], which was designed and developed for the new IBM line of scalable parallel computers. The first computer in this line, the IBM 9076 Scalable POWERparallel System 1 (SP1), was announced in February 1994.

1.1 Definitions and Applications

INDEX: The system consists of n processors $p_0, p_1, \ldots, p_{n-1}$. Initially, each processor $p_i$ has n blocks of data $B[i, 0], B[i, 1], \ldots, B[i, n-1]$, where every block $B[i, j]$ is of size b. The goal is to exchange block $B[i, j]$ (the jth data block of processor $p_i$) with block $B[j, i]$ (the ith data block of processor $p_j$), for all $0 \le i, j \le n-1$. The final

result is that each processor $p_i$, for $0 \le i \le n-1$, holds blocks $B[0, i], B[1, i], \ldots, B[n-1, i]$.

CONCATENATION: The system consists of n processors $p_0, p_1, \ldots, p_{n-1}$. Initially, each processor $p_i$ has a block of data $B[i]$ of size b. The goal is to make the concatenation of the n data blocks, namely, $B[0]\,B[1] \cdots B[n-1]$, known to all the n processors.

Both the index and concatenation operations are used extensively in distributed-memory parallel computers and are included in the Message-Passing Interface (MPI) standard proposal [4]. (The index operation is referred to as MPI_Alltoall in MPI, while the concatenation is referred to as MPI_Allgather in MPI.) For example, the index operation can be used for computing the transpose of a matrix, when the matrix is partitioned into blocks of rows (or columns) with different blocks residing on different processors. Thus, the index operation can be used to support the remapping of arrays in HPF compilers, such as remapping the data layout of a two-dimensional array from (block, *) to (cyclic, *), or from (block, *) to (*, block). The index operation is also used in FFT algorithms [], in Ascend and Descend algorithms [6], in the Alternating Direction Implicit (ADI) method [], and in the solution of Poisson's problem by the Fourier Analysis Cyclic Reduction (FACR) method [8], [3], or the two-dimensional FFT method [8]. The concatenation operation can be used in matrix multiplication [9] and in basic linear algebra operations [].

1.2 Communication Model

We assume a model of a multiport fully connected message-passing system. The assumption of full connectivity means that each processor can communicate directly with any other processor and that every pair of processors is equally distant. The assumption of multiple ports means that, in every communication step (or round), each processor can send k distinct messages to k processors and simultaneously receive k messages from k other processors, for some $k \ge 1$. Throughout the paper, we assume $k \le n-1$, where n is the number of processors in the system. The multiport model generalizes the one-port model that has been widely investigated.
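As a reference point for the definitions in Section 1.1, the end state of the two operations can be written down directly, ignoring communication entirely. The sketch below is only a specification of the expected result, not one of the algorithms of this paper; the function names `index_reference` and `concatenation_reference` are hypothetical, and each block is modeled as an opaque Python value.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def index_reference(blocks: Sequence[Sequence[T]]) -> List[List[T]]:
    """End state of the index operation.

    blocks[i][j] models B[i, j], the j-th block held by processor p_i.
    In the result, processor p_i holds B[0, i], B[1, i], ..., B[n-1, i],
    i.e. result[i][j] == blocks[j][i] (a block-level transpose).
    """
    n = len(blocks)
    if any(len(row) != n for row in blocks):
        raise ValueError("each of the n processors must hold exactly n blocks")
    return [[blocks[j][i] for j in range(n)] for i in range(n)]


def concatenation_reference(blocks: Sequence[T]) -> List[List[T]]:
    """End state of the concatenation operation.

    blocks[i] models B[i], the single block held by processor p_i.
    In the result, every processor holds B[0] B[1] ... B[n-1].
    """
    return [list(blocks) for _ in blocks]
```

Any candidate implementation of either operation can be compared against these two functions on small inputs.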
There are examples of parallel systems with k-port capabilities for k > 1, such as the nCUBE/2, the CM-2 (where k is the dimension of the hypercube in both machines), and transputer-based machines.

Such a fully connected model addresses emerging trends in many modern distributed-memory parallel computers and message-passing communication environments. These trends are evident in systems such as IBM's Vulcan [6], MIT's J-Machine [], NCUBE's nCUBE/2 [5], Thinking Machines' CM-5 [9], and IBM's 9076 Scalable POWERparallel System, and in environments such as IBM EUI [], PICL [4], PARMACS [7], Zipcode [7], and Express [3]. These systems and environments generally ignore the specific structure and topology of the communication network and assume a fully connected collection of processors, in which each processor can communicate directly with any other processor by sending and receiving messages. The fact that this model does not assume any single topology makes it general and flexible. For instance, this model allows the development of algorithms that are portable between different machines, that can operate within arbitrary and dynamic subsets of processors, and that can operate in the presence of faults (assuming connectivity is maintained). In addition, algorithms developed for this model can also be helpful in designing algorithms for specific topologies.

We use the linear model [3] to estimate the communication complexity of our algorithms. In the linear model, the time to send an m-byte message from one processor to another, without congestion, can be modeled as $T = \beta + m\tau$, where $\beta$ is the overhead (start-up time) associated with each send or receive operation, and $\tau$ is the communication time for sending each additional byte (or any appropriate data unit).

For convenience, we define the following two terms in order to estimate the time complexities of our communication algorithms in the linear model:

$C_1$: the number of communication steps (or rounds) required by an algorithm. $C_1$ is an important measure when the communication start-up time is high relative to the transfer time of one unit of data, and the message size per send/receive operation is relatively small.

$C_2$: the amount of data (in the appropriate unit of communication: bytes, flits, or packets) transferred in a sequence.
Specifically, let $m_i$ be the largest size of a message (over all ports of all processors) sent in round i. Then, $C_2$ is the sum of all the $m_i$'s over all rounds i. $C_2$ is an important measure when the start-up time is small compared to the message size.

Thus, in our fully connected, linear model, an algorithm has an estimated communication time complexity of $T = C_1\beta + C_2\tau$, with $\beta$ the start-up time and $\tau$ the per-unit transfer time.

It should be noted that there are more detailed communication models, such as the BSP model [3], the Postal model [3], and the LogP model [9], which further take into account that a receiving processor generally completes its receive operation later than the corresponding sending processor finishes its send operation. However, designing practical and efficient algorithms in these models is substantially more complicated. Another important issue is the uniformity of the implementation. For example, in the LogP model, the design of collective communication algorithms is based on P, the number of processors. Optimal algorithms for two distinct values of P may be very different. This presents a challenge when the goal is to support collective communication algorithms for processor groups with various sizes while using one collective communication library.

1.3 Main Contributions and Organization

We study the complexity of the index and concatenation operations in the k-port fully connected message-passing model. We derive lower bounds and develop algorithms for these operations. The following is a description of our main results:

Lower bounds: Section 2 provides lower bounds on the complexity measures $C_1$ and $C_2$ for both the concatenation and the index operations.

For the concatenation operation, we show that any algorithm requires $C_1 \ge \lceil \log_{k+1} n \rceil$ communication rounds and sends $C_2 \ge b(n-1)/k$ units of data. For the index operation, we show that any algorithm requires $C_1 \ge \lceil \log_{k+1} n \rceil$ communication rounds and sends $C_2 \ge b(n-1)/k$ units of data. We also show that, when n is a power of k + 1, any index algorithm that uses the minimal number of communication rounds (i.e., $C_1 = \log_{k+1} n$) must transfer $C_2 \ge \frac{bn}{k+1}\log_{k+1} n$ units of data. Finally, we show that, in the one-port model, if the number of communication rounds $C_1$ is $O(\log n)$, then $C_2$ must be $\Omega(bn \log n)$.

Index algorithms: Section 3 describes a class of efficient algorithms for the index operation among n processors. This class of algorithms is designed for arbitrary values of n and features a trade-off between the start-up time (measure $C_1$) and the data transfer time (measure $C_2$). Using a parameter r, where $2 \le r \le n$, the communication complexity measures of the algorithms are $C_1 \le \lceil (r-1)/k \rceil \lceil \log_r n \rceil$ and $C_2 \le b \lceil (r-1)/k \rceil \lceil n/r \rceil \lceil \log_r n \rceil$. Note that, following our lower bound results, optimal $C_1$ and $C_2$ cannot be obtained simultaneously. To increase the performance of the index operation, the parameter r can be carefully chosen as a function of the start-up time $\beta$, the data transfer rate $\tau$, the message size b, the number of processors n, and the number of ports k. Two special cases of this class are of particular interest: one case exhibits the minimal number of communication rounds (i.e., $C_1$ is minimized to $\lceil \log_{k+1} n \rceil$ by choosing r = k + 1), and another case features the minimal amount of data transferred (i.e., $C_2$ is minimized to $b(n-1)/k$ by choosing r = n). The one-port version of the index algorithm was implemented on IBM's SP-1 to confirm the existence of the trade-off between $C_1$ and $C_2$. It should be noted that, when n is a power of two, there are known algorithms for the index operation which are based on the structure of a hypercube (see [5], [], [8]). However, none of these algorithms can be easily generalized to values of n that are not powers of two without losing efficiency. The idea of a trade-off between $C_1$ and $C_2$ is not new and has been applied to hypercubes in [5], [8].
Concatenation algorithms: Section 4 presents algorithms for the concatenation operation in the k-port model. These algorithms are optimal for any values of n, b, and k, except for the following range: $b \ge 3$, $k \ge 3$, and $(k+1)^d - k < n < (k+1)^d$, for some d. (Thus, if b = 1 or k = 1, which covers most practical cases, our algorithm is optimal.) In this special range, we achieve either optimal $C_2$ and suboptimal $C_1$ (one more than the lower bound $\lceil \log_{k+1} n \rceil$), or optimal $C_1$ and suboptimal $C_2$ (at most b - 1 more than the lower bound $b(n-1)/k$).

Pseudocode: Appendices A and B provide pseudocode for the index and concatenation algorithms, respectively, in the one-port model. Both the index and concatenation operations were included in the Collective Communication Library [] of the External User Interface (EUI) [] for the 9076 Scalable POWERparallel System (SP1) by IBM. In addition, these one-port versions of the algorithms have been implemented on various additional software platforms including PVM [5] and Express [3].

2 LOWER BOUNDS

This section provides lower bounds on the complexity measures $C_1$ and $C_2$ for algorithms that perform the concatenation and index operations. Proposition 2.1 was shown in [3]. We include it here for completeness.

2.1 Lower Bounds for the Concatenation Operation

PROPOSITION 2.1. In the k-port model, for $k \ge 1$, any concatenation algorithm requires $C_1 \ge \lceil \log_{k+1} n \rceil$ communication rounds.

PROOF. Focus on one particular processor, say, processor $p_0$. The concatenation operation requires, among other things, that the data block $B[0]$ of processor $p_0$ be broadcast among the n processors. With k communication ports per processor, data block $B[0]$ can reach at most $(k+1)^d$ processors in d communication rounds. For $(k+1)^d$ to be at least n, we must have $d \ge \lceil \log_{k+1} n \rceil$ communication rounds.

PROPOSITION 2.2. In the k-port model, for $k \ge 1$, any concatenation algorithm transfers $C_2 \ge b(n-1)/k$ units of data.

PROOF. Each processor must receive the n - 1 data blocks of the other n - 1 processors, the combined size of which is b(n - 1) units of data. Since each processor can use its k input ports simultaneously, the amount of data transferred through one of the input ports must be at least $b(n-1)/k$.

2.2 Lower Bounds for the Index Operation

PROPOSITION 2.3. In the k-port model, for $k \ge 1$, any index algorithm requires $C_1 \ge \lceil \log_{k+1} n \rceil$ communication rounds.

PROOF.
Any concatenation operation on an array $B[i]$, $0 \le i < n$, can be reduced to an index operation on $B[i, j]$, $0 \le i, j < n$, by letting $B[i, j] = B[i]$ for all i and j. Thus, the proposition follows from Proposition 2.1.

PROPOSITION 2.4. In the k-port model, for $k \ge 1$, any index algorithm transfers $C_2 \ge b(n-1)/k$ units of data.

PROOF. Similar to the proof of Proposition 2.3, the proposition follows from Proposition 2.2.
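The stand-alone bounds of Propositions 2.1 through 2.4 are simple enough to evaluate mechanically. The following is a minimal sketch; the helper names are hypothetical, the logarithm is computed with integer arithmetic to avoid floating-point rounding at exact powers, and the data bound is returned as an exact fraction because $b(n-1)/k$ need not be an integer.

```python
from fractions import Fraction


def ceil_log(n: int, base: int) -> int:
    """Smallest integer d such that base**d >= n (for n >= 1, base >= 2)."""
    d, reach = 0, 1
    while reach < n:
        reach *= base
        d += 1
    return d


def lower_bound_rounds(n: int, k: int) -> int:
    """C1 >= ceil(log_{k+1} n): one block reaches at most (k+1)**d
    processors in d rounds of the k-port model."""
    return ceil_log(n, k + 1)


def lower_bound_data(n: int, k: int, b: int) -> Fraction:
    """C2 >= b(n-1)/k: each processor must receive n-1 blocks of size b
    spread over at most k input ports."""
    return Fraction(b * (n - 1), k)
```

For example, with n = 64 processors and one port these give 6 rounds and 63 blocks' worth of data per port, the two quantities that the index algorithms of Section 3 trade against each other.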

2.3 Compound Lower Bounds for the Index Operation

Here, we provide additional lower bounds for the index operation. These lower bounds characterize the measure $C_1$ as a function of $C_2$ and vice versa. Theorems 2.5 and 2.7 show that when $C_1$ is optimized first, the lower bound on $C_2$ becomes an order of $O(\log_{k+1} n)$ higher than the stand-alone lower bound given in Proposition 2.4. Then, Theorem 2.6 shows that when $C_2$ is optimized first, the lower bound on $C_1$ becomes $(n-1)/k$ as opposed to $\log_{k+1} n$. Finally, Theorem 2.9 gives a more general lower bound for the one-port case.

THEOREM 2.5. If $n = (k+1)^d$, for some integer d, then any index algorithm that uses exactly $C_1 = \log_{k+1} n$ communication rounds must transfer at least $C_2 = \frac{bn}{k+1}\log_{k+1} n$ units of data.

PROOF. Let $n = (k+1)^d$. In order to finish the algorithm in exactly $\log_{k+1} n = d$ rounds, the number of processors having received data from a given processor, say $p_i$, must grow by a factor of k + 1 in every round. This defines a unique structure of the spanning tree $T_i$, which is rooted at $p_i$, that is a generalized version of the binomial tree used to distribute the n - 1 data blocks of processor $p_i$ among the other n - 1 processors. Denote by $l_j$ the number of processors at level j in tree $T_i$ rooted at processor $p_i$. One may use induction to show that $l_j = \binom{d}{j} k^j$. Now, the total amount of data $D_i$ that is injected into the network over the edges of the binomial tree $T_i$ rooted at $p_i$ is given by

$$D_i = b \sum_{j=0}^{d} j\, l_j = b \sum_{j=0}^{d} j \binom{d}{j} k^j = \frac{b\,d\,k\,n}{k+1},$$

where the last equality step can be derived by differentiating both sides of

$$\sum_{j=0}^{d} \binom{d}{j} k^j = (k+1)^d$$

with respect to k and then multiplying both sides by bk. Now, clearly,

$$C_2 \ge \frac{1}{kn} \sum_{i=0}^{n-1} D_i = \frac{b\,d\,n}{k+1} = \frac{bn}{k+1}\log_{k+1} n.$$

THEOREM 2.6. Any algorithm for the index operation that transfers exactly $C_2 = b(n-1)/k$ units of data from each processor requires $C_1 \ge (n-1)/k$ communication rounds.

PROOF. In the index operation, each processor has n - 1 data blocks that it needs to send to the other n - 1 processors. If each processor is allowed to transfer at most $b(n-1)/k$ units of data per port over all rounds, then it must be the case that the jth data block of processor $p_i$ is sent directly from processor $p_i$ to processor $p_j$.
(That is, each data block is sent exactly once from its source to its destination, and no processor can forward data blocks of other processors.) In this case, each processor must send n - 1 distinct messages to the other n - 1 processors. Any such algorithm must require $C_1 \ge (n-1)/k$ rounds.

THEOREM 2.7. Any index algorithm that uses $C_1 = \lceil \log_{k+1} n \rceil$ communication rounds must transfer at least $C_2 = \Omega\left(\frac{bn}{k+1}\log_{k+1} n\right)$ units of data.

PROOF. It is sufficient to prove the theorem for b = 1. Consider any algorithm that finishes the index operation in $d = C_1$ (minimum) rounds. We show that the algorithm executes a total of $\Omega(n^2 \log_{k+1} n)$ data transmissions (over all nodes); thus, there is a port that transmitted $\Omega\left(\frac{n}{k+1}\log_{k+1} n\right)$ units of data.

We first concentrate on the data distribution from a given source node v to all other n - 1 nodes. Any such algorithm can be characterized by a sequence of d + 1 sets, $S_0, S_1, \ldots, S_d$, where $S_i$ is the set of nodes that have received their respective data by the end of communication round i. Thus, $S_0 = \{v\}$, $|S_d| = n$, and $S_i$ contains $S_{i-1}$, plus nodes that received data from nodes in $S_{i-1}$ in the ith communication round. Let $x_i = |S_i|$. Clearly, $x_i \le x_{i+1} \le (k+1)x_i$, because each node in $S_i$ can send data to at most k other nodes under the k-port model.

Next, we assign weights to the nodes in the sets $S_i$, where the weight of a node u in $S_i$ represents the path length (or the number of communication rounds incurred) from v to u in achieving the data distribution. The weights can be assigned based on the following rule. If a node u appears first in $S_i$ due to a data transmission from node w in $S_{i-1}$, then the weight of u is the weight of w plus one. Note that, once a node is assigned a weight, it holds the same weight in all subsequent sets. By Lemma 2.8, we know that there are at most $\binom{j}{f} k^f$ nodes of weight f in $S_j$.

Our goal is to give a lower bound for the sum of the weights of the nodes in $S_d$. Without loss of generality, we can assume that the sum of the weights is the minimum possible. f  ej b g Let X = f = +. By the choice of, f= X < b+ g. - Let Y f =. Sice, fo  f= ej f f -f,e j e fj, f - f - Y X - = + < + b g.

Thus, the algorithm must use all the possible nodes with weights less than. To bound the sum of the weights, we need a lower bound on Fo f -, f - Z = Â ff Hf I K f. f= f ej is monotone in f. Thus, at least i. / of the nodes have weight at least - That is, $Z = \Omega(nd)$. Summing over all origins, the total number of transmissions is at least $nZ = \Omega(n^2 d)$. Thus, at least one port has a sequence of $C_2 = \Omega\left(\frac{nd}{k+1}\right) = \Omega\left(\frac{n}{k+1}\log_{k+1} n\right)$ data transmissions.

LEMMA 2.8. There are no more than $\binom{j}{f} k^f$ nodes of weight f in $S_j$ (defined in the proof of Theorem 2.7).

PROOF. We prove by induction on j. There is clearly no more than one node of weight zero and k nodes of weight one in $S_1$. Assume that the hypothesis holds for j - 1. Note that $S_j$ contains up to $\binom{j-1}{f} k^f$ nodes of weight f that appeared with the same weight in $S_{j-1}$, plus up to $k\binom{j-1}{f-1} k^{f-1}$ nodes that received data at communication round j from nodes with weight f - 1 in $S_{j-1}$. The claim holds for j since

$$\binom{j-1}{f} k^f + k\binom{j-1}{f-1} k^{f-1} = \binom{j}{f} k^f.$$

THEOREM 2.9. When k = 1, any algorithm for the index operation that uses $C_1 = O(\log n)$ communication rounds must transfer $C_2 = \Omega(bn \log n)$ units of data.

PROOF. Assume that there is an algorithm with $C_1 \le c \log n$ for some constant c. Consider the binomial coefficients $\binom{c \log n}{j}$. Let h be the minimal l such that $\sum_{j=0}^{l} \binom{c \log n}{j} \ge n$. One can show that any algorithm that finishes in $c \log n$ rounds must have the following property. For every j such that $0 \le j \le h$, there exist $\binom{c \log n}{j}$ messages from each node that travel at least j hops in the network. Notice that, in this property, each message can only be counted once for a given j. Therefore, the average number of hops a message has to travel for each node is h/2, if $h \le \log n$, or $(\log n)/2$, if $h > \log n$. Since h must be $\Omega(\log n)$ from Lemma C.1 in Appendix C, we have $C_2 = \Omega(bn \log n)$.

3 INDEX ALGORITHMS

This section presents a class of efficient algorithms for the index operation. First, we provide an overview of the algorithms. Then, we focus on the communication phase of the algorithms for the one-port model.
Next, we describe two special cases of this class of algorithms. Then, we generalize the algorithms to the k-port model. And finally, we comment on the implementation and performance of this class of algorithms.

3.1 Overview

The class of algorithms for the index operation among n processors can be represented as a sequence of processor-memory configurations. Each processor-memory configuration has n columns of n blocks each. Columns are labeled from 0 through n - 1 (from left to right in the figures) and blocks are labeled from 0 through n - 1 (from top to bottom in the figures). Column i represents processor $p_i$, and block j represents the jth data block in the memory offset. The objective of the index operation, then, is to transpose these columns of blocks. Fig. 1 shows an example of the processor-memory configurations before and after the index operation for n = 5 processors. The notation "ij" in each box represents the jth data block initially allocated to processor $p_i$. The label j is referred to as the block-id.

Fig. 1. Memory-processor configurations before and after an index operation on five processors.

All the algorithms in the class consist of three phases. Phases 1 and 3 require only local data rearrangement on each processor, while Phase 2 involves interprocessor communication.

PHASE 1. Each processor $p_i$ independently rotates its n data blocks i steps upwards in a cyclical manner.
PHASE 2. Each processor $p_i$ rotates its jth data block j steps to the right in a cyclical manner. This rotation is implemented by interprocessor communication.
PHASE 3. Each processor $p_i$ independently rotates its n data blocks i steps downwards in a cyclical manner.

Fig. 2 presents an example of these three phases of the algorithm for performing an index operation among n = 5 processors. The implementation of Phases 1 and 3 on each processor involves only local data movements and is straightforward. In the sequel, we focus only on the implementation of Phase 2. Different algorithms are derived depending on how the communication pattern of Phase 2 is decomposed into a sequence of point-to-point communication rounds.

Fig. 2. An example of memory-processor configurations for the three phases of the index operation on five processors.

3.2 The Interprocessor Communication Phase

We present the decomposition of Phase 2 into a sequence of point-to-point communication rounds, assuming the one-port model and using a parameter r (for radix) in the range $2 \le r \le n$. For convenience, we say that the block-id of the jth data block in each processor after Phase 1 is j. Consider the rotation required in Phase 2. Each block with a block-id j in processor i needs to be rotated to processor $(i + j) \bmod n$.

The block-id j, where $0 \le j \le n-1$, can be encoded using radix-r representation using $w = \lceil \log_r n \rceil$ digits. For convenience, we refer to these w digits from zero through w - 1 starting with the least significant digit. Our algorithm for Phase 2 consists of w subphases corresponding to the w digits. Each subphase consists of at most r - 1 steps, corresponding to the (up to) r - 1 different non-zero values of a given digit. In subphase x, for $0 \le x \le w-1$, we iterate Step 1 through Step r - 1, as follows:

During Step z of subphase x, where $1 \le z \le r-1$ and $0 \le x \le w-1$, all data blocks for which the xth digit of their block-id is z are rotated $z \cdot r^x$ steps to the right. This is accomplished in a communication round by a direct point-to-point communication between processor i and processor $(i + z \cdot r^x) \bmod n$, for each $0 \le i \le n-1$.

For example, when r is chosen to be 3, the fifth block will be rotated two steps to the right during Step 2 of Subphase 0, and later rotated again three steps to the right during Step 1 of Subphase 1. This follows from the fact that 5 is encoded into "12" using radix-3 representation.
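The radix-r decomposition of Phase 2 can be checked with a small sequential simulation of the one-port model. The sketch below is an illustration, not the pseudocode of Appendix A: the name `index_radix` is hypothetical, blocks are opaque Python values of unit size, and each communication round is modeled by copying the selected block-ids from every processor i to processor (i + z·r^x) mod n. One point is an assumption of this sketch: the final local rearrangement is written as the permutation that moves the block at offset j of processor i to offset (i - j) mod n, because that is what makes the simulated result equal the block transpose; it may differ in detail from the local rearrangement the authors intend for Phase 3.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def index_radix(blocks: Sequence[Sequence[T]], r: int) -> Tuple[List[List[T]], int, int]:
    """Simulate the three-phase index algorithm (one-port model, radix r).

    blocks[i][j] models B[i, j]. Returns (result, rounds, data), where
    result[i][j] == blocks[j][i], rounds counts communication rounds (C1),
    and data sums the largest message of each round in blocks (C2 with b = 1).
    """
    n = len(blocks)
    if not 2 <= r <= n:
        raise ValueError("radix r must satisfy 2 <= r <= n")

    # Phase 1: processor i rotates its n blocks i steps upwards.
    mem = [[blocks[i][(j + i) % n] for j in range(n)] for i in range(n)]

    rounds = 0
    data = 0
    step = 1  # r**x for the current subphase x
    while step < n:  # w = ceil(log_r n) subphases
        for z in range(1, r):
            # Block-ids whose x-th radix-r digit equals z.
            ids = [j for j in range(n) if (j // step) % r == z]
            if not ids:
                continue  # no block has this digit value; no round needed
            shift = z * step
            new_mem = [row[:] for row in mem]
            for i in range(n):
                dst = (i + shift) % n  # one message from i to dst
                for j in ids:
                    new_mem[dst][j] = mem[i][j]
            mem = new_mem
            rounds += 1
            data += len(ids)  # every processor sends the same number of blocks
        step *= r

    # Final local rearrangement (assumption of this sketch, see lead-in):
    # offset j of processor i now holds B[(i - j) mod n, i].
    result = [[mem[i][(i - m) % n] for m in range(n)] for i in range(n)]
    return result, rounds, data
```

On five processors, r = 2 finishes in three rounds and r = 5 in four rounds, while the amount of data moved per port is five blocks for r = 2 and four blocks for r = 5, which is the trade-off the radix parameter is meant to expose.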
Note that, after w subphases, all data blocks have been rotated to the correct destination processor as specified by the processor id. However, data blocks are not necessarily in their correct memory locations. Phase 3 of the algorithm fixes this problem.

The following points are made regarding the performance of this algorithm.

Each step can be realized by a single communication round by packing all the outgoing blocks to the same destination into a temporary array and sending them together in one message. Hence, each subphase can be realized in at most r - 1 communication rounds.

The size of each message involved in a communication round is at most $b\lceil n/r \rceil$ data.

Hence, the class of the index algorithms has complexity measures $C_1 \le (r-1)\lceil \log_r n \rceil$ and $C_2 \le b(r-1)\lceil n/r \rceil \lceil \log_r n \rceil$, where r is chosen in the range $2 \le r \le n$.

3.3 Two Special Cases

The class of algorithms for the index operation in the one-port model contains two interesting special cases:

1) When r = 2, the derived algorithm requires

$C_1 = \lceil \log_2 n \rceil$ communication rounds, which is optimal with respect to the measure $C_1$. Also, in this case, $C_2 \le b\lceil n/2 \rceil \lceil \log_2 n \rceil$, which is optimal (to within a multiplicative factor) for the case when $C_1 = \lceil \log_2 n \rceil$. Fig. 3 shows such an example with r = 2 and n = 5. The shaded data blocks are the ones subject to rotation during the next subphase.

Fig. 3. An example of memory-processor configurations for the index algorithm on five processors, which has an optimal $C_1$ measure.

2) When r = n, the derived algorithm transfers $C_2 = b(n-1)$ units of data from each node, which is optimal with respect to the measure $C_2$. The value of $C_1$ in this case is $C_1 = n-1$, which is optimal for the case when $C_2 = b(n-1)$.

Hence, r = 2 should be chosen when the start-up time of the underlying machine is relatively significant, and the product of the block size b and the per-element transfer time is relatively small. On the other hand, r = n should be chosen when the start-up time is negligible. In general, r can be fine-tuned according to the parameters of the underlying machines to balance between the start-up time and the data transfer time.

3.4 Generalization to the k-Port Model

We now present a modification to the index algorithm above for the k-port model. Phase 1 and Phase 3 of the algorithm remain the same. In Phase 2, we still have $w = \lceil \log_r n \rceil$ subphases as before, corresponding to the w digits in radix-r representation of any block-id j, where $0 \le j \le n-1$. In each subphase, there are, at most, r - 1 independent point-to-point communication steps that need to be performed. Since these point-to-point communication steps are independent, they can be performed in parallel, subject to the constraint on the number of parallel input/output ports. Thus, every k of these communication steps can be grouped together and performed concurrently. Therefore, each subphase consists of at most $\lceil (r-1)/k \rceil$ communication steps. The complexity measures for the index algorithm under the k-port model, therefore, are $C_1 \le \lceil (r-1)/k \rceil \lceil \log_r n \rceil$ and $C_2 \le b\lceil (r-1)/k \rceil \lceil n/r \rceil \lceil \log_r n \rceil$, where r can be chosen in the range $2 \le r \le n$. To minimize both $C_1$ and $C_2$, one clearly needs to choose r such that $(r-1) \bmod k = 0$.
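Tuning r for a given machine amounts to plugging the k-port measures into the linear model $T = C_1\beta + C_2\tau$ and scanning the candidate radices. The sketch below assumes the bounds take the form $C_1 \le \lceil (r-1)/k \rceil \lceil \log_r n \rceil$ and $C_2 \le b\lceil (r-1)/k \rceil \lceil n/r \rceil \lceil \log_r n \rceil$ and treats them as estimates of the actual cost; the function names and the tie-breaking rule (prefer the smaller radix) are hypothetical choices made for illustration, not part of the paper.

```python
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def ceil_log(n: int, base: int) -> int:
    """Smallest integer w such that base**w >= n."""
    w, reach = 0, 1
    while reach < n:
        reach *= base
        w += 1
    return w


def index_cost_bounds(n: int, k: int, b: int, r: int) -> Tuple[int, int]:
    """(C1, C2) estimates for the radix-r index algorithm in the k-port model."""
    steps = ceil_div(r - 1, k) * ceil_log(n, r)  # rounds over all subphases
    return steps, b * ceil_div(n, r) * steps


def estimated_time(n: int, k: int, b: int, r: int, beta: float, tau: float) -> float:
    """Linear-model estimate T = C1 * beta + C2 * tau."""
    c1, c2 = index_cost_bounds(n, k, b, r)
    return c1 * beta + c2 * tau


def best_radix(n: int, k: int, b: int, beta: float, tau: float) -> int:
    """Radix in [2, n] with the smallest estimated time (ties: smaller r)."""
    return min(range(2, n + 1),
               key=lambda r: (estimated_time(n, k, b, r, beta, tau), r))
```

With the transfer cost set to zero the scan returns r = 2 for n = 64 and k = 1, and with the start-up cost set to zero it returns r = n, matching the two special cases of Section 3.3; realistic values of beta and tau would have to be measured on the target machine.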
3.5 Implementation
We have implemented the one-port version ($k = 1$) of the index algorithm on an IBM SP-1 parallel system. (The IBM SP-1 is close to the one-port model in the domain of the multiport model.) The implementation is done on top of the point-to-point message-passing external user interface (EUI), running on the EUIH environment. At this level, the communication start-up, $\beta$, measures about 29 μsec, and the
sustained point-to-point communication bandwidth is about 8.5 Mbytes/sec, i.e., $\tau \approx 0.12$ μsec/byte.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997

Fig. 4 shows the measured times of the index algorithm as a function of message size with various power-of-two radices on a 64-node SP-1. As can be seen, the smaller radix tends to perform better for smaller message sizes, and vice versa. Fig. 5 compares the measured times of the index algorithm with $r = 2$, $r = n = 64$, and optimal r among all power-of-two radices, respectively, on a 64-node SP-1. The break-even point of the message size between the two special cases of the index algorithms (i.e., $r = 2$ and $r = n$) occurs at about 100 to 200 bytes. The index algorithm with optimal power-of-two radix, as expected, is the best overall choice. Fig. 6 shows the measured times of the index algorithm as a function of radix for three different message sizes: 32 bytes, 64 bytes, and 128 bytes. As the message size increases, the minimal time of the curve tends to occur at a higher radix.

Fig. 4. The measured time of the index algorithm as a function of message sizes on a 64-node SP-1.

Fig. 6. The measured times of the index algorithm as a function of radix for various message sizes on a 64-node SP-1.

When comparing these measured times with our predicted times based on the linear model, we find big discrepancies quantitatively, but relatively consistent behavior qualitatively. Note that we are mainly interested in the qualitative behavior of the index algorithm on a general message-passing system. We believe the quantitative differences between the measured times and the predicted times are due to the following factors:
1) There are various system routines running in the background that have a higher priority than the user processes.
2) We do not model the copy time incurred by the functions copy, pack, and unpack (see the pseudocode in Appendix A).
3) We do not model the congestion behavior of the SP-1.
4) There is a slowdown factor, somewhere between one and two, from the linear model to the send_and_receive model.
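To make the radix trade-off concrete, the sketch below evaluates a linear cost model of the form $T = C_1 t_s + C_2 t_c$ and picks the power-of-two radix with the smallest estimate. Several things here are assumptions for illustration only: the function names, the use of the one-port upper bounds $C_1 \le (r-1)\lceil \log_r n \rceil$ and $C_2 \le b(r-1)\lceil n/r \rceil \lceil \log_r n \rceil$ in place of exact counts, and the default values of 29 μsec for $t_s$ and 0.12 μsec/byte for $t_c$. Given the factors listed above, the output should be read qualitatively rather than as a prediction of the measured curves.

```python
def ceil_log(n, r):
    """Smallest integer w with r**w >= n, using exact integer arithmetic."""
    w, power = 0, 1
    while power < n:
        power *= r
        w += 1
    return w


def predicted_time(n, r, b, t_s=29.0, t_c=0.12):
    """Linear-model estimate (microseconds) for the one-port index algorithm.

    Uses the upper bounds C1 <= (r-1)*ceil(log_r n) and
    C2 <= b*(r-1)*ceil(n/r)*ceil(log_r n).  The defaults for t_s and t_c
    are assumed values, not fitted parameters.
    """
    w = ceil_log(n, r)
    c1 = (r - 1) * w
    c2 = b * (r - 1) * -(-n // r) * w  # -(-n // r) is ceil(n / r)
    return c1 * t_s + c2 * t_c


def best_power_of_two_radix(n, b, t_s=29.0, t_c=0.12):
    """Return the power-of-two radix in [2, n] with the smallest estimate."""
    radices = []
    r = 2
    while r <= n:
        radices.append(r)
        r *= 2
    return min(radices, key=lambda radix: predicted_time(n, radix, b, t_s, t_c))
```

With these assumed parameters and n = 64, the estimate favors r = 2 for one-byte blocks and r = n = 64 for very large blocks, which agrees with the qualitative trend that a smaller radix suits smaller messages.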
If we model the congestion behavior as a fixed multiplicative factor of $t_c$ and assume the system routines have a fixed slowdown factor of the overall time, then the total time for the index operation can be modeled as $T = g_1 C_1 t_s + g_2 C_2 t_c + g_3$.

Fig. 5. The measured times of the index algorithm with $r = 2$, $r = n = 64$, and optimal r among all power-of-two radices, respectively, on a 64-node SP-1.

4 CONCATENATION ALGORITHMS
There are two known algorithms for the concatenation operation in the one-port model. The first is a simple folklore algorithm which consists of two phases. In the first phase, the n blocks of data from the n processors are accumulated to a designated processor, say processor $p_0$. This can be done using a binomial tree (or a subtree of it when n is not a power of two). In the second phase, the concatenation result from processor $p_0$ is broadcast to the n processors using the same binomial tree. This algorithm is not optimal since it consists of $C_1 = 2\lceil \log_2 n \rceil$ communication rounds and
transfers $C_2 = 2b(n-1)$ units of data. The second known concatenation algorithm is for the case when n is a power of two and $k = 1$ (see []). This algorithm is based on the structure of a binary hypercube and is optimal in both $C_1$ and $C_2$. For a given k, this algorithm can be generalized to the case where n is a power of $k + 1$ by using the structure of a generalized hypercube [4]. However, for general values of n, we do not know of any existing concatenation algorithm that is optimal in both $C_1$ and $C_2$, even when $b = k = 1$.

In this section, we present efficient concatenation algorithms for the k-port communication model that, in most cases of n and k, are optimal in both $C_1$ and $C_2$. Throughout this section, we assume that k is in the range $1 \le k \le n - 2$. Notice that, for $k \ge n - 1$, the trivial algorithm that takes a single round is optimal.

The main structure that we use for deriving the algorithms is that of circulant graphs. We note here that circulant graphs are also useful in constructing fault-tolerant networks [7].

DEFINITION. A circulant graph $G(n, S)$ is characterized by two parameters: the number of nodes n, and a set of offsets S. In $G(n, S)$, the n nodes are labeled from 0 through $n - 1$, and each node i is connected to node $((i - s) \bmod n)$ and to node $((i + s) \bmod n)$ for all $s \in S$ (see [11]).

The concatenation algorithm consists of d rounds. Let $d = \lceil \log_{k+1} n \rceil$, that is, $(k+1)^{d-1} < n \le (k+1)^d$. Also let $n = n_1 + n_2$, where $n_1 = (k+1)^{d-1}$ and $n_2 = n - n_1$. The rounds of the algorithm can be divided into two phases. The first phase consists of $d - 1$ rounds, at the end of which every node has the concatenation result of the $n_1$ nodes that precede it in the numbering (in a circular sense). The second phase consists of a single round and completes the concatenation operation among the n nodes.

4.1 The First d - 1 Rounds
For the first $d - 1$ rounds, we use a circulant graph $G(n, S)$, where $S = S_0 \cup S_1 \cup \cdots \cup S_{d-2}$ and $S_i = \{(k+1)^i, 2(k+1)^i, \ldots, k(k+1)^i\}$. We identify the processors with the nodes of $G(n, S)$, which are labeled from 0 through $n - 1$. The communication pattern related to broadcasting the data item of each node can be described by a spanning tree. Let $T_i$ denote the spanning tree associated with the data item B[i] of node i (namely, $T_i$ is rooted at node i). We describe the spanning tree associated with each node by specifying the edges that are used in every communication round. The edges associated with round i are called round-i-edges.

First, we describe the tree $T_0$, and then we show how tree $T_i$, for $1 \le i \le n - 1$, can be derived from tree $T_0$. We start with an initial tree $T_0$ which consists only of node 0. In round 0, we add edges with offsets in $S_0$ to $T_0$ to form a partial spanning tree; the added edges are the round-0-edges. (That is, in round 0, we add the set of edges $\{(0, 1), (0, 2), \ldots, (0, k)\}$.) In general, in round i, where $0 \le i \le d - 2$, we add edges with offsets in $S_i$ to the current partial spanning tree to form a new larger partial spanning tree. It is easy to verify that, after $d - 1$ rounds, the resulting tree spans the first $n_1$ nodes starting from node 0, namely, nodes 0 through $n_1 - 1$. Fig. 7 illustrates the process of constructing $T_0$ for the case of $k = 2$ and $n_1 = 9$.

Fig. 7. The two rounds in constructing the spanning tree rooted at node 0 for $n_1 = 9$ and $k = 2$.

Next, we use tree $T_0$ to construct the spanning trees $T_i$, for $1 \le i \le n - 1$. We do this by translating each node j in $T_0$ to node $(j + i) \bmod n$ in $T_i$. Also, the round id associated with each tree edge in $T_i$ (which represents the round during which the corresponding communication is performed) is the same as that of the corresponding tree edge in $T_0$. Fig. 8 illustrates tree $T_1$ for the case of $k = 2$ and $n_1 = 9$. It is easy to see that $T_1$ was obtained from $T_0$ by adding one (modulo nine) to the labels of the nodes in $T_0$.

Fig. 8. The two rounds in constructing the spanning tree rooted at node 1 for $n_1 = 9$ and $k = 2$. They can be derived by translating node addresses of the spanning tree rooted at node 0 in Fig. 7.

The concatenation algorithm in each node is specified by the trees $T_i$, for $0 \le i \le n - 1$, as follows:
In round i, for $0 \le i \le d - 2$, do:
- For all $0 \le j \le n - 1$, if data item B[j] is present at the node, then send it on all round-i-edges of tree $T_j$.
- Receive the corresponding data items on the round-i-edges of all the n trees.

THEOREM 4.1. After $d - 1$ rounds of the above algorithm, every node i, for $0 \le i \le n - 1$, has the $n_1$ data items B[j], where $i - n_1 + 1 \le j \le i \pmod{n}$. Also, during these $d - 1$ rounds, the measure $C_2$ is optimal:
$C_2 = \frac{b}{k}(n_1 - 1)$.

PROOF. The spanning trees $T_i$, for $1 \le i \le n - 1$, are derived from $T_0$ by shifting the indices in a cyclic manner. Hence, it suffices to focus on the spanning tree $T_0$. Notice that the algorithm can be implemented in a k-port model, since, in every round i, we use only the set of offsets $S_i$, which consists of k offsets. Also, the tree $T_0$ is a spanning tree for the nodes $p_i$, where $0 \le i \le n_1 - 1$, because every i in this range can be represented using a set of distinct offsets from S. Hence, after $d - 1$ rounds of the algorithm, the data items are distributed according to the claim of the theorem.

Next, we need to prove that $C_2$ associated with the $d - 1$ rounds is as claimed. By induction on i, it follows that, before round i, any node has at most $(k+1)^i$ distinct data items. Hence, in round i, any node sends at most $(k+1)^i$ data items on any given edge. Thus,
$C_2 \le b\left(1 + (k+1) + (k+1)^2 + \cdots + (k+1)^{d-2}\right) = \frac{b}{k}(n_1 - 1)$.
However, by the lower bound argument, we have $C_2 \ge \frac{b}{k}(n_1 - 1)$, and the claim follows.

4.2 The Last Round
Before round $d - 1$, the last round of the algorithm, we have the following situation: Every node i had broadcast its message to the $n_1 - 1$ nodes succeeding it in the circular graph and had received the broadcast message from the $n_1 - 1$ nodes preceding it in the circular graph. Consider tree $T_0$ just before the last round. The first $n_1$ nodes (nodes 0 through $n_1 - 1$) are included in the current tree, and the remaining $n_2$ nodes still need to be spanned. We bring the following proposition.

PROPOSITION 4.2. The last round can be performed with $C_2 = \lceil b n_2 / k \rceil$, for any combination of n, b, and k, except for the following range: $b \ge 3$, $k \ge 3$, and $(k+1)^d - k < n < (k+1)^d$, for some d.

The proof of this proposition is somewhat complicated, and we only give the main ideas here. The basic idea is to transform the scheduling problem for the last round of the algorithm into a table partitioning problem. (In the sense that, if the table partitioning problem can be solved, then we have an optimal algorithm by deriving an optimal schedule for the last round.) The table partitioning problem is defined as follows. Let $\alpha = \lceil b n_2 / k \rceil$.
Given a table of b rows and $n_2$ columns, we would like to partition the table into k disjoint areas, denoted by $A_1, A_2, \ldots, A_k$, such that
1) the column-span of $A_i$, for all i, is at most $n_1$, where the column-span of $A_i$ is defined as $R_i - L_i + 1$ if $R_i$ and $L_i$ are the rightmost and leftmost columns, respectively, touched by $A_i$; and
2) the number of table entries in $A_i$, for all i, is at most $\alpha$.

If a solution can be found to the table-partitioning problem, then a schedule for the last round can be derived as follows. Each of the $n_2$ table columns corresponds to one of the $n_2$ nodes yet to be spanned, and each of the b table rows represents one byte. Table elements in the same area, say $A_i$, will use the same offset, which is determined by the index of the leftmost column touched by $A_i$. It can be shown that a straightforward algorithm for partitioning the table satisfies the above two conditions for any combination of n, b, and k, except for the following range: $b \ge 3$, $k \ge 3$, and $(k+1)^d - k < n < (k+1)^d$, for some d.

For instance, Table 1 presents a partitioning example for $n_1 = 3$, $n_2 = 7$, $b = 3$, and $k = 3$, which fall in the optimal range of n. The area covered by $A_i$ is marked by the number i. From this table, one can derive the following scheduling for the last round:
1) The sum of the weighted edges with offset 3 (in area $A_1$) is 7. Thus, node $p_3$ receives three bytes from $p_0$, node $p_4$ receives three bytes from $p_1$, and node $p_5$ receives one byte from $p_2$.
2) The sum of the weighted edges with offset 5 (in area $A_2$) is 7. Thus, node $p_5$ receives two bytes from $p_0$, node $p_6$ receives three bytes from $p_1$, and node $p_7$ receives two bytes from $p_2$.
3) The sum of the weighted edges with offset 7 (in area $A_3$) is 7. Thus, node $p_7$ receives one byte from $p_0$, node $p_8$ receives three bytes from $p_1$, and node $p_9$ receives three bytes from $p_2$.

After rotation, to generate n spanning trees, each of which is rooted at a different node, each node i needs to send seven bytes to nodes $(i + 3) \bmod n$, $(i + 5) \bmod n$, and $(i + 7) \bmod n$, and receive seven bytes from nodes $(i - 3) \bmod n$, $(i - 5) \bmod n$, and $(i - 7) \bmod n$.

THEOREM 4.3. The above concatenation algorithm attains optimal $C_1 = \lceil \log_{k+1} n \rceil$ and $C_2 = \lceil b(n-1)/k \rceil$ for any combination of n, b, and k, except for the following range: $b \ge 3$, $k \ge 3$, and $(k+1)^d - k < n < (k+1)^d$, for some integer d.

PROOF. By combining Theorem 4.1
and Proposition 4.2, we have $C_2 = \frac{b}{k}(n_1 - 1) + \lceil b n_2 / k \rceil = \lceil b(n-1)/k \rceil$, which matches the lower bound of $C_2$ in Proposition 2.2.

TABLE 1. AN EXAMPLE OF THE TRANSFORMED PROBLEM FOR $n_1 = 3$ ($p_0$ THROUGH $p_2$), $n_2 = 7$ ($p_3$ THROUGH $p_9$), $b = 3$ (BYTES), AND $k = 3$ (PORTS)

Fig. 9 presents an example of the concatenation algorithm for $k = 1$ and $n = 5$. Note that, to simplify the pseudocode included in Appendix B, we actually grow the spanning tree $T_i$ using negative offsets. That is, in both the figure
and in the pseudocode, left-rotations are performed instead of right-rotations.

Fig. 9. An example of the one-port concatenation algorithm with five processors.

REMARK. For the nonoptimal range of n, it is easy to achieve optimal $C_2$ at the expense of increasing $C_1$ by one round over the lower bound. It is also easy to achieve optimal $C_1$ and suboptimal $C_2$, where $C_2$ is at most $b - 1$ more than the lower bound.

APPENDIX A
PSEUDOCODE FOR THE INDEX ALGORITHM
This appendix presents pseudocode for the index algorithm of Section 3 when $k = 1$. This pseudocode sketches the implementation of the index operation in the Collective Communication Library of the EUI [2] by IBM. In the pseudocode, the function index takes six arguments: outmsg is an array for the outgoing message; blklen is the length in bytes of each data block; inmsg is an array for the incoming message; n is the number of processors involved; A is the array of the n different processor ids, such that A[i] = $p_i$; and r is the radix used to tune the algorithm. Arrays outmsg and inmsg are each of length blklen * n bytes.

Other routines that appear in the code are as follows:
- Routine copy(A, B, len) copies array A of size len bytes into array B.
- Routine getrank(id, n, A) returns the index i that satisfies A[i] = id.
- The routine mod(x, y) returns the value x mod y in the range of 0 through y - 1, even for negative x.
- The function send_and_recv takes six arguments: the outgoing message; the size of the outgoing message; the destination of the outgoing message; the incoming message; the size of the incoming message; and the source of the incoming message.

The function send_and_recv is supported by IBM's Message Passing Library (MPL) [1] on SP-1 and SP-2, and the recent MPI standard [24]. It can also be implemented as a combination of blocking send and nonblocking receive.

In the following pseudocode, lines 3 and 4 correspond to Phase 1, lines 5 through 20 correspond to Phase 2, and lines 21 through 23 correspond to Phase 3. In Phase 2, there are w subphases, which are indexed by i.
During each subphase, each processor needs to perform the send_and_recv operation $r - 1$ times, except for the last subphase, where each processor performs the send_and_recv operation only $\lceil n / r^{w-1} \rceil - 1$ times. Lines 7 through 11 take into account the special case for the last subphase. The routine pack is used to pack those blocks that need to be rotated to the same intermediate destination into a consecutive array. Specifically, pack(A, B, blklen, n, r, i, j, nblocks) packs some selected blocks of array A into array B; each block is of size blklen in bytes; those blocks for which the ith digit of the radix-r representation of their block id is equal to j are selected for packing; and the number of selected blocks is written to the argument nblocks. The routine unpack(A, B, blklen, n, r, i, j, nblocks) is defined as the inverse function of
pack, where B becomes the input array to be unpacked and A becomes the output array.

Function index (outmsg, blklen, inmsg, n, A, r)
(1)  w = ⌈log_r n⌉
(2)  my_rank = getrank (my_pid, n, A)
(3)  copy (outmsg, tmp [(n - my_rank) * blklen], my_rank * blklen)
(4)  copy (outmsg [my_rank * blklen], tmp, (n - my_rank) * blklen)
(5)  dist = 1
(6)  for i = 0 to w - 1 do
(7)      if (i == w - 1) then
(8)          h = ⌈n / dist⌉
(9)      else
(10)         h = r
(11)     endif
(12)     for j = 1 to h - 1 do
(13)         dest_rank = mod (my_rank + j * dist, n)
(14)         src_rank = mod (my_rank - j * dist, n)
(15)         pack (tmp, packed_msg, blklen, n, r, i, j, nblocks)
(16)         send_and_recv (packed_msg, blklen * nblocks, A [dest_rank], packed_msg, blklen * nblocks, A [src_rank])
(17)         unpack (tmp, packed_msg, blklen, n, r, i, j, nblocks)
(18)     endfor
(19)     dist = dist * r
(20) endfor
(21) for i = 0 to n - 1 do
(22)     copy (tmp [mod (my_rank - i, n) * blklen], inmsg [i * blklen], blklen)
(23) endfor
(24) return

APPENDIX B
PSEUDOCODE FOR THE CONCATENATION ALGORITHM
This appendix presents pseudocode for the concatenation algorithm of Section 4 when $k = 1$. This pseudocode sketches the implementation of the concatenation operation in the Collective Communication Library of the EUI [2] by IBM. In this pseudocode, the function concat takes five arguments: outmsg is an array for the outgoing message; len is the length in bytes of array outmsg; inmsg is an array for the incoming message; n is the number of processors involved; and A is the array of the n different processor ids, such that A[i] = $p_i$. Array inmsg is of length len * n bytes. The function concat sends and receives messages using the send_and_recv routine. The routines copy, getrank, send_and_recv, and mod were defined in Appendix A.

In the following pseudocode, each processor first initializes some variables and copies its outmsg array into a temporary array temp (lines 1 through 5). Then, each processor performs the first $d - 1$ rounds of the algorithm (lines 6 through 12). Then, each processor performs the last round of the algorithm (lines 13 through 16). Finally, each processor performs a local circular shift of the data such that all data blocks in its inmsg array begin with the block B[0] (lines 17 and 18).
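Before the pseudocode itself, the round structure of the one-port concatenation algorithm (k = 1) can be checked with a small single-process simulation. This is a sketch only: `simulate_concat` and its list-based representation are hypothetical, and message passing is replaced by direct list copies. As in the pseudocode, each processor sends toward lower ranks and receives from higher ranks.

```python
def simulate_concat(blocks):
    """Simulate the one-port concatenation (all-to-all broadcast).

    blocks[p] is the data block initially held by processor p.
    Returns (result, rounds): result[p] lists all n blocks as assembled at
    processor p, in processor order; rounds is the number of send/receive
    rounds used.  Hypothetical helper, not the EUI/CCL code.
    """
    n = len(blocks)
    d = (n - 1).bit_length()  # ceil(log2(n)) for n >= 1
    # Invariant: temp[p][j] holds the block of processor (p + j) mod n.
    temp = [[blocks[p]] for p in range(n)]
    nblk = 1  # number of blocks currently held by each processor
    rounds = 0

    # First d - 1 rounds: send everything held to rank p - nblk and append
    # what arrives from rank p + nblk, doubling the amount held.
    for _ in range(d - 1):
        temp = [temp[p] + temp[(p + nblk) % n] for p in range(n)]
        nblk *= 2
        rounds += 1

    # Last round: only the n - nblk blocks still missing are transferred.
    missing = n - nblk
    if missing > 0:
        temp = [temp[p] + temp[(p + nblk) % n][:missing] for p in range(n)]
        rounds += 1

    # Local circular shift so that every result starts with block B[0].
    result = [[temp[p][(j - p) % n] for j in range(n)] for p in range(n)]
    return result, rounds
```

For five processors the simulation finishes in three rounds, and every processor ends with the blocks in order B[0] through B[4], as in the example of Fig. 9.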
Function concat (outmsg, len, inmsg, n, A)
(1)  d = ⌈log_2 n⌉
(2)  my_rank = getrank (my_pid, n, A)
(3)  copy (outmsg, temp, len)
(4)  nblk = 1
(5)  current_len = len
(6)  for r = 0 to d - 2 do
(7)      dest_rank = mod (my_rank - nblk, n)
(8)      src_rank = mod (my_rank + nblk, n)
(9)      send_and_recv (temp, current_len, A [dest_rank], temp [current_len], current_len, A [src_rank])
(10)     nblk = nblk * 2
(11)     current_len = current_len * 2
(12) endfor
(13) current_len = len * (n - nblk)
(14) dest_rank = mod (my_rank - nblk, n)
(15) src_rank = mod (my_rank + nblk, n)
(16) send_and_recv (temp, current_len, A [dest_rank], temp [len * nblk], current_len, A [src_rank])
(17) copy (temp, inmsg [len * my_rank], len * (n - my_rank))
(18) copy (temp [len * (n - my_rank)], inmsg, len * my_rank)
(19) return

APPENDIX C
PROOF OF A LEMMA
LEMMA C.1. Let c and m be integers such that $2 \le c \le m$. Then, if $\sum_{j=0}^{h} \binom{cm}{j} \ge 2^m$, then $h \ge \min\left(m/64,\ m/(8 \log c)\right)$.

PROOF. Assume, for the sake of contradiction, that the lemma does not hold. First, note that the lemma holds if $h \ge m/64$, so it must be the case that $h < m/64$. Also, note that $\binom{cm}{0} = 1 < 2^m$, so $h \ge 1$ and $m > 64$. Therefore, $h + 1 \le m \le cm$. Because $h < m/64 \le cm/8$, the terms in the summation $\sum_{j=0}^{h} \binom{cm}{j}$ are monotonically increasing, so
$2^m \le \sum_{j=0}^{h} \binom{cm}{j} \le (h+1)\binom{cm}{h} \le (h+1)\,\frac{(cm)^h}{h!}$.
Note that $h! \ge h^{h+1/2}/e^h$, so
$m \le \log(h+1) + h \log(cm) + h \log e - (h + 1/2) \log h \le (h+1) \log(cm) + h \log e - h \log h$.
Because $h < m/64 \le m/(2 \log e)$, $m/2 \le h(\log(cm) - \log h) + \log(cm)$. Because $\log(cm) \le 2 \log m \le m/4$, it follows that $m/4 \le h(\log(cm) - \log h)$. Let $h = m/x$ and note that $x > 64$, so
$m/4 \le (m/x)(\log c + \log x)$, which implies that $x \le 4 \log c + 4 \log x$ and $x - 4 \log x \le 4 \log c$. Note that $x \ge 8 \log x$, so $x/2 \le x - 4 \log x \le 4 \log c$. Therefore, $x \le 8 \log c$ and $h = m/x \ge m/(8 \log c)$, which is a contradiction.

ACKNOWLEDGMENTS
We thank Robert Cypher for his help in deriving Lemma C.1. Jehoshua Bruck was supported in part by U.S. National Science Foundation Young Investigator Award CCR-9457811, by the Sloan Research Fellowship, and by DARPA and BMDO through an agreement with NASA/OSAT.

REFERENCES
[1] V. Bala, J. Bruck, R. Bryant, R. Cypher, P. deJong, P. Elustondo, D. Frye, A. Ho, C.-T. Ho, G. Irwin, S. Kipnis, R. Lawrence, and M. Snir, "The IBM External User Interface for Scalable Parallel Systems," Parallel Computing, vol. 20, no. 4, pp. 445-462, Apr. 1994.
[2] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir, "CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers," IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 2, pp. 154-164, Feb. 1995.
[3] A. Bar-Noy and S. Kipnis, "Designing Broadcasting Algorithms in the Postal Model for Message-Passing Systems," Mathematical Systems Theory, vol. 27, no. 5, pp. 431-452, Sept./Oct. 1994.
[4] L. Bhuyan and D. Agrawal, "Generalized Hypercube and Hyperbus Structures for a Computer Network," IEEE Trans. Computers, vol. 33, no. 4, pp. 323-333, Apr. 1984.
[5] S. Bokhari, "Multiphase Complete Exchange on a Circuit-Switched Hypercube," Proc. 1991 Int'l Conf. Parallel Processing, vol. I, pp. 525-528, Aug. 1991.
[6] J. Bruck, R. Cypher, L. Gravano, A. Ho, C.-T. Ho, S. Kipnis, S. Konstantinidou, M. Snir, and E. Upfal, "Survey of Routing Issues for the Vulcan Parallel Computer," IBM Research Report, RJ-8839, June 1992.
[7] J. Bruck, R. Cypher, and C.-T. Ho, "Fault-Tolerant Meshes and Hypercubes with Minimal Numbers of Spares," IEEE Trans. Computers, vol. 42, no. 9, pp. 1,089-1,104, Sept. 1993.
[8] C.Y. Chu, "Comparison of Two-Dimensional FFT Methods on the Hypercubes," Proc. Third Conf. Hypercube Concurrent Computers and Applications, pp. 1,430-1,437, 1988.
[9] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E.
Santos, R. Subramonian, and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation," Proc. Fourth SIGPLAN Symp. Principles and Practices Parallel Programming, ACM, May 1993.
[10] W.J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler, "The J-Machine: A Fine-Grain Concurrent Computer," Proc. Information Processing 89, pp. 1,147-1,153, 1989.
[11] B. Elspas and J. Turner, "Graphs with Circulant Adjacency Matrices," J. Combinatorial Theory, no. 9, pp. 297-307, 1970.
[12] G. Fox, M. Johnsson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, Vol. I. Prentice Hall, 1988.
[13] P. Fraigniaud and E. Lazard, "Methods and Problems of Communication in Usual Networks," Discrete Applied Math., vol. 53, pp. 79-133, 1994.
[14] G.A. Geist, M.T. Heath, B.W. Peyton, and P.H. Worley, "A User's Guide to PICL: A Portable Instrumented Communication Library," ORNL Technical Report no. ORNL/TM-11616, Oct. 1990.
[15] G.A. Geist and V.S. Sunderam, "Network Based Concurrent Computing on the PVM System," ORNL Technical Report no. ORNL/TM-11760, June 1991.
[16] S.M. Hedetniemi, S.T. Hedetniemi, and A.L. Liestman, "A Survey of Gossiping and Broadcasting in Communication Networks," Networks, vol. 18, pp. 319-349, 1988.
[17] R. Hempel, "The ANL/GMD Macros (PARMACS) in FORTRAN for Portable Parallel Programming Using the Message Passing Programming Model, User's Guide and Reference Manual," technical memorandum, Gesellschaft für Mathematik und Datenverarbeitung mbH, West Germany.
[18] C.-T. Ho and M.T. Raghunath, "Efficient Communication Primitives on Hypercubes," Concurrency: Practice and Experience, vol. 4, no. 6, pp. 427-458, Sept. 1992.
[19] S.L. Johnsson and C.-T. Ho, "Matrix Multiplication on Boolean Cubes Using Generic Communication Primitives," Parallel Processing and Medium-Scale Multiprocessors, A. Wouk, ed., pp. 108-156. SIAM, 1989.
[20] S.L. Johnsson and C.-T. Ho, "Spanning Graphs for Optimum Broadcasting and Personalized Communication in Hypercubes," IEEE Trans. Computers, vol. 38, no. 9, pp. 1,249-1,268, Sept. 1989.
[21] S.L. Johnsson and C.-T. Ho, "Optimizing Tridiagonal Solvers for Alternating Direction Methods on Boolean Cube Multiprocessors," SIAM J. Scientific and Statistical Computing, vol. 11, no. 3, pp. 563-592, 1990.
[22] S.L.
Johnsson, C.-T. Ho, M. Jacquemin, and A. Ruttenberg, "Computing Fast Fourier Transforms on Boolean Cubes and Related Networks," Advanced Algorithms and Architectures for Signal Processing II, vol. 826, pp. 223-231. Soc. Photo-Optical Instrumentation Engineers, 1987.
[23] O.A. McBryan and E.F. Van de Velde, "Hypercube Algorithms and Implementations," SIAM J. Scientific and Statistical Computing, vol. 8, no. 2, pp. 227-287, Mar. 1987.
[24] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard," May 1994.
[25] J.F. Palmer, "The NCUBE Family of Parallel Supercomputers," Proc. Int'l Conf. Computer Design, 1986.
[26] F.P. Preparata and J.E. Vuillemin, "The Cube-Connected Cycles: A Versatile Network for Parallel Computation," Comm. ACM, vol. 24, no. 5, pp. 300-309, May 1981.
[27] A. Skjellum and A.P. Leung, "Zipcode: A Portable Multicomputer Communication Library Atop the Reactive Kernel," Proc. Fifth Distributed Memory Computing Conf., pp. 328-337, Apr. 1990.
[28] P.N. Swarztrauber, "The Methods of Cyclic Reduction, Fourier Analysis, and the FACR Algorithm for the Discrete Solution of Poisson's Equation on a Rectangle," SIAM Rev., vol. 19, pp. 490-501, 1977.
[29] Connection Machine CM-5 Technical Summary. Thinking Machines Corporation, 1991.
[30] L.G. Valiant, "A Bridging Model for Parallel Computation," Comm. ACM, vol. 33, no. 8, pp. 103-111, Aug. 1990.
[31] Express 3.0 Introductory Guide. Parasoft Corporation, 1990.

Jehoshua Bruck received the BSc and MSc degrees in electrical engineering from the Technion, Israel Institute of Technology, in 1982 and 1985, respectively, and the PhD degree in electrical engineering from Stanford University in 1989. He is an associate professor of computation and neural systems and electrical engineering at the California Institute of Technology. His research interests include parallel and distributed computing, fault-tolerant computing, error-correcting codes, computation theory, and neural and biological systems. Dr. Bruck has extensive industrial experience, including serving as manager of the Foundations of Massively Parallel Computing Group at the IBM Almaden Research Center from 1990-1994, a research staff member at the IBM Almaden Research Center from 1989-1990, and a researcher at the IBM Haifa Science Center from 1982-1985. Dr.
Bruck is the recipient of a 1995 Sloan Research Fellowship, a 1994 National Science Foundation Young Investigator Award, six IBM Plateau Invention Achievement Awards, a 1992 IBM Outstanding Innovation Award for his work on "Harmonic Analysis of Neural Networks," and a 1994 IBM Outstanding Technical Achievement Award for his contributions to the design and implementation of the SP-1, the first IBM scalable parallel computer. He has published journal and conference papers in his areas of interest, and he holds patents. Dr. Bruck is a senior member of the IEEE and a member of the editorial board of the IEEE Transactions on Parallel and Distributed Systems.

Ching-Tien Ho received a BS degree in electrical engineering from National Taiwan University in 1979 and the MS, MPhil, and PhD degrees in computer science from Yale University in 1985, 1986, and 1990, respectively. He joined IBM Almaden Research Center as a research staff member in 1989. He was manager of the Foundations of Massively Parallel Computing group from 1994-1996, where he led the development of collective communication, as part of IBM MPL and MPI, for IBM SP-1 and SP-2 parallel systems. His primary research interests include communication issues for interconnection networks, algorithms for collective communications, graph embeddings, fault tolerance, and parallel algorithms and architectures. His current interests are data mining and on-line analytical processing. He has published more than 80 journal and conference papers in these areas. Dr. Ho is a corecipient of the 1986 Outstanding Paper Award of the International Conference on Parallel Processing. He has received an IBM Outstanding Innovation Award, two IBM Outstanding Technical Achievement Awards, and four IBM Plateau Invention Achievement Awards. He has patents granted or pending. He is on the editorial board of the IEEE Transactions on Parallel and Distributed Systems. He will be one of the program vice chairs for the 1998 International Conference on Parallel Processing. He has served on program committees of many parallel processing conferences and workshops. He is a member of the ACM, the IEEE, and the IEEE Computer Society.

Eli Upfal received a BSc in mathematics from the Hebrew University in 1978, an MSc in computer science from the Weizmann Institute in 1980, and a PhD in computer science from the Hebrew University in 1983. During 1983-1984, he was a research fellow at the University of California at Berkeley, and, in 1984-1985, a postdoctoral fellow at Stanford University. In 1985, Dr. Upfal joined the IBM Almaden Research Center, where he is currently a research staff member in the Foundations of Computer Science Group. In 1988, he also joined the Faculty of Applied Mathematics and Computer Science at the Weizmann Institute, where he is currently the Norman D. Cohen Professor of Computer Science. Dr.
Upfal's research interests include theory of algorithms, randomized computing, probabilistic analysis of algorithms, communication networks, and parallel and distributed computing. He is a senior member of the IEEE.

W. Derrick Weathersby is a PhD candidate in the Department of Computer Science at the University of Washington, Seattle, Washington. His current research involves compiler optimizations for collective communication primitives, portable software support for efficient collective communication libraries, and parallel programming language design.

Shlomo Kipnis (M'87) received a BSc in mathematics and physics in 1983 and an MSc in computer science in 1985, both from the Hebrew University of Jerusalem, Israel. He received a PhD in electrical engineering and computer science in 1990 from the Massachusetts Institute of Technology. From 1990-1993, he worked as a research staff member at the IBM T. J. Watson Research Center in Yorktown Heights, New York. From 1993-1995, he worked as a research staff member at the IBM Haifa Research Laboratory in Israel. Currently, he is working as manager of new technologies at NDS Technologies Israel. In addition, since 1994, Dr. Kipnis has been an adjunct professor of computer science at Bar Ilan University and at Tel Aviv University. His research interests include parallel and distributed processing, efficient communication structures and algorithms, and system security. Dr. Kipnis is a member of the IEEE, the IEEE Computer Society, ACM, and ILA. He has published in numerous journals and presented his work in many conferences and workshops. He is also an inventor and coinventor of two U.S. patents.