An Incrementally Scalable Multiprocessor Interconnection Network with Flexible Topology and Low-Cost Distributed Switching.

Ronald Pose, Vincent Fazio, Jon Wells
Department of Computer Science, Monash University, Clayton, Victoria 3168, AUSTRALIA
rdp@cs.monash.edu.au

Abstract

Massively parallel computing architectures are becoming widely accepted in many computationally intensive areas. One of the prime advantages touted is their scalability, and yet while in principle a good degree of scalability is possible, in practice the unit of scaling is very coarse. The net effect of this is to make incremental expansion of such machines impractical except for large and expensive expansions to multiples of the original size and in some cases expansion by powers of two is required. In this paper a multiprocessor interconnection scheme with low fixed overhead and linear incremental scaling cost is described.

1. Introduction

When performance limitations of conventional single processor computer systems first encouraged people to integrate a number of processors and memories into a single system the approach taken was to use either a shared memory bus or a switch interconnecting processors and memories. A shared bus provides a very cheap and simple interconnection scheme but can become saturated and lose performance with only a moderate number of processors. Switch based interconnects are much more expensive and can have high performance but are even more limited in the number of processors which can be attached directly. Various hierarchies and networks of switches have been employed in attempts to scale the systems beyond the capacity of a single switch or bus. Hence was born the idea of a massively parallel computing system with hundreds or thousands of processors interconnected either with a shared memory structure or via a structure which allows messages to be routed between the processors.

Massively parallel computing architectures are becoming widely accepted in many computationally intensive areas. One of the prime advantages touted is their scalability, and yet while in principle a good degree of scalability is possible, in practice the unit of scaling is very coarse.
The net effect of this is to make incremental expansion of such machines impractical except for large and expensive expansions to multiples of the original size and in some cases expansion by powers of two is required.

The main problem with scaling these machines to larger and larger configurations is the nature of the interconnection network. In order to get the bandwidth required to make effective use of the machines significant resources are dedicated to the interconnection network. The interconnection network tends to involve the majority of the circuitry and the cost of current massively parallel computer systems. It also tends to be built out of large switching units. Given the relative cost of the switching units and the processors, it is not cost-effective to obtain a machine with fewer than the number of processors required to occupy a switching unit. The unit of scalability is thus this number of processors at best, and depending on the switching topology may require exponential growth patterns to make effective use of the switching investment. While it may seem reasonable to go from a 4 node system to an 8 node system, when it comes to expansion granularity much coarser than this as is becoming common, the incremental cost becomes prohibitive. One has to find the money in large amounts rather than by expanding the system gradually as needs and finances permit.

We propose an architecture and interconnection topology which distributes the switching investment uniformly across the nodes of the system. It is thus scalable from a single processor-memory node up to over a thousand nodes in increments of single nodes. The architecture has been developed from the beginning with incremental scalability in mind. Engineering aspects of the implementation of the architecture have been considered at all levels so as to keep the cost of the system down and to make the system practical to build. The commercial potential of the architecture is also a factor in its design. The interconnection mechanisms are designed to be fairly independent of the processor technology being used.
The system is in fact designed not only for incremental expansion but also for incremental upgrading to the latest fastest processors without discarding the existing processors.

2. Node

The basic node in the architecture is a combined processor-memory-switching element. There is only a single kind of node in the system. In fact there are no other active components in the system. The interconnection topology is implemented entirely through low cost passive interconnections between the nodes. A variety of interconnection topologies is possible with this approach with different performance tradeoffs.

The node comprises a number of subsystems including at least one processor unit, at least one memory unit, two parallel bus ports, a high speed serial connection and a switching scheme to route traffic between these elements. The processor units can read and write the local memory units and can direct memory references to memory units on other nodes through the bus ports or the fast serial link. The memory units can accept operations from the local processor units, or from either bus port or the fast serial link. Facilities are also available to allow a processor to instruct a memory unit to copy data into another memory unit in a different node. The switching hardware handles the routing between these elements and indirectly through the larger interconnection topology. In effect the system functions as a distributed shared memory with a non-uniform access time. A node is depicted in figure 1.

[Figure 1. Structure of a Node. Labels: Processor, Processor, Switch, Memory, Memory, Bus I/F, High Speed Serial Link, Bus I/F, Bus, GHz Link, Bus.]

[Figure 2. Planar 4 node per bus topology. Labels: Node, Bus. Annotations: additional back-to-back connection points to convert planar into cylindrical topology; additional back-to-back connection points to convert cylindrical into spherical topology.]

Because there is only a single type of node the system is well suited to economical commercial mass production. A single node contains memory and processor on a single board and could form the basis for a conventional workstation-style computer sitting on someone's desk. A slightly larger system with a moderate number of nodes can sit in the corner of the room. A large system with over a thousand nodes could live in a machine room. Identical boards are used in each and extra boards may be added one at a time.

2.1 The Processor

A node may contain one or more processor units. While a processor unit typically is based on a 64-bit RISC microprocessor, the only real requirements are that it be a device which can generate addresses and data and accept data it may have requested. The communication protocol and mechanism used has to be consistent with that employed by other units in the node but need not be identical on every node. A processor unit may be implemented as a plug in module or daughter board to allow for easy maintenance and future upgrades. Typically a processor unit is matched in speed and technology with the memory units in the same node however this is not essential as long as reliable communication is achieved.
It is to be expected that modern high speed processor units will contain appropriate caches to minimize the impact of memory latency and to minimize the amount of memory traffic.

2.2 The Memory

A node may contain one or more memory units. A memory unit typically will comprise a memory array, associated error detection and correction and control logic, and a controlling device which can be instructed to copy a block of data to another memory unit via one of the bus interfaces or over the high speed serial link. Obviously it must also be able to accept such a block of data from a bus interface. In a node equipped with processor units based on modern 64-bit RISC technology, it would be expected that a memory unit would contain at least 64 Mbytes of memory. Logically though it must be a module which accepts an address, a function and perhaps some data, and may if required return some data. It must be able to communicate with both the bus interfaces and serial link and with the processor units on the node. As with processor units, a memory unit may be implemented as a plug in module or daughter board so as to allow upgrading to higher performance or larger memory sizes.

2.3 External Communication Interfaces

While a processor unit and a memory unit together make up a node which is the equivalent of a modern single board computer, the aim of this system is to allow such machines to be scaled into large multiprocessor systems. For this to be achieved there has to be a means of communicating between nodes. Two different kinds of external communication interface have been devised for this purpose. Reasons for having both mechanisms are discussed in the engineering section.

Two bus interfaces are provided on each node. These are 64-bit parallel bus interfaces with associated control logic and an implementation of a two level communication protocol for communicating with other nodes. The other end of the bus interface communicates with all the other units in the node, including the other bus interface and the high speed serial link. It is expected that large buffers would be included in the bus interfaces.
These could act as a kind of write buffer for the other units allowing them perhaps to continue with other activities while internode transfers are taking place.

Another key feature is that there is no common clock in the system as a whole. Our design philosophy of a single kind of active node interconnected passively precluded the conventional centralized clock distribution approach which anyway is rather difficult to implement in a large scale system. Each node runs at its own speed. This avoids a major clock distribution problem and allows upgrading to faster nodes on an individual basis rather than having to upgrade the whole system. It also means that each bus interface transmits data using its own clock since no other clock is available. By transmitting the clock along with the data, bus interfaces can receive data using the clock at which it was transmitted. An architecture employing FIFOs as both buffering elements and synchronization elements has been devised (Castro & Pose 94).

The high speed serial link performs the same function as the bus interface units in that it allows units to communicate with units in other nodes. The technology employed in the high speed serial link is quite different. Instead of being based on a 64-bit parallel bus, the external interface of the high speed serial link is a pair of unidirectional 1 GHz serial connections over coaxial cable or fibre optic cable. Even at 1 GHz this link will be slower than the 64-bit parallel interface which could run at over 50 MHz, however the technology employed allows for different physical arrangements of the system as will be discussed later.

2.4 Switching within a Node

The key to our approach is that the switching component which dominates the traditional massively parallel machines has been distributed equally across all nodes and is a moderate overhead on each board which can even be ignored for a single node workstation.

The various units making up a node indeed need to communicate with one another. Because there are only a small number of units in a node it is not necessary to invest in large scale expensive switching apparatus. It is wise however to allow each of the units to operate to its capacity so a single bus interconnecting all the units is not appropriate. At the other extreme a small cross-bar switch interconnecting the units is unnecessary although not too unreasonable given the small number of units. A good performance tradeoff can be achieved by employing a small number of buses which allow the most heavily used inter-unit communication to proceed concurrently.

3. Interconnection Network Topology

The nodes form a simple building block from which one can construct large systems interconnected in a variety of topologies. Given that the off-node interconnection hardware has to be passive to preserve our goal of a single active component type, and that it must be possible to have practical implementations from the engineering viewpoint, we have devised a scheme which allows 6 different topologies to be implemented with differing characteristics.

3.1 Planar Topology with 4 Nodes per Bus

In this topology each node is connected to two buses, one for each bus interface. Every bus in the system has a maximum of 4 nodes on it. Remember that buses are just a passive set of parallel wires. A planar topology with 4 nodes per bus is depicted in figure 2.

3.2 Cylindrical Topology with 4 Nodes per Bus

While it is indeed possible for any node in the system to communicate with any other node by passing data through intermediate nodes it is clear that the path lengths and hence the communication time will be non-uniform. It is possible to reduce the maximum path length by rolling the plane into a cylinder. For instance if one examines the planar topology depicted in figure 2 one will see that there are buses on the left and right ends of the plane which have only 2 nodes on them. Logically one may combine these buses into single buses with 4 nodes each. This may be likened to rolling the plane into a vertical cylinder but is better realized from the engineering standpoint by producing a flattened cylinder by placing two smaller planar backplanes back-to-back.

3.3 Spherical Topology with 4 Nodes per Bus

Just as the observation was made that the left and right ends of the planar topology were not fully populated, it is also evident that some buses at the top and bottom of the planar topology and the cylindrical topology are also not fully occupied. Logically these may also be combined by folding in the other direction and you actually end up with a kind of spherical topology.
This also works out very well from the engineering viewpoint in that the back-to-back flattened cylinder construction also has the appropriate under occupied buses for a flattened sphere located back-to-back allowing a simple interconnection to combine them. For a given number of nodes a spherical topology has a shorter average path length than a cylinder, which itself has a shorter path length than a planar topology.

3.4 Topologies with 6 Nodes per Bus

While we indeed have reduced the maximum and average path lengths between nodes in going from a planar topology to a cylindrical one to a spherical one, it is possible to further reduce the average path length by increasing the number of nodes per bus. By doing this one needs fewer buses for a given number of nodes. Of course one increases the traffic on each bus by doing this. A planar topology with 6 nodes per bus is depicted in figure 3.
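
The trade-off between nodes per bus, bus count and direct connectivity can be checked with a little arithmetic. The following is a minimal sketch, not part of the design itself: the function names are hypothetical, and it assumes that every bus is fully populated (as in the spherical arrangement) and that no two nodes share more than one bus. Under those assumptions it reproduces the figures of 6 and 10 directly connected nodes quoted in the evaluation section.

```python
import math


def direct_neighbours(nodes_per_bus: int, buses_per_node: int = 2) -> int:
    """Other nodes reachable in a single bus hop.

    Assumes each bus the node sits on is fully populated and that no two
    nodes share more than one bus, so no neighbour is counted twice.
    """
    return buses_per_node * (nodes_per_bus - 1)


def min_buses(num_nodes: int, nodes_per_bus: int, buses_per_node: int = 2) -> int:
    """Lower bound on the number of buses needed.

    Each node occupies one slot on each of its buses, so the total number
    of slots is num_nodes * buses_per_node. Planar layouts with
    under-occupied edge buses will need more than this bound.
    """
    slots = num_nodes * buses_per_node
    return math.ceil(slots / nodes_per_bus)


if __name__ == "__main__":
    for k in (4, 6):
        print(k, direct_neighbours(k), min_buses(64, k))
```

For a 64 node backplane this gives a bound of 32 buses at 4 nodes per bus and 22 buses at 6 nodes per bus, which illustrates why raising the number of nodes per bus needs fewer buses while loading each one more heavily.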

[Figure 3. Planar 6 node per bus topology. Labels: Node, Bus.]

Note that while this topology can also be connected into cylindrical and spherical arrangements, due to the asymmetrical bus shape one can get different numbers of nodes per bus around the edges of the backplane. It is even possible to end up with 8 nodes on some edge buses but this can be avoided by removing a node.

Note that this depiction of a 6 node per bus topology indicates that the buses are bent. It is useful to employ such buses in our system since we do not want to violate our design principle of a single kind of node, and by arranging bent buses we can plug in nodes identical with those used in the 4 node per bus system. It is also very convenient from the engineering perspective to use such bent buses. Careful comparison of figures 2 and 3 reveals that the vertical parts of the buses in figure 3 correspond to nodes in figure 2. In fact one can take the identical backplane structure that was used for the 4 node per bus topology and just replace every second node board with a passive set of wires linking the two buses. This may be achieved with a dummy node board only containing wires with no active components.

Just as in the case of topologies with 4 nodes per bus, it is also possible to fold the plane into a cylinder or sphere. In fact the identical structure of back-to-back backplanes can be employed. All six topologies may be created from common components. One just plugs in the nodes or wires to form the desired topology.

3.5 Another Dimension

Thus far we have only considered the use of the bus interface units to interconnect nodes. Certainly it may be reasonable to construct systems with 32 or 64 or even more nodes plugged into a backplane, however the path length will be increasing with the size of the system and loading on the buses could also become too great due to a lot of traffic just passing through. We still have the high speed serial links running at 1 GHz. Conceptually we have the nodes arranged in planes, cylinders, or spheres, and we have engineered the system to have the node boards plugged into back-to-back backplanes. One could also stack these topologies in another dimension.
While that would be difficult to arrange physically with backplane buses, the flexibility afforded by the use of long thin cables for the high speed serial link allows us to realize another dimension of interconnection. This may be visualized by considering the planar topology using a single backplane, and then stacking such backplanes and nodes above one another with the high speed serial links running vertically between the nodes which are directly above one another. The same approach works with cylindrical and spherical topologies since we are not physically constrained using the high speed serial links and can even geographically distribute the system to some extent. Consider the system to be layers of planes, cylinders or spheres.

A limitation of the technology used to build the high speed serial link restricts us to a single transmitter per link. This means that a bus structure is not possible. A single transmitter can send data to receivers on nodes vertically aligned in the conceptual topology. Thus a node can transmit to another node on every layer of this stacked topological structure. In order to ensure that the system is fully interconnected it is necessary to ensure that there is at least one transmitter on each layer. This means that the overall system can have no more layers than there are nodes per layer. It can however have fewer layers.

A fairly large system may well comprise 2048 nodes arranged in 32 layers of cylindrical 4 node per bus topology backplanes, each containing 64 nodes. Each node may well contain 2 64-bit processor units and memory units containing 128 Mbytes. Thus the system would contain 4096 processors able to read and write 256 Gbytes of memory.

4. Routing

Having devised a system in which we can have thousands of nodes it is necessary to be able to move data efficiently between them. Our topologies allow multiple paths between any two nodes. Of course not all paths are equal; some will be longer than others, however in general there will often be multiple possible paths of minimum length between nodes.
The redundancy of paths available in our system is one of its strengths, providing the potential for a degree of fault tolerance and the potential to distribute traffic so as to avoid hot spots and other problems. One does however have to determine the path that data will take in traversing the system.

We are still bound by the constraints we have imposed, that all nodes be equivalent and that there be no other active components. This implies that any form of dynamic routing scheme will have to be distributed across all nodes. We also set ourselves an aim that the scalability of the system should be incremental and that we do not wish to bear the cost and complexity associated with a large system when we may only have a small system, even a single node. A fairly static routing scheme would seem to be indicated to meet these criteria.

In fact the node structure and topologies we have devised allow an extremely cheap and simple routing strategy to be employed. Upon examining the node as depicted in figure 1 you will notice that there are very few possible ways for data in a node to go. The possibilities are that the data could go to one of the memory units or processor units, which in the sample node depicted are on a common processor-memory bus, or the data could be destined for one of the bus interfaces or high speed link. Assuming that data originating in a unit on the node's processor-memory bus and destined for another unit on that bus can just be handled in the conventional manner for buses, then we only have to worry about data originating in a node which is destined for other nodes. There are only three ways to get to another node, the two bus interfaces and the high speed serial link. Hence a 2 bit wide look-up table can be used to indicate which way to go. If we use the most significant bits of the physical address as a node number and we only allow a maximum of say 4096 nodes then such a look-up table will fit in a single memory chip.

Of course data may not have originated in a particular node. In such a case it must have entered the node through one of the three external interfaces and once again can have three possible destinations, the other two external interfaces or a unit on the processor-memory bus. Once again a similar look-up table can be used in each case.

The operating system can configure the look-up tables to reflect appropriate routing for the various possible interconnection topologies. In fact the operating system can determine the actual interconnections in the system by probing the directly connected buses to determine its nearest neighbours and exchanging information between neighbouring nodes so as to build up a map of the system. A routing algorithm can then be used to determine the routing information to be loaded into the various look-up tables. Upon failure in part of the interconnection network the routing look-up tables could be changed to avoid the problem.
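
The look-up table scheme described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the implemented design: the 64-bit physical address width, the port encoding, the names `Port`, `node_of`, `build_table`, `route` and `EXAMPLE_LINKS`, and the use of breadth-first search as the routing algorithm are all assumptions made for the example. It keeps a single table per node for brevity, whereas the text describes a similar table for each way data can enter the node.

```python
from collections import deque
from enum import IntEnum


class Port(IntEnum):
    """Four choices, so each table entry fits in 2 bits."""
    LOCAL = 0   # a unit on this node's processor-memory bus
    BUS_A = 1   # first bus interface
    BUS_B = 2   # second bus interface
    SERIAL = 3  # high speed serial link


ADDR_BITS = 64   # assumed physical address width
NODE_BITS = 12   # up to 4096 nodes, as in the text
TABLE_BYTES = (1 << NODE_BITS) * 2 // 8  # 4096 entries of 2 bits each


def node_of(addr: int) -> int:
    """Node number held in the most significant bits of the address."""
    return addr >> (ADDR_BITS - NODE_BITS)


def build_table(links: dict, source: int) -> dict:
    """Breadth-first search over the system map from `source`.

    `links[n][port]` lists the nodes directly reachable from node n
    through that port. Each destination is mapped to the port of the
    first hop on a shortest path; unreachable nodes are left out.
    """
    table = {source: Port.LOCAL}
    queue = deque()
    for port, neighbours in links[source].items():
        for n in neighbours:
            if n not in table:
                table[n] = port
                queue.append(n)
    while queue:
        here = queue.popleft()
        for neighbours in links[here].values():
            for n in neighbours:
                if n not in table:
                    table[n] = table[here]  # inherit the first-hop port
                    queue.append(n)
    return table


def route(table: dict, addr: int) -> Port:
    """Select the outgoing port for a memory reference."""
    return table[node_of(addr)]


# A hypothetical four node chain, used only to exercise the code.
EXAMPLE_LINKS = {
    0: {Port.BUS_A: [1]},
    1: {Port.BUS_A: [0], Port.BUS_B: [2]},
    2: {Port.BUS_A: [3], Port.BUS_B: [1]},
    3: {Port.BUS_A: [2]},
}
```

With these assumptions, `build_table(EXAMPLE_LINKS, 1)` sends references for node 0 out of `BUS_A` and references for nodes 2 and 3 out of `BUS_B`. Reloading the table after a link failure amounts to rerunning `build_table` on the updated map.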
It is also possible to handle transient network problems such as congestion by allowing for some dynamic rerouting. If data was not destined for the local node then it is possible to choose to ignore the advice given in the routing look-up tables and direct the data on another path. Given the redundancy in the interconnection network it is quite possible that an alternate path will be found even if not the shortest one. Of course one would not choose to ignore the routing look-up table unless there were a good reason such as a link failure. In that way the intrinsic fault tolerant aspects of the system can be exploited at very low cost. We cannot of course guarantee that a path will be found in the case of failures in the network but with appropriate routing algorithms used to load the look-up tables, one could maximize the chances of this simple-minded approach finding a useful path in a reasonable time.

5. Evaluation of the Interconnection Scheme

The interconnection scheme outlined above has several advantages over schemes employed in current massively parallel machines. We set ourselves a challenging aim in trying to achieve acceptable performance and yet not wanting to bear the cost of such performance in terms of complexity, scalability, and of course the actual expense of building the system. To that end we reduced the complexity down to a kind of simple switching structure which forms a small part of each node and which is controlled by an extremely cheap and simple routing system based on small look-up tables. The cheap incremental scalability is thus ensured as long as we can provide the required performance.

The conventional approach to achieving performance is to shorten the average and maximum paths that data has to traverse and to replicate the data paths so as to allow concurrent operation on many paths. By shortening the paths one reduces latency and by increasing concurrency one increases bandwidth. Existing systems have tended to take the approach of large and expensive switches to increase these parameters or else the use of various higher dimensionality networks such as hypercubes.
Both these approaches lead to poor incremental scaling or poor fault tolerance. We have indeed taken the maxim of increasing the degree of connectivity and increasing the degree of concurrency, however our methods are quite different. Taking the bus as the simplest multiple node interconnection scheme, we have recognized that it has limited bandwidth, and so as not to saturate it, we limit the number of nodes on it to a small number, 4 or 6. To increase the degree of connectivity we have instead connected the nodes to two buses, and to a high speed serial link. In this way we have achieved our aim without overloading individual buses. As a side benefit this leads to a degree of fault tolerance. In a similar fashion we can use different topologies with various tradeoffs between bus loading and path lengths.

It is instructive to compare this topology with that of a mesh which has been employed in many systems. A mesh typically has 4 or 6 external connections per node. Because of physical constraints this is only possible if the connections are not too wide. We have achieved an effectively higher degree of interconnectivity with a far simpler system. Using 4 nodes per bus we get each node connected directly to 6 other nodes; using 6 nodes per bus we get each board connected directly to 10 other nodes. That is ignoring the high speed serial links which could significantly increase that connectivity. It thus appears that we can achieve similar performance characteristics with lower cost and better scalability characteristics.

6. Engineering

Right from the beginning the system was designed with the engineering practicalities in mind. This led to a rather pragmatic approach to devising the various interconnection topologies. Nothing was considered that could not be built in a simple and economical way. The system described is realizable using off the shelf components. While its performance could be improved through the employment of aggressive technology it was desired first to get a clean and simple design which allowed not only for scaling of the system but also for incrementally updating the technology employed in various components.
As such the nodes were defined as comprising separate processor units and memory units and interface units, each of which could be upgraded to the latest technology. The interconnections themselves were specified as being passive. This was to allow for economical commercial production. By sticking rigidly to a single active entity in the system, the node, we have ensured that, should the system prove commercially viable, mass production of a single component can keep expenses down.

The backplane and bus layout is such that mechanically rigid assemblies can be made with solid connectors providing structural rigidity as well as electrical connectivity. The high speed serial links not only dramatically increase the effective connectivity but they allow the assemblies to be split into separate cabinets and even geographically distributed. They also provide an ideal point of attachment of input / output systems. The serial interconnection is a bit slower than the parallel bus interconnection and may be considered equivalent to a certain number of hops across the backplane buses by the algorithms placing work on the processors.

While it may seem pointless to employ our planar or cylindrical topologies it is possible that one may want to attach other devices including input / output devices to the system. The lightly loaded buses, ideally located on the rim of the backplanes, allow convenient interconnection to external systems.

It is also possible to use some simple logic to allow the topology to be changed without physically reconfiguring the system. Since the topology is defined essentially by the presence or absence of connections, instead of physically connecting and disconnecting the components, the connectors could trivially be fitted with electronic enables. Although that would violate the aim of only having the nodes being active components, the cost of adding such logic to the backplanes would be insignificant and could be an option for experimental machines or in special cases where the workload demanded a variety of topologies.

7. Project Status

The system outlined in this paper is currently being designed and implemented in the Department of Computer Science at Monash University. A shared memory paradigm is implemented although this allows for message passing algorithms with no overhead. Performance is related to the locality of memory references thus managing the location of processes and their data is important. The project also includes a capability-based operating system which provides a persistent global virtual memory.
This operating environment is based on the Password-Capability System (Anderson, Pose, Wallace 86) which ran on earlier generation shared bus multiprocessor hardware employing novel address translation techniques (Pose, Anderson, Wallace 87; Pose 89).

8. Conclusion

We have outlined a multiprocessor interconnection scheme with which a non-uniform memory access, distributed memory multiprocessor can be produced. It has superior properties with regard to incremental scalability compared with most competing systems. Attention to engineering considerations has led to an economical system with great commercial potential. While the absolute performance of this system and its interconnection network will not reach the levels of some of the more elaborate schemes being used elsewhere, the simplicity and low cost of the approach should make the system very attractive in terms of performance per dollar and its incremental scalability and the ability to upgrade gracefully to faster processors adds to its attraction.

References

Anderson, M.S., Pose, R.D., Wallace, C.S., "A Password-Capability System", The Computer Journal, Vol. 29, No. 1, 1986, pp. 1-8.

Castro, M. and Pose, R., "The Monash Secure RISC Multiprocessor: Multiple Processors without a Global Clock", Australian Computer Science Communications, Vol. 16, No. 1, 1994, pp. 453-459.

Pose, R.D., Anderson, M.S., Wallace, C.S., "Implementation of a Tightly-Coupled Multiprocessor", Australian Computer Science Communications, Vol. 9, No. 1, 1987, pp. 330-340.

Pose, R.D., "Capability Based Tightly Coupled Multiprocessor Hardware to Support a Persistent Global Virtual Memory", Proc. 22nd Hawaiian Int. Conf. on Sys. Sci., Vol 2, 1989, pp. 1-10.