Optimal Path Routing in Single and Multiple Clock Domain Systems

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN, TO APPEAR.

Soha Hassoun, Senior Member, IEEE, Charles J. Alpert, Senior Member, IEEE

Abstract

Shrinking process geometries and the increasing use of IP components in SoC designs give rise to new problems in routing and buffer insertion. A particular concern is that cross-chip routing will require multiple clock cycles. Another is the integration of independently clocked components. This paper explores simultaneous routing and buffer insertion in the context of single and multiple clock domains. We present two optimal and efficient polynomial algorithms that build upon the dynamic programming Fast Path framework proposed in [17]. The first algorithm solves the problem of finding the minimum latency path for a single clock-domain system. The second considers routing between two components that are locally synchronous yet globally asynchronous to each other. Both algorithms can be used to estimate communication overhead for interconnect and resource planning. Experimental results verify the correctness and practicality of our approach.

I. INTRODUCTION

Three distinct trends will pose new routing challenges in future SoC designs. First, SoCs will utilize several IP (Intellectual Property) components, both soft and hard, like embedded processors and memories. This aspect will allow IP reuse and it will reduce time to market. A shortest path for a signal between two chip components may thus be obstructed by IP blocks. Second, the drive for higher performance will continue to push the clocking frequencies. Third, shrinking process geometries and improvements in process technologies allow building bigger dies. Multiple clock cycles will be required to cross a chip. Furthermore, if the IPs are clocked at different frequencies, as is the case with a hard IP that often has a fixed clock period, then the signal route must cross from one time domain (that of the sender) to another (that of the receiver), latched through the proper circuitry. In contrast to a system with a single clock, or a single-clock domain system, a system with multiple clocks is often referred to as a multiple-clock domain system.
The clocking scheme is referred to as Globally Asynchronous Locally Synchronous, or GALS [3], [10].

When multiple clock cycles are needed to route a signal across a chip clocked by the same clock or its derivative, three solutions are possible. The first solution is combinational, where the delay of the buffered signal from sender to receiver is more than one clock cycle. The receiver then utilizes circuitry to count a pre-designated number of cycles before latching new data. The disadvantage of this technique is that consecutive sends cannot be overlapped and the throughput of the channel is seriously degraded. A second solution is pipelining the routed signal through synchronizers (edge-triggered registers or level-sensitive latches). Buffers are inserted whenever needed to boost the electric signal and optimize the delay in between the synchronizers. The clock signal is routed to each of the synchronizers. The third solution is wave pipelining. It eliminates one or more registers along the route, thus allowing the simultaneous existence of several wavefronts along a wire. The key in making wave pipelining a success is ensuring that successive waveforms do not interfere. While reducing the clock load and easing clock distribution, wave pipelining is very sensitive to delay, process, and temperature variation effects that are even more pronounced for long routes.

When routing a signal between two synchronous domains clocked by different and unrelated clocks, the critical issue that must be addressed is metastability: the clock of the latching register and the data may switch simultaneously. The register output then settles into an undefined region that is neither a logical high nor a logical low. Several solutions have been proposed to alleviate this problem [4], [15], [14], [11].

Manuscript received August 2, 2002; revised May 19, 2003. This work was supported in part by NSF CAREER grant 0093324, and by a Tufts University Mellon grant. Soha Hassoun is with the Computer Science Department at Tufts University, Medford, MA, 02155, U.S.A. Charles J. Alpert is with IBM Austin Research Laboratories, Austin, TX, 78758, U.S.A.
The quality of the solution depends on the mean time-to-failure, which often requires an increase in latency or additional complications in the synchronization scheme.

This paper addresses two problems related to routing and buffering in future SoC designs. The first problem seeks an optimal buffered-routing path with synchronizer and buffer insertion within a single-clock domain system. The objective is to minimize the cycle latency, or equivalently, the total number of registers along the route. Constraints are imposed to ensure that when a signal is latched by a register, the resulting register-to-register delays are less than the clock period. We wish to avoid physical obstacles, to route through blocks such as datapaths, and to buffer the signal as needed to minimize the latency.

The second problem seeks to optimize the routing within a multi-clock domain system. Here, we adopt the multi-domain communication circuitry proposed by Chelcea and Nowick [4], which buffers the signal from one clock domain to another via a special circuit structure, a Multi-Clock FIFO. This MCFIFO both minimizes points of metastability in the design and allows for buffering (temporarily holding) data. If the MCFIFO is placed more than one sender clock cycle away from the sender, or more than one receiver clock cycle away from the receiver, then synchronization of the routed net to the appropriate clock is needed. Relay Stations, first proposed by Carloni et al. [2], are such synchronization elements. They provide both synchronization and temporary storage to allow the correct operation of the MCFIFO. While our algorithm specifically uses the Relay Stations and the MCFIFO, it can be easily adapted to utilize similar synchronization elements. Thus, our second problem considers simultaneous routing, buffer insertion, relay station insertion, and MCFIFO insertion to achieve the minimum time latency. This corresponds to minimizing the latency of sending data along the net when the MCFIFO is empty.

Our proposed algorithms to solve these two problems are based on the Fast Path framework proposed by Zhou, Wong, Liu, and Aziz [17]. The actual Fast Path algorithm finds a minimum delay path for a net while simultaneously exploring all buffering and routing solutions. We extend the algorithm to handle additional synchronization elements such as registers, relay stations, and MCFIFOs while imposing the timing constraints required by the physical distances and the communication protocols.

Recent work in routing across multiple clock cycles within a single domain adopts different approaches. Lu et al. describe a register/repeater block planning method during architectural floorplanning/interconnect planning stages [13]. Their method is based on identifying feasible regions in which flip-flops and repeaters can be arbitrarily inserted to satisfy both delay and cycle time constraints. Cocchini extends van Ginneken's dynamic programming framework [16] to optimally place registers and repeaters when given a tree routing topology [5]. Hassoun describes an adaptation of the Fast Path algorithm to construct a clocked buffered route using level-sensitive latches [9].

The algorithms presented here can be utilized either for interconnect planning purposes or for realizing the final routing implementation. During the design planning process, routing estimates can be achieved during architectural exploration to assess communication overhead once an initial floorplan is constructed. Wire widths and layer assignments are assumed.
The early detection of communication overhead allows architects to explore microarchitecture tradeoffs that hide communication latencies. Once a route estimate is obtained, the RTL-level design description is updated to reflect the added latency associated with multi-cycle routing, and to ensure a cycle-accurate design description that can be properly verified.

The algorithms presented here could also be used during backend design to synthesize physical routes. In this case, the route latency has been estimated, and the algorithm determines the optimal buffer and synchronizer placement. The addition of the registers may complicate the design flow, verification, and DFT. We however believe that verification and testing methodologies will evolve to accommodate the necessity of pipelined routes.

The remainder of the paper is as follows. Section II presents a detailed overview of the Fast Path algorithm. Section III introduces the single-domain routing and synchronized buffering problem and shows how to adapt the Fast Path framework to solve it. Section IV overviews MCFIFO and relay station communication schemes, thereby leading into a discussion of the problem of optimal path construction in designs realized using multiple clock domains. Finally, Section V presents experimental results and we conclude in Section VI.

II. BACKGROUND: THE FAST PATH ALGORITHM

In routing in single and multiple clock domains, we wish to explore all routing and synchronizer insertion options within a given routing area. Many aspects of the Fast Path algorithm [17] can be exploited to achieve this goal.

The Fast Path algorithm finds the minimum delay source-to-sink buffered routing path, while considering both physical obstacles (e.g., due to IPs, memories, and other macro blocks) and wiring blockages (e.g., data paths). To model physical and wiring blockages, one may construct a grid graph $G = (V, E)$ (as in [17], [1], [7]) over the potential routing area, whereby each node corresponds to a potential insertion point for a buffer or synchronization element, and each edge corresponds to part of a potential route. Edges in the grid graph which overlap wiring blockages are deleted, and nodes that overlap physical obstacles are labeled blocked.
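The grid-graph model described above can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `build_grid_graph`, the tuple-based node naming, and the example blockages are hypothetical and do not come from the paper. Nodes that overlap a physical obstacle stay in the graph (a wire may still pass over them) but are labeled blocked, while edges that overlap a wiring blockage are removed.

```python
from itertools import product

def build_grid_graph(rows, cols, blocked_nodes=(), blocked_edges=()):
    """Return (p, adj) for a rows x cols routing grid.

    p[v] is 0 when node v overlaps a physical obstacle and 1 otherwise.
    adj[v] lists the neighbours of v; edges that overlap a wiring
    blockage are deleted.
    """
    blocked_nodes = set(blocked_nodes)
    # Store each blocked edge as an unordered pair so (u, v) == (v, u).
    blocked_edges = {frozenset(e) for e in blocked_edges}

    nodes = list(product(range(rows), range(cols)))
    p = {v: 0 if v in blocked_nodes else 1 for v in nodes}
    adj = {v: [] for v in nodes}
    for (r, c) in nodes:
        for (dr, dc) in ((0, 1), (1, 0)):  # right and down neighbours
            w = (r + dr, c + dc)
            if w in adj and frozenset(((r, c), w)) not in blocked_edges:
                adj[(r, c)].append(w)
                adj[w].append((r, c))
    return p, adj

# Hypothetical 3 x 3 example: one obstacle node and one blocked wire edge.
p, adj = build_grid_graph(
    3, 3,
    blocked_nodes=[(1, 1)],            # e.g. a hard IP block
    blocked_edges=[((0, 0), (0, 1))],  # e.g. a datapath wiring blockage
)
```

Keeping the blocked node in the adjacency structure mirrors the distinction the text draws between the two kinds of blockage: one restricts where gates may be inserted, the other restricts where wires may run.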
More precisely, we define a label function $p : V \to \{0, 1\}$ where $p(v) = 0$ if $v \in V$ overlaps a physical obstacle and $p(v) = 1$ otherwise. For each edge $(u, v) \in E$, let $R(u, v)$ and $C(u, v)$ denote the resistance and capacitance of a wire connecting $u$ to $v$. We use uniform capacitance and resistance for a given length, assuming a fixed width and layer assignment. Let $R(g)$, $K(g)$, and $C(g)$ respectively denote the resistance, intrinsic delay, and input capacitance of a given buffer or synchronization element $g$. These values are determined for each buffer or synchronization element in the routing library. We use the resistance-capacitance $\pi$-model to represent the wires, a switch-level model to represent the gates, and the Elmore model to compute path delays.

A path from node $s$ to $t$ in the grid graph $G$ is a sequence of nodes $(v_1, v_2, \ldots, v_k)$ with $v_1 = s$ and $v_k = t$. An optimized path from $s$ to $t$ is a path plus an additional labeling $m$ of nodes in the path. We have $m(s) = g_s$, $m(t) = g_t$, and $m(v_i) \in I \cup \{0\}$, where $I$ is the set of buffers or synchronization elements which may be inserted on a node in the path between source and sink. Here, $g_s$ is the driving gate located at $s$, $g_t$ is the receiving gate located at $t$, and each internal node $v$ may be assigned a gate from the set $I$ or not have a gate (corresponding to $m(v) = 0$). A path is feasible if and only if $p(v) = 1$ whenever $m(v) \in I$. We assume that $m$ is initialized to $m(s) = g_s$, $m(t) = g_t$, and $m(v) = 0$ for all other $v \in V$.

Let $B$ be a buffer library consisting of non-inverting buffers. The minimum-delay buffered path problem, or the Fast Path problem, can be expressed as follows: Given a routing graph $G = (V, E)$, the set $I = B$, and two nodes $s, t \in V$, find a feasible optimized path from $s$ to $t$ such that the delay from $s$ to $t$ is minimized.

This problem can be optimally solved by the Fast Path algorithm [17] and also by the shortest path formulation proposed by Lai and Wong [12]. The latter formulation can also be extended to wire sizing. We choose to extend the Fast Path algorithm to handle the next two formulations since it does not require any lookup table computation and is likely more efficient when there is no wire sizing.
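To make the problem statement concrete, the following is a minimal Python sketch of a Fast Path-style search in the spirit of [17]. It is not the authors' implementation. It assumes uniform per-edge wire values `wire_r` and `wire_c`, represents each gate as a hypothetical `(R, K, C)` tuple, and uses the standard Elmore updates for the $\pi$-model: crossing a wire adds its capacitance to the load and adds $R(u,v)\,(c + C(u,v)/2)$ to the delay, while inserting a gate adds $R(g)\,c + K(g)$ to the delay and resets the load to $C(g)$. For brevity, a dominated candidate is rejected on arrival, and candidates already queued are not removed retroactively.

```python
import heapq
from itertools import count

def fast_path(adj, p, s, t, buffers, g_s, g_t, wire_r, wire_c):
    """Minimum Elmore-delay buffered path from s to t (sketch).

    adj: node -> list of neighbours; p: node -> 0 (blocked) or 1.
    buffers: name -> (R, K, C); g_s, g_t: (R, K, C) tuples, i.e. output
    resistance, intrinsic delay, input capacitance.
    wire_r, wire_c: resistance and capacitance of one grid edge.
    Returns (delay, labeling) or None when no path exists.
    """
    tie = count()   # tie-breaker so the heap never compares dicts
    best = {}       # node -> list of kept (c, d) pairs
    heap = []

    def push(c, d, m, v, done=False):
        if not done:
            kept = best.setdefault(v, [])
            # Prune: drop the new candidate if an existing one is no
            # worse in both capacitance and delay.
            if any(c2 <= c and d2 <= d for c2, d2 in kept):
                return
            kept.append((c, d))
        heapq.heappush(heap, (d, next(tie), c, m, v, done))

    push(g_t[2], 0.0, {}, t)            # start at the sink with C(g_t)
    while heap:
        d, _, c, m, u, done = heapq.heappop(heap)
        if done:                        # source driver already applied
            return d, m
        if u == s:
            R, K, _ = g_s
            push(0.0, d + R * c + K, m, u, done=True)
            continue
        for v in adj[u]:                # extend by one wire segment
            push(c + wire_c, d + wire_r * (c + wire_c / 2), m, v)
        if p[u] == 1 and u not in m and u != t:
            for name, (R, K, C) in buffers.items():
                push(C, d + R * c + K, {**m, u: name}, u)
    return None

# Hypothetical three-node line s - a - t with one buffer type "b1".
adj = {"s": ["a"], "a": ["s", "t"], "t": ["a"]}
p = {"s": 1, "a": 1, "t": 1}
result = fast_path(adj, p, "s", "t", {"b1": (1, 1, 1)},
                   g_s=(10, 0, 1), g_t=(10, 0, 1), wire_r=10, wire_c=10)
```

In this toy instance the unbuffered route has delay 430 while buffering at node `a` gives 242, so the search returns the buffered labeling. The search runs from the sink towards the source because the load capacitance seen at a node depends only on what lies downstream of it.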

Fast Path(G, B, s, t, m)
  Input:  G = (V, E) ≡ Routing grid graph
          B ≡ Buffer library
          s ≡ source node
          t ≡ sink node
          m ≡ initial labeling
  Var:    Q ≡ priority queue of candidates
          α = (c, d, m, v) ≡ Candidate at v
  Output: m ≡ Complete labeling of s-t path
  1. Q = {(C(m(t)), 0, m, t)}.
  2. while (Q ≠ ∅) do
  3.   (c, d, m, u) ← extract_min(Q)
  4.   if c = 0 then return labeling m.
  5.   if u = s then
         d' = d + R(m(s))·c + K(m(s))
         push (0, d', m, u) onto Q and prune. continue
  6.   for each (u, v) ∈ E do
         c' = c + C(u, v)
         d' = d + R(u, v)·(c + C(u, v)/2)
         push (c', d', m, v) onto Q and prune
  7.   if p(u) = 1 and m(u) = 0 then
  8.     for each b ∈ B do
           c' = C(b)
           d' = d + R(b)·c + K(b)
           m'(u) = b
           push (c', d', m', u) onto Q and prune

Fig. 1. The Fast Path algorithm.

The main idea behind the Fast Path algorithm is to extend Dijkstra's shortest path algorithm to do a general labeling based on Elmore delay. Let the quadruple $\alpha = (c, d, m, v)$ represent a partial solution at node $v$, where $c$ is the current input capacitance seen at $v$, $d$ is the delay from $v$ to $t$, and $m$ is a labeling for the buffered path from $v$ to $t$. The solution $\alpha_1 = (c_1, d_1, m_1, v)$ is said to be inferior to $\alpha_2 = (c_2, d_2, m_2, v)$ if $c_1 \ge c_2$ and $d_1 \ge d_2$. For any path from $s$ to $t$ through $v$, a buffer assignment of $m_1$ from $v$ to $t$ is guaranteed to not be better than a buffer assignment of $m_2$ from $v$ to $t$. Thus, $\alpha_1$ can be safely deleted (or pruned) without sacrificing optimality.

Pseudo-code of the Fast Path algorithm [17] is given in Figure 1. The core data structure used by Fast Path is a priority queue of candidates that keys off of the candidate's delay value. The algorithm begins by initializing $Q$ to the set containing a single sink candidate. Then, candidates are iteratively deleted from $Q$ and expanded either to add an edge (Step 6) or a buffer from the library (Steps 7 and 8). If the source is reached, it is pushed onto $Q$ in Step 5, and when it is eventually popped from the queue, it is returned as the optimum solution (Step 4). With each addition to the queue, candidates for the current node are checked for inferiority and then pruned accordingly. If we assume that $G$ has $n$ vertices, $|E| \le 4n$ (which is true for a grid graph), and $|B| = k$, the complexity of Fast Path is $O(n^2 k^2 \log nk)$.

III. SINGLE CLOCK DOMAIN ROUTING

We now explore the problem of finding a buffered routing path from $s$ to $t$ when multiple clock cycles are required. Routing over large distances in increasingly aggressive technologies will require several clock cycles to cross the die. Hence, one must periodically clock the signal by inserting synchronization elements (such as registers) along the signal path. In this case, one cannot simply treat a register like a buffer and add the register delay to the existing path delay in the Fast Path algorithm. The realizable delay between consecutive registers on a path will always be determined by the clock period, regardless of the actual signal propagation time. Register-to-register sub-paths with delays larger than the permissible clock cycle are not permitted.

Fig. 2. Example of the single clock domain routing. The latency is determined by the number of registers.

Let $r$ denote the register to be used for insertion, $T_\phi$ the clock period, and $Setup(r)$ the setup time for $r$. We extend the definition of node labeling to permit register assignment, i.e., $m(v) = r$ for any node $v \in V$. We also assume that the source and sink are synchronization elements, so that $g_s = g_t = r$.

We define the clock period constraint as follows: a buffer-register path is feasible if and only if $p(v) = 1$ whenever $m(v) \in I$ and the buffered path delay between consecutive registers is less than or equal to $T_\phi - Setup(r)$.

Since a register releases its signal with each clock switch, the $s$-$t$ path latency is given by $T_\phi \times (p + 1)$, where $p$ here denotes the number of registers on the $s$-$t$ path. For example, Figure 2 shows an $s$-$t$ path with three registers between $s$ and $t$, which means it takes four clock cycles to traverse from $s$ to $t$. Note that in the figure the consecutive registers have different spacing, but the delay is always measured as $T_\phi$ between registers. Figure 3 shows an example of a buffered-register path on a grid graph with both circuit and wire blockages.

The physical area occupied by a register can be different from that occupied by a buffer. The register area is a function of the underlying circuit technique. In addition, routing the clock to clock the registers might cause added congestion.
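The clock period constraint and the latency formula stated in this section can be written down directly. The helper names and the numeric values below are hypothetical and chosen only for illustration; delays are assumed to be in picoseconds.

```python
def path_latency(t_phi, num_registers):
    """Latency of an s-t path with num_registers registers between s and t."""
    return t_phi * (num_registers + 1)

def is_clock_feasible(segment_delays, t_phi, setup):
    """True when every register-to-register delay fits within T_phi - Setup(r)."""
    return all(d <= t_phi - setup for d in segment_delays)

# Three registers between s and t, as in Figure 2, give four clock cycles.
latency = path_latency(t_phi=500, num_registers=3)

# Four segments with unequal delays: the latency is still 4 * T_phi,
# provided each segment meets the constraint.
ok = is_clock_feasible([450, 300, 480, 120], t_phi=500, setup=20)
```

The second call illustrates the point made in the text: unequal register spacing changes the segment delays but not the latency, as long as every segment stays within one clock period minus the setup time.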
Our algorithm can be easily modified to allow register blockages that prevent inserting registers at undesirable grid points.

The problem of finding the minimum buffer-register path from $s$ to $t$ can now be stated as:

Problem 1: Given a routing graph $G = (V, E)$, the set $I = B \cup \{r\}$, and two nodes $s, t \in V$, find a feasible buffer-register path from $s$ to $t$ such that the latency from $s$ to $t$ is minimized. The objective is also equivalent to minimizing $|\{v : m(v) = r\}|$.

To solve Problem 1, one might initially try applying the Fast Path algorithm and treat the register like a member of the buffer library with the following caveat: candidates which violate the register-to-register delay constraint are immediately pruned. However, the Fast Path pruning scheme will not behave correctly. Consider the two partial solutions from $v$ to $t$ in Figure 4. Here $d_1$ and $d_2$ are the delays from $v$ to the first register in the top and bottom paths, respectively. The top path has delay $2T_\phi + d_1$, while the bottom path has delay $T_\phi + d_2$. Since feasibility requires that both $d_1$ and $d_2$ be no greater than $T_\phi - Setup(r)$, the bottom path delay is less than the top path delay. In addition, since there is a buffer on the bottom path close to $v$, $v$ sees less downstream capacitance on the bottom path than on the top path. Since the top path is inferior to the bottom path in terms of capacitance and delay, the Fast Path algorithm would prune the candidate corresponding to the top path.

Fig. 3. An example of routing within a single-clock domain (legend: circuit blockage, wire blockage).

Fig. 4. An example of partial routing solutions that cannot be compared for pruning (path 1 with delay $d_1$ and path 2 with delay $d_2$, both starting at $v$).

However, consider continuing the route to node $u$ on the other side of the circuit blockage from $v$. It is certainly possible that the delay from $u$ to $v$ plus $d_2$ exceeds the delay feasibility constraint, while the top path delay from $u$ to $v$ plus $d_1$ does not. In this case, only the top path can successfully be routed from $v$ to $u$ while still meeting feasibility requirements. Consequently, the top path cannot be pruned.

The key observation is that one should only prune sub-paths by comparing to other sub-paths with the same number of registers. In the previous example, comparing a one-register path to a two-register path leads to an unresolvable inconsistency. Had the bottom path had two registers, then it could not have had smaller delay than the top path. Thus, one can still use the Fast Path algorithmic framework as long as candidate propagation proceeds in waves of partial solutions wherein each wave corresponds to a different number of registers. Figure 5 shows how one can adapt the Fast Path framework to accomplish this in the Registered-Buffered Path (RBP) algorithm.

RBP Algorithm(G, B, s, t, m, r, T_φ)
  Input:  G = (V, E) ≡ Routing grid graph
          B ≡ Buffer library
          s ≡ source node
          t ≡ sink node
          m ≡ initial labeling with m(s) = m(t) = r
          r ≡ register for clocking signal
          T_φ ≡ required clock period
  Var:    Q ≡ priority queue of candidates
          Q* ≡ queue holding next candidate wave
          α = (c, d, m, v) ≡ Candidate at v
          A ≡ Marking of registered nodes
  Output: m ≡ Labeling of complete s-t path
  1. Q = {(C(r), Setup(r), m, t)}. Q* = ∅, A(v) = 0, ∀ v ∈ V
  2. while (Q ≠ ∅) or (Q* ≠ ∅) do
       if (Q = ∅) then Q = Q*, Q* = ∅.
  3.   (c, d, m, u) ← extract_min(Q)
  4.   if u = s then
         d' = d + R(m(s))·c + K(m(s))
         if d' ≤ T_φ then return labeling m.
  5.   for each (u, v) ∈ E do
         c' = c + C(u, v)
         d' = d + R(u, v)·(c + C(u, v)/2)
         if d' ≤ T_φ − K(r) − min{R(g) : g ∈ B ∪ {r}}·c' then
           push (c', d', m, v) onto Q and prune
  6.   if p(u) = 1 and m(u) = 0 then
  7.     for each b ∈ B do
           c' = C(b), m'(u) = b
           d' = d + R(b)·c + K(b)
           if d' ≤ T_φ − K(r) then
             push (c', d', m', u) onto Q and prune
  8.     if A(u) = 0 and d + R(r)·c + K(r) ≤ T_φ then
           m'(u) = r, A(u) = 1
           push (C(r), Setup(r), m', u) onto Q*

Fig. 5. The Registered-Buffered Path (RBP) Algorithm.

The primary differences between RBP and the Fast Path algorithm are as follows:

- RBP uses a second queue $Q^*$ to store candidate solutions for the subsequent propagation wave. When a register is added to a candidate that is popped from $Q$, it is added to $Q^*$ and proceeds only after the current wave is completed. This pushing onto $Q^*$ is accomplished in Step 8, whereby candidates are added only if the insertion of the register satisfies the clock feasibility constraint.
- RBP combines Steps 4 and 5 from Figure 1 into a single Step 4. RBP has the luxury of knowing that as soon as $s$ is reached, a minimum latency solution is guaranteed, hence it can immediately return the solution, as opposed to pushing it back onto the queue like Fast Path.
- When inserting a register for a candidate at node $v$, it is guaranteed to be the minimum latency candidate from $v$ to $t$ such that $m(v) = r$. Hence, there is no need to consider other candidates with $m(v) = r$. We use the array $A$ to mark whether a solution with $m(v) = r$ has been generated for node $v$ to prevent multiple candidates with registers inserted at $v$. Step 8 only adds a register to an existing candidate if none already exists.
- The clock feasibility constraint is checked before pushing new candidates onto $Q$ in Steps 7 and 8. This prevents solutions that can never lead to feasible solutions from further exploring the grid graph.

RBP proceeds by expanding all buffered-path solutions, just like Fast Path, until any further exploration violates the clock period constraint. At this point $Q^*$ contains several newly generated candidates, all of which are ready for wavefront expansion from a node with an inserted register. Step 2 dumps these candidates into $Q$ and re-initializes $Q^*$ to the empty set. These single-register candidates are then expanded, generating double-register candidates that are stored in $Q^*$, and so on. When there are no blockages, the wave-front expansion looks like Figure 6. Of course, with blockages, the wave fronts are not nearly as regular.

Fig. 6. Wave-front expansion (as in [6]), with wave fronts numbered 1 through 8.

Let $N$ be the number of nodes that can be reached from a given node in one clock cycle. When the clock period is sufficiently short, $N \ll n$, the complexity of the RBP algorithm is $O(nNk^2 \log Nk)$. The analysis is similar to that of the Fast Path algorithm [17]. Since the number of elements in the queue can be at most $O(Nk)$, the time required to insert an element into the queue is $O(\log Nk)$. Because the number of insertion operations of either buffers or latches is bounded by $nNk^2$, the complexity is $O(nNk^2 \log Nk)$. This is a lesser time complexity than the Fast Path algorithm. The computational savings occur because we do not have to waste resources exploring the many paths that violate the clock period constraint. This speedup can be seen in practice in the experimental results when observing the number of configurations that are examined as well as the run times.

While our algorithm optimizes for route latency, it can be easily modified to find the minimum-latency path that maximizes the sum of the slacks at the source and at the sink. Each candidate solution will keep track of the sink slack value. When the algorithm reaches the source (i.e., in Step 4 in Figure 5), all minimum-latency solutions are explored to find the one that maximizes the sum of the sink and source slacks.

It is also interesting to note that the RBP algorithm could be implemented in an alternative manner. Instead of using two queues, one could use an array of queues indexed by the number of registers. A new candidate is then inserted into the queue indexed by the number of registers in the solution represented by that candidate. The time complexity of the algorithm would not change, though this implementation uses more memory.

IV. MULTIPLE CLOCK DOMAIN ROUTING

A. Background

As mentioned in Section I, we adopt the MCFIFO proposed by Chelcea and Nowick [4] to route a signal between two different clock domains. The MCFIFO is the basic entity that establishes data communication between two modules operating at different frequencies. Like all FIFOs, the MCFIFO has a put interface to the sender and a get interface to the receiver. Each interface is clocked by the communicating domain's clock as illustrated in Figure 7. If the full signal is not asserted, then the sender can request a put (Put Request signal) and data is placed on the Put Data wires. The data is latched into the FIFO at the next edge of the sender clock. If the empty signal is not asserted, then the receiver can request data (Get Request signal). The data is then made available at the receiver's next clocking edge. The Get Is Valid signal determines if the data is valid.

Fig. 7. The Mixed Clock FIFO that interfaces two different clock domains [4]. Sender clock domain signals: Full, Put Request, Put Data, Sender Clock. Receiver clock domain signals: Get Request, Get Is Valid, Empty, Get Data, Receiver Clock.

Because it may take multiple sender clock cycles to route a net from its source in the routing grid to the MCFIFO, and multiple receiver clock cycles to route the net from the MCFIFO to the sink, signals must be synchronized to the clock of each domain. Chelcea and Nowick extend the concept of a single-domain relay station [2] to do so. These stations essentially allow breaking long wires into segments that correspond to clock cycles, and then a chain of relay stations acts like a distributed FIFO.

The single-domain relay station is shown in Figure 8.
It contains a main register and an auxiliary one. Initially both are empty and the control selects the main register for storing and reading the packets. When the StopIn signal is asserted, the next incoming packet is stored in the auxiliary register. StopOut is also asserted on the next clock cycle to indicate that the relay station is full and cannot further accept new data.

Fig. 8. A relay station [2]. Labeled elements: packetIn, switch, main register, auxiliary register, mux, packetOut, control, CLK, StopIn, StopOut.

To adapt the single-domain relay station to interface properly with the MCFIFO, the relay station bundles the Put Request and Put Data as the incoming packet, and the Get Is Valid and Get Data signals as the outgoing packet. The full signal in the MCFIFO is used to stop the incoming flow of packets. An MCFIFO and two adjacent relay stations are shown in Figure 9.

Fig. 9. The Mixed Clock FIFO and relay stations [4]. A relay station in the sender clock domain (packetIn, StopOut) feeds the Put Request and Put Data inputs of the Mixed Clock FIFO; the Get Is Valid and Get Data outputs feed a relay station in the receiver clock domain (packetOut, StopIn).

B. The GALS Algorithm

Although for the multi-clock domain problem we are using the MCFIFO and relay stations that require bidirectional signal flow, we abstract the communication as being single directional. We view a relay station as a register $r$ because both have similar delay properties. Given any buffered path between relay stations $r_1$ and $r_2$, if one assumes a single buffer type with the same delay characteristics as the register, then the Elmore delay from $r_1$ to $r_2$ is actually identical to the Elmore delay from $r_2$ to $r_1$. Inserting a buffer in our multi-clock domain problem formulation actually means requiring the insertion of two buffers, one for each direction of signal flow.

Let $f$ denote the MCFIFO element that must be inserted on the routing path, $T_s$ the clock period before $f$, and $T_t$ the clock period after $f$. Figure 10 shows an example where there are two clock periods between $s$ and the MCFIFO and two clock periods after the MCFIFO. Since the clocks have different periods, the total latency is given by $2T_s + 2T_t$. This corresponds to the signal flow assuming an empty MCFIFO and ignores the worst case synchronization delay within the MCFIFO that is common to all routing solutions.

Fig. 10. An example of an MCFIFO-register routing. The MCFIFO breaks the route into two clock domains, where the period is $T_s$ on the source side of the MCFIFO and $T_t$ on the sink side. The total latency is $2T_s + 2T_t$.

Our set of insertable elements is now $I = B \cup \{r, f\}$. For a multi-clock domain source-to-sink path (an MCFIFO path), we use the following conditions for feasibility: an MCFIFO path is feasible if and only if

- $p(v) = 1$ whenever $m(v) \in I$,
- $m(v) = f$ for exactly one $v \in V$,
- the buffered path delay between consecutive registers between $s$ and $f$ is less than or equal to $T_s - Setup(r)$, and
- the buffered path delay between consecutive registers between $f$ and $t$ is less than or equal to $T_t - Setup(r)$.

For example, Figure 11 shows a solution on the routing graph with a single MCFIFO with latency $T_s + 2T_t$.

Fig. 11. An example illustrating routing within a multiple domain system (legend: clocked with $T_s$, clocked with $T_t$, FIFO, circuit blockage, wire blockage).

Thus, the multiple clock domain, buffered routing problem becomes:

Problem 2: Given a routing graph $G = (V, E)$, the set $I = B \cup \{r, f\}$, and two nodes $s, t \in V$, find a feasible MCFIFO path from $s$ to $t$ such that the latency from $s$ to $t$ is minimized.

One can adopt a similar framework as in the RBP algorithm, potentially inserting an MCFIFO element for every candidate, wherever a register is inserted. We call this new algorithm GALS, for Globally Asynchronous, Locally Synchronous. There are several modifications to RBP that must be considered to obtain the GALS algorithm:

- A GALS candidate must know if the MCFIFO has been inserted, so now we use the six-tuple $\alpha = (c, d, m, v, z, l)$ where $z = 0$ if $\alpha$ does not contain an MCFIFO and $z = 1$ otherwise. Let $T(0) = T_t$ and $T(1) = T_s$ be a function which returns the required clock period for a given $z$ value. The latency $l$ is discussed below.
- GALS pruning takes place only with candidates with the same value of $z$. Two candidates with opposing values of $z$ are not directly compared for pruning. Instead of storing a single list of candidates for each grid node, now we store two lists: one for $z = 0$ and one for $z = 1$. If we have not yet inserted an MCFIFO onto the path, pruning is done with respect to the $z = 0$ list; otherwise, it is done with respect to the $z = 1$ list.
- Because $T_s \ne T_t$, one cannot find identical elements for wave front expansion as easily as in the single-clock domain case. In RBP, the number of registers obviously determined the latency. For GALS, one four-register path may have latency $2T_s + 3T_t$ while another four-register path has latency $T_s + 4T_t$. Whichever path has smaller latency must be explored first. Thus, the candidate value $l$ stores the latency from the most recently inserted register or MCFIFO back to the sink. Just like the RBP algorithm, $d$ still stores the combinational delay from the current node to the most recently inserted register or MCFIFO. The elements in $Q$ are still ordered by $d$, but the elements in $Q^*$ are ordered by $l$. We define the operation $Q = ExtractAllMin(Q^*)$ to pull all elements off of $Q^*$ with the same minimum latency and load them into $Q$. This operation extracts the next wave front of elements with equal latency from $Q^*$.
- In RBP, the first register inserted at a grid node $v$ precludes the need to insert registers at $v$ for any other path. RBP uses $A(v) \in \{0, 1\}$ to represent whether a register had been seen in a path at $v$. In GALS, we need to separate the case of inserting a register before $f$ and after $f$. Let $A_0(v) \in \{0, 1\}$ represent whether a register was inserted between $f$ and $t$ at $v$, and $A_1(v) \in \{0, 1\}$ represent whether a register was inserted between $s$ and $f$ at $v$. Also, let $F(v) \in \{0, 1\}$ denote whether an MCFIFO was inserted at $v$.

Figure 12 gives a template of the GALS algorithm. The main difference between GALS and RBP is the addition of Step 9 for inserting MCFIFO elements.
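The ExtractAllMin operation described above can be sketched with Python's standard `heapq` module. This is an illustrative sketch, not the authors' C implementation: the function name `extract_all_min`, the `(latency, seq, candidate)` entry layout, and the example latencies are assumptions. Exact ties are detected with `==`, which presumes latencies are accumulated from integer clock periods (for example, in picoseconds).

```python
import heapq

def extract_all_min(q_star):
    """Pop every entry of q_star that shares the smallest latency key.

    q_star is a heapq-managed list of (latency, seq, candidate) tuples;
    the returned list is the next wave of candidates to expand.
    """
    wave = []
    if not q_star:
        return wave
    min_latency = q_star[0][0]
    while q_star and q_star[0][0] == min_latency:
        wave.append(heapq.heappop(q_star))
    return wave

# Hypothetical pending candidates: two tied at latency 2100 and one at 2300.
q_star = []
for seq, (lat, name) in enumerate([(2300, "cand_a"), (2100, "cand_b"), (2100, "cand_c")]):
    heapq.heappush(q_star, (lat, seq, name))

wave = extract_all_min(q_star)  # the two latency-2100 candidates
```

The `seq` field only breaks ties so that the heap never has to compare candidate payloads. After the call, `q_star` still holds the latency-2300 candidate, which forms a later wave.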
Just like registers in RBP, GALS considers inserting an MCFIFO at each possible internal node as the wavefront expansion proceeds. Other differences include using $T(z)$ to look up the current clock period constraint, returning a solution in Step 4 only if it has an MCFIFO, and the wave-front queue mechanism of Step 2. If $N$ is the number of nodes that can be reached in $\max(T_s, T_t)$, the time complexity of GALS is $O(nNk^2 \log Nk)$, which is the same as the RBP algorithm.

GALS Algorithm(G, B, s, t, m, r, f, T_s, T_t)
  Input:  G = (V, E) ≡ Routing grid graph
          B ≡ Buffer library
          s ≡ source node
          t ≡ sink node
          m ≡ initial labeling with m(s), m(t)
          r ≡ register for clocking signal
          f ≡ MCFIFO element
          T_t ≡ required clock period from f to t
          T_s ≡ required clock period from s to f
  Var:    Q ≡ priority queue keyed by d
          Q* ≡ priority queue keyed by l
          α = (c, d, m, v, z, l) ≡ Candidate at v
          A_0, A_1 ≡ Marking of registered nodes
  Output: m ≡ Labeling of complete s-t path
  1. Q = {(C(r), Setup(r), m, t, 0, 0)}. Q* = ∅, A_0(v) = A_1(v) = 0, ∀ v ∈ V
  2. while (Q ≠ ∅) or (Q* ≠ ∅) do
       if (Q = ∅) then Q = ExtractAllMin(Q*); continue.
  3.   (c, d, m, u, z, l) ← extract_min(Q)
  4.   if u = s then
         d' = d + R(m(s))·c + K(m(s))
         if z = 1 and d' ≤ T_s then return labeling m.
  5.   for each (u, v) ∈ E do
         c' = c + C(u, v)
         d' = d + R(u, v)·(c + C(u, v)/2)
         if d' ≤ T(z) then
           push (c', d', m, v, z, l) onto Q and prune
  6.   if p(u) = 1 and m(u) = 0 then
  7.     for each b ∈ B do
           c' = C(b), m'(u) = b
           d' = d + R(b)·c + K(b)
           if d' ≤ T(z) then
             push (c', d', m', u, z, l) onto Q and prune
  8.     if A_z(u) = 0 and d + R(r)·c + K(r) ≤ T(z) then
           m'(u) = r, A_z(u) = 1
           push (C(r), Setup(r), m', u, z, l + T(z)) onto Q*
  9.     if z = 0, F(u) = 0 and d + R(f)·c + K(f) ≤ T(z) then
           m'(u) = f, F(u) = 1
           push (C(f), Setup(f), m', u, 1, l + T_t) onto Q*

Fig. 12. The GALS Algorithm.

V. EXPERIMENTAL RESULTS

We obtained code for the Fast Path algorithm from the authors of [17], then implemented RBP and GALS using this framework. The code is written in C and was run on a Sun Solaris Enterprise 250. To perform the experiments, we use estimated parameters for a 0.07 µm technology as reported by Cong and Pan [8]. We use a single buffer size of 100 times minimum gate width, triple wide wires, and assume delay characteristics for the registers and MCFIFOs to be identical to that of the buffers. As in [6], we use a 25 by 25 mm chip and place the source and sink 40 mm apart.
These choices ensure that a significant number of clock cycles will be required to propagate a signal from $s$ to $t$. Our first two experiments focus on the RBP algorithm while the third experiment focuses on the GALS algorithm.

A. Single Clock Domain with Varying Period

Our first experiment investigates the behavior of the RBP algorithm as a function of the clock period. Given a grid separation of 0.125 mm and a grid size of 200 × 200, we varied the number of registers that can be placed along the path that is separated by 159 grid edges.
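As a sanity check on this experiment, the latencies reported for the RBP rows of Table I can be compared against the Section III formula, latency $= T_\phi \times (p + 1)$ for $p$ registers. The triples below are transcribed from Table I (clock period in ps, reported latency in ps, register count). The check itself is only an illustration and is not part of the authors' experimental flow.

```python
# (T_phi in ps, reported latency in ps, registers) for the RBP rows of Table I
rows = [
    (1371, 2742, 1), (925, 2775, 2), (686, 2744, 3), (551, 2755, 4),
    (463, 2778, 5), (398, 2786, 6), (343, 2744, 7), (261, 2871, 10),
    (84, 3360, 39), (67, 4288, 63), (62, 4960, 79), (53, 8480, 159),
    (49, 15680, 319),
]

def consistent(t_phi, latency, registers):
    """True when the reported latency equals T_phi * (registers + 1)."""
    return latency == t_phi * (registers + 1)

# Collect any row whose latency disagrees with the formula.
mismatches = [row for row in rows if not consistent(*row)]
```

Every transcribed row satisfies the formula, so `mismatches` is empty. The last row also shows the limiting case: 319 registers on a path of 159 grid edges corresponds to one register at every internal grid point and beyond the coarse spacing of the larger clock periods.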

TABLE I
FASTREGPATH STATISTICS AS A FUNCTION OF T_φ. MAXREGSEP/MINREGSEP REFERS TO THE MAX/MIN NUMBER OF GRID POINTS BETWEEN SUCCESSIVE REGISTERS. MAX R/B SEP REFERS TO THE MAXIMUM GRID SEPARATION BETWEEN A REGISTER OR A BUFFER AND THE FOLLOWING REGISTER OR BUFFER. ON THIS GRID, THE NUMBER OF GRID EDGES ALONG A SHORTEST PATH BETWEEN SINK AND SOURCE WAS 159.

T_φ (ps)  Latency (ps)  Registers  Buffers  MaxRegSep  MinRegSep  Max R/B Sep  Min R/B Sep  Config   MaxQSize  Time (sec)
∞         2739          0          16       -          -          19           18           1014896  5951      28.95
1371      2742          1          14       160        160        21           19           918078   19759     35.41
925       2775          2          14       108        104        27           17           881092   19512     34.84
686       2744          3          12       80         80         22           19           805603   13518     30.90
551       2755          4          10       64         64         23           20           755814   12558     29.55
463       2778          5          11       54         50         29           17           694386   9981      27.50
398       2786          6          7        46         44         26           18           638676   9265      25.46
343       2744          7          8        40         40         21           19           571877   7978      22.88
261       2871          10         10       30         20         20           14           468975   6193      19.02
84        3360          39         0        8          8          8            8            78122    1722      6.57
67        4288          63         0        5          5          5            5            78246    1098      6.59
62        4960          79         0        4          4          4            4            78278    876       6.63
53        8480          159        0        2          2          2            2            78360    442       6.55
49        15680         319        0        1          1          1            1            78416    312       6.44

The results are summarized in Table I. The first data row in the table (with $T_\phi = \infty$) presents the results of running the Fast Path algorithm, where the reported latency is actually the minimum-buffered path delay. The other rows are the results of running the RBP algorithm with the indicated clock period.¹ The clock period and latencies are reported in picoseconds. The maximum and minimum register separation is given, as well as the separation between any inserted elements (i.e., a register or a buffer followed by a consecutive register or buffer). We make the following observations:

- As the clock period decreases, the number of registers along the path increases while the number of buffers and the maximum and minimum register separations all decrease. However, the maximum and minimum separations between consecutive registers and buffers do not consistently decrease unless the clock period is so small that registers are inserted every one or every other grid point.
- The number of configurations investigated (i.e., candidates popped off the queue Q in Figure 5) decreases with decreasing clock period. This empirically confirms that RBP runs faster as the clock period decreases, because the space of feasible wavefront expansion in a single cycle is reduced. Because RBP has additional overhead for register insertion, we see a run-time improvement over Fast Path only when the clock period drops below a certain threshold; in this case, for $T_\phi \le 463$.

¹ The seemingly odd choices for the clock periods are actually the fastest clock periods required to achieve the given number of registers (rounded to the nearest picosecond). For example, a $T_\phi$ of 686 is the fastest clock period that achieves a three-register solution.

B. Single Clock Domain with Varying Grid Size

Next, we investigate the behavior of RBP as a function of the clock period and the grid separation. We experimented with three grid separations: 0.5 mm, 0.25 mm, and 0.125 mm. We summarize the results in Table II. The first data row for each grid size corresponds to the results of the Fast Path algorithm, while the others represent the results of running our RBP algorithm. We observe the following:

- As the grid becomes more refined, Fast Path latency improves slightly, from 2741 to 2739 ps. The improvement may be more significant when blockages along the path are present.
- It is possible to achieve a smaller latency with a more refined grid. For example, at a clock period of 925 and a 50 × 50 grid, the latency is 3700, but it is 2775 when we use a 100 × 100 grid. In some cases, no improvements were possible, such as for clock periods 67 and 62.
- With a coarse grid and a very small clock period, it is impossible to find a routing solution, as the grid separation demands placing registers less than one grid edge apart. The finer grids allow placing the registers closer. For example, no solution was found at clock periods 53 and 49 for grid size 50 × 50, and at clock period 49 for grid size 100 × 100.
- With larger clock periods, it is possible to achieve a latency close to the optimal buffered-path delay. For example, using a 200 × 200 grid, at every listed clock period above 84 ps the latency is within one clock period of the optimal path delay of 2739 ps.

TABLE II
RBP PERFORMANCE AS A FUNCTION OF CLOCK PERIOD AND GRID SIZE. MAX. (MIN.) SEPARATION REFERS TO THE MAXIMUM (MINIMUM) BUFFER SEPARATION WHEN THE CLOCK PERIOD IS $\infty$, AND TO THE CORRESPONDING REGISTER SEPARATION OTHERWISE.

Grid separation: 0.5 mm; 50 × 50 grid.

| Period (ps) | $\infty$ | 1371 | 925 | 686 | 551 | 463 | 398 | 343 | 261 | 84 | 67 | 62 | 53 | 49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Registers | - | 1 | 3 | 3 | 5 | 6 | 7 | 7 | 11 | 39 | 79 | 79 | - | - |
| Buffers | 15 | 14 | 12 | 12 | 10 | 6 | 7 | 8 | 0 | 0 | 0 | 0 | - | - |
| Latency | 2741 | 2742 | 3700 | 2744 | 3306 | 3241 | 3184 | 2744 | 3132 | 3360 | 5360 | 4960 | - | - |
| Max. Sep. | 5 | 40 | 26 | 20 | 15 | 13 | 11 | 10 | 7 | 2 | 1 | 1 | - | - |
| Min. Sep. | 5 | 40 | 2 | 20 | 5 | 2 | 3 | 10 | 3 | 2 | 1 | 1 | - | - |
| Time (s) | 0.41 | 0.70 | 0.76 | 0.69 | 0.73 | 0.70 | 0.68 | 0.61 | 0.59 | 0.42 | 0.38 | 0.36 | - | - |

Grid separation: 0.25 mm; 100 × 100 grid.

| Period (ps) | $\infty$ | 1371 | 925 | 686 | 551 | 463 | 398 | 343 | 261 | 84 | 67 | 62 | 53 | 49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Registers | - | 1 | 2 | 3 | 4 | 5 | 7 | 7 | 10 | 39 | 79 | 79 | 159 | - |
| Buffers | 16 | 14 | 14 | 12 | 10 | 11 | 7 | 8 | 10 | 0 | 0 | 0 | 0 | - |
| Latency | 2740 | 2742 | 2775 | 2744 | 2755 | 2778 | 3184 | 2744 | 2871 | 3360 | 5360 | 4960 | 8480 | - |
| Max. Sep. | 10 | 80 | 54 | 40 | 32 | 27 | 22 | 20 | 15 | 4 | 2 | 2 | 1 | - |
| Min. Sep. | 9 | 80 | 52 | 40 | 32 | 25 | 6 | 20 | 10 | 4 | 2 | 2 | 1 | - |
| Time (s) | 3.77 | 5.63 | 5.52 | 5.10 | 4.78 | 4.45 | 4.33 | 3.69 | 3.08 | 1.63 | 1.69 | 1.61 | 1.63 | - |

Grid separation: 0.125 mm; 200 × 200 grid.

| Period (ps) | $\infty$ | 1371 | 925 | 686 | 551 | 463 | 398 | 343 | 261 | 84 | 67 | 62 | 53 | 49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Registers | - | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 10 | 39 | 63 | 79 | 159 | 319 |
| Buffers | 16 | 14 | 14 | 12 | 10 | 11 | 7 | 8 | 10 | 0 | 0 | 0 | 0 | 0 |
| Latency | 2739 | 2742 | 2775 | 2744 | 2755 | 2778 | 2786 | 2744 | 2871 | 3360 | 4288 | 4960 | 8480 | 15680 |
| Max. Sep. | 19 | 160 | 108 | 80 | 64 | 54 | 46 | 40 | 30 | 8 | 5 | 4 | 2 | 1 |
| Min. Sep. | 18 | 160 | 104 | 80 | 64 | 50 | 44 | 40 | 20 | 8 | 5 | 4 | 2 | 1 |
| Time (s) | 28.95 | 35.41 | 34.84 | 30.90 | 29.55 | 27.50 | 25.46 | 22.88 | 19.02 | 6.57 | 6.59 | 6.63 | 6.55 | 6.44 |

C. GALS for Multiple Clock Domains

Our final experiment explores the behavior of the GALS algorithm for different periods of the clock domains. Given our new problem statement, comparing to Fast Path is not possible. We simply illustrate results of the technique. We ran GALS on the same test case as in the previous experiments, using a grid separation of 0.125 mm. Table III reports the number of buffers inserted, the number of registers on the sink side of the MCFIFO (Reg-t), the number of registers on the source side of the MCFIFO (Reg-s), and the latency. The relative values of Reg-t and Reg-s indicate whether the MCFIFO was placed close to the source or to the sink. For example, when $T_s = T_t = 300$, the algorithm places the MCFIFO close to the source, but when $T_s = 200$ it places it closer to the sink. Thus, we cannot generalize the behavior of the optimal location of the MCFIFO; it depends on the blockage map, the clock periods $T_s$ and $T_t$, and the technology parameters. For all cases, we observe that the total latency is not significantly higher than the minimum source-sink delay of 2739 ps (from Table I).
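The reported latencies can be cross-checked with simple arithmetic. The Python sketch below assumes that k registers split a path into k + 1 segments, each costing one clock period of its domain, and that the MCFIFO marks the boundary between the source and sink domains. This model is inferred from the tabulated numbers rather than stated in this section, and the function names are illustrative only.

```python
# Minimum buffered source-sink delay on the 0.125 mm grid (ps), from Table I.
MIN_DELAY_PS = 2739


def single_domain_latency(period, registers):
    """Assumed model: k registers cut the path into k + 1 one-cycle segments."""
    return period * (registers + 1)


def gals_latency(t_s, t_t, reg_s, reg_t):
    """Assumed model: apply the single-domain rule on each side of the MCFIFO."""
    return single_domain_latency(t_s, reg_s) + single_domain_latency(t_t, reg_t)


# (T_phi, registers, reported latency) for the RBP rows of Table I.
TABLE_I = [
    (1371, 1, 2742), (925, 2, 2775), (686, 3, 2744), (551, 4, 2755),
    (463, 5, 2778), (398, 6, 2786), (343, 7, 2744), (261, 10, 2871),
    (84, 39, 3360), (67, 63, 4288), (62, 79, 4960), (53, 159, 8480),
    (49, 319, 15680),
]

# (T_s, T_t, Reg-t, Reg-s, reported latency) for the columns of Table III.
TABLE_III = [
    (300, 300, 8, 0, 3000),
    (200, 300, 1, 10, 2800),
    (300, 200, 10, 1, 2800),
    (300, 400, 3, 3, 2800),
    (400, 300, 3, 3, 2800),
    (250, 300, 6, 2, 2850),
    (300, 250, 2, 6, 2850),
]


def mismatches():
    """Return every table entry whose reported latency differs from the model."""
    bad = [r for r in TABLE_I if single_domain_latency(r[0], r[1]) != r[2]]
    bad += [r for r in TABLE_III
            if gals_latency(r[0], r[1], reg_s=r[3], reg_t=r[2]) != r[4]]
    return bad


def gals_overheads():
    """GALS latency minus the minimum buffered delay, per Table III column."""
    return [latency - MIN_DELAY_PS for *_, latency in TABLE_III]


if __name__ == "__main__":
    print("entries that break the assumed model:", mismatches())
    print("GALS overhead over minimum delay (ps):", gals_overheads())
```

Under this assumed model, every RBP entry in Table I and every column of Table III is reproduced exactly, and the GALS latencies exceed the 2739 ps minimum delay by 61 to 261 ps, which is less than one clock period of either domain in each case.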
TABLE III
GALS STATISTICS AS A FUNCTION OF $T_s$ AND $T_t$ WITH A GRID SEPARATION OF 0.125 MM.

| $T_s$ | 300 | 200 | 300 | 300 | 400 | 250 | 300 |
|---|---|---|---|---|---|---|---|
| $T_t$ | 300 | 300 | 200 | 400 | 300 | 300 | 250 |
| Buffers | 9 | 2 | 2 | 8 | 8 | 7 | 6 |
| Reg-t | 8 | 1 | 10 | 3 | 3 | 6 | 2 |
| Reg-s | 0 | 10 | 1 | 3 | 3 | 2 | 6 |
| Latency | 3000 | 2800 | 2800 | 2800 | 2800 | 2850 | 2850 |

VI. CONCLUDING REMARKS

Automated buffered routing is a necessity in modern VLSI design. The contributions of this paper are two new problem formulations for buffered routing, for single and multiple clock domains. Both of these formulations address problems that will become more prominent in future designs. Any CAD tool currently performing buffer insertion will eventually have to deal with synchronizer insertion. Furthermore, any SoC routing CAD tool will have to handle routing across multiple clock domains due to the increasing use of IP. We solve both problems optimally in polynomial time via the RBP and GALS algorithms, which build upon the Fast Path algorithm of [17]. Experimental results validate the correctness and practicality of the two algorithms for an aggressive technology.

ACKNOWLEDGMENTS

The authors are grateful to Hai Zhou for supplying Fast Path code, and also to Meera Thiagarajan for help with the figures and for researching the background material on the MCFIFO.

REFERENCES

[1] C. J. Alpert, G. Gandham, J. Hu, S. T. Quay, J. L. Neves, and S. S. Sapatnekar. Steiner Tree Optimization for Buffers, Blockages, and Bays. IEEE Transactions on Computer-Aided Design, 20(4):556–562, April 2001.
[2] L. Carloni, K. McMillan, A. Saldanha, and A. Sangiovanni-Vincentelli. A Methodology for Correct-by-Construction Latency Insensitive Design. In Proc. of the IEEE International Conference on Computer-Aided Design (ICCAD), 1999.
[3] D. Chapiro. Globally Asynchronous Locally Synchronous Systems. PhD thesis, Stanford University, 1984.
[4] T. Chelcea and S. Nowick. Robust Interfaces for Mixed-Timing Systems with Application to Latency-Insensitive Protocols. In Proc. of the ACM/IEEE Design Automation Conference (DAC), pages 21–26, 2001.
[5] P. Cocchini. Concurrent Flip-Flop and Repeater Insertion for High Performance Integrated Circuits. In Proc. of the IEEE International Conference on Computer-Aided Design (ICCAD), pages 268–273, 2002.
[6] J. Cong. Timing Closure Based on Physical Hierarchy. In Proceedings of the International Symposium on Physical Design, pages 170–174, 2002.
[7] J. Cong, J. Fang, and K.-Y. Khoo. An Implicit Connection Graph Maze Routing Algorithm for ECO Routing. In Proc. of the IEEE International Conference on Computer-Aided Design (ICCAD), pages 163–167, 1999.
[8] J. Cong and Z. Pan. Interconnect Performance Estimation Models for Design Planning. IEEE Transactions on Computer-Aided Design, 20(6):739–752, June 2001.
[9] S. Hassoun. Optimal Use of 2-Phase Transparent Latches in Buffered Maze Routing. In Proc. of the IEEE International Symposium on Circuits and Systems, 2003.
[10] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson, J. Öberg, P. Ellervee, and D. Lundqvist. Lowering Power Consumption in Clock by Using Globally Asynchronous Locally Synchronous Design Style. In Proc. of the ACM/IEEE Design Automation Conference (DAC), 1999.

[11] J. Muttersbach, T. Villiger, H. Kaeslin, N. Felber, and W. Fichtner. Globally-asynchronous locally-synchronous architectures to simplify the design of on-chip systems. In Twelfth Annual IEEE International ASIC/SOC Conference, 1999.
[12] M. Lai and D. F. Wong. Maze Routing with Buffer Insertion and Wiresizing. In Proc. of the ACM/IEEE Design Automation Conference (DAC), pages 374–378, 2000.
[13] R. Lu, G. Zhong, C. Koh, and K. Chao. Flip-Flop and Repeater Insertion for Early Interconnect Planning. In Proc. of Design Automation and Test in Europe Conference (DATE), pages 690–695, 2002.
[14] J. Rabaey. Digital Integrated Circuits. Prentice Hall, 1996.
[15] J-N. Seizovic. Pipeline Synchronization. In IEEE ASYNC, 1994.
[16] L.P.P.P. van Ginneken. Buffer Placement in Distributed RC-tree Networks for Minimal Elmore Delay. In Proc. of the IEEE International Symposium on Circuits and Systems, pages 865–868, 1990.
[17] H. Zhou, D. F. Wong, I.-M. Liu, and A. Aziz. Simultaneous Routing and Buffer Insertion with Restrictions on Buffer Locations. IEEE Transactions on Computer-Aided Design, 19(7):819–824, July 2000.