Hardware Implementation of Improved Adaptive NoC Router with Flit Flow History based Load Balancing Selection Strategy

Parag Parandkar 1, Sumant Katiyal 2, Geetesh Kwatra 3
1,3 Research Scholar, School of Electronics, Devi Ahilya University, Indore, M. P., India
2 Professor, School of Electronics, Devi Ahilya University, Indore, M. P., India

Abstract
Several techniques exist in the literature to improve load balancing in NoCs, such as Regional Congestion Awareness (RCA). Other techniques base output port selection on metrics such as the count of free virtual channels, the count of fluid buffers, the buffer occupancy time at reachable downstream neighbors, and flit flow history, the last being used by the algorithm named Tracker. Among these techniques, Tracker has performed significantly better than the others. However, Tracker has only been simulated and verified using a NoC simulation tool, and no hardware implementation of a flit flow based algorithm exists in the literature yet. The proposed work is new in that no hardware implementation of the Tracker architecture has been reported so far in the Network on Chip research literature. It implements an improved version of the flit flow based technique used by Tracker on programmable hardware (Xilinx Virtex-5 FPGA) and achieves a clock frequency of 686.81 MHz, as validated by experimental synthesis results. The improvement over the existing architecture was brought about by inserting additional buffers in the Tracker internal logic to achieve a better area / performance trade-off for chip multiprocessors.

Keywords
NoC, Tracker, Adaptive routing, MOE, Load balancing routing, Virtual Channel router, Verilog, Virtex FPGA.

I. INTRODUCTION
Over the last decade, optimizing the computational efficiency of intellectual property cores has been the preferred direction of technological innovation in Systems on Chip (SoCs).
At the same time, reliable and efficient communication must be given more emphasis if SoCs are to achieve high performance and throughput. This is because wire delays are becoming comparable to gate delays as feature sizes continue to decrease [1]. Networks on chip are considered efficient replacements for other forms of interconnect in chip multiprocessors and system on chip designs [1, 3]. A Network on Chip (NoC) architecture is defined by a combination of topology, routing algorithm, switching technique, power optimization and so on. Among the different topologies, the mesh is considered the most competent.

In the 2D mesh topology, the processing cores are arranged as rectangular tiles. Each processing IP core is connected to a local router, and the routers are interconnected with one another to form the mesh. Communication among the IP cores takes place in units of packets, flits and phits at different levels of abstraction. Among the three switching techniques (store and forward, wormhole, and virtual cut through), wormhole switching is mostly preferred because it needs less buffer space and therefore less area and power. In wormhole switching, each packet is subdivided into a sequence of flow control units called flits. The control information is carried by the header flit. The router of a 2D mesh topology contains five bidirectional ports: one for each of the directions east, west, north and south, and one for the local tile (IP core). Every input port of a router can optionally be associated with a set of flit buffers, termed virtual channels (VCs). The use of virtual channels is a trade-off between the reduction in average network latency and the area and power cost of the added buffers [5].
The available buffer space information is exchanged among the routers through control signals. Flits belonging to several VCs within the same input port arbitrate among themselves; the successful flits from the various input ports then undergo switch arbitration and allocation, and are finally forwarded through the crossbar switch to their respective output ports [2]. A NoC architecture typically chooses among deterministic, oblivious and adaptive routing algorithms to determine the route taken by a packet to its destination. Although adaptive routing is more complex to implement, it is still preferred because of its better fault tolerance, higher network throughput and lower network latency compared to oblivious policies under non-uniform or bursty traffic. Despite these advantages, the performance of adaptive routing degrades when decisions are taken on local network load alone, which disturbs load balancing within the NoC.
This may create undesired hot spots. For the unrestricted flow of packets within the routers, buffers along with virtual channels are used. At the same time, handshaking mechanisms based on a set of control signals keep information synchronized among the routers. Thus computational units such as IP cores communicate efficiently among themselves by producing and processing data packets and control signals through the NoC infrastructure. Adaptive routing chooses the best route for incoming packets from the set of available routes by using a selection metric that captures the dynamically varying congestion status [3]. Recent Network on Chip architectures generally need to satisfy requirements such as low latency, load balancing and deadlock free routing as far as possible in order to keep the routing implementation optimized. At higher packet injection rates, the capability to handle the flow of packets among neighboring routers and the number of allocated virtual channel buffers pose the biggest limitations. Adaptivity in a typical adaptive router is realized when it chooses the best possible output port for an incoming flit. For this, the output port selection function chooses one of the output ports for the flit using a metric that reflects congestion [6]. The metric chosen to represent congestion determines the selection strategy. Link utilization of the network can be improved by balancing the traffic across all the links. The flit flow history based analysis method proposed by Tracker [2] was chosen over existing metrics such as the availability of free virtual channels [5, 6], buffer fluidity values [10] and buffer occupancy values [11].
Routing decisions are taken in such a fashion that less frequently used links are preferred.

II. RELATED WORK
The Tracker algorithm uses minimal odd-even routing (MOE), which is one of the simplest and most commonly used deadlock free adaptive routing algorithms in mesh NoCs [4]. On its own, the MOE algorithm selects randomly among the available ports; a selection function has to work on top of MOE to complete the routing function. The use of load information of neighboring switches for channel selection is described in [7]. A load balancing routing scheme based on random channel selection is proposed in [9]. Based on the past flow pattern, the authors of [8] estimate the network's congestion level and deterministically calculate optimized routing paths for all traffic flows. The count of free virtual channels in the adjacent downstream routers is explored as a selection metric in [5, 12]. The free VC status of the reachable neighbors of the routers adjacent to the current node is investigated in [6]. The count of fluid buffers is explored in [10]. The history of buffer occupancy within a realistic time interval is discussed in [11].

The Tracker architecture includes a virtual channel router [3] that monitors the flow of flits through all its outgoing ports and exchanges this flit flow information with its neighbors. The flit flow information is computed as the Cumulative Flit Count (CFC), which indicates the contention level of an output port of a neighboring router. The architecture of the selection logic implemented in the Tracker router is shown in Fig. 1.

Fig. 1: Internal architecture of internal logic in Tracker [2]

The forwarding of a flit in a Tracker router is illustrated in Fig. 2. Flit F is sourced from node 4 and is destined to node 15. It first reaches node 5. For flit F, the MOE routing function [3] chooses the east port (link to node 6) and the north port (link to node 9) as the possible output ports.
At the same time, the flip flop (out) values from node 9 and node 6 reach node 5.
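The selection step described above can be sketched behaviorally as follows. This is a minimal Python model rather than the authors' Verilog. The function name `select_output_port`, the dictionary of CFC values and the tie-breaking rule (the earlier candidate wins) are assumptions made for illustration; the text only states that less frequently used links are preferred. The CFC numbers are hypothetical.

```python
def select_output_port(candidates, cfc):
    """Pick the admissible output port whose downstream neighbor reports
    the lowest Cumulative Flit Count (CFC), i.e. the least used link.

    candidates: ports allowed by the routing function, in priority order
    cfc:        mapping port -> CFC value received from that neighbor
    Ties go to the earlier candidate (an assumption, not from the paper).
    """
    if not candidates:
        raise ValueError("routing function returned no admissible port")
    return min(candidates, key=lambda port: cfc[port])


# Flit F at node 5: MOE offers east (towards node 6) and north (towards node 9).
candidates_at_node5 = ["east", "north"]
cfc_seen_by_node5 = {"east": 7, "north": 3}  # hypothetical values
chosen = select_output_port(candidates_at_node5, cfc_seen_by_node5)
print(chosen)
```

With these hypothetical counts the north link is the less used one, so the flit would be sent towards node 9.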
Fig. 2: Router Architecture Network Model (4 x 4 mesh, nodes 0 to 15; source S at node 4, destination D at node 15)

III. PROPOSED IMPLEMENTATION
The proposed architecture of the improved Tracker design is shown in Fig. 3. Additional buffers are inserted in the existing Tracker internal logic to achieve a better area / performance trade-off for chip multiprocessors. The router network architecture model (Fig. 2) designed in [2] is implemented using the improved internal architecture. Sixteen nodes are configured in a 4 x 4 mesh network model, and the behavior of the Tracker (node) is tested within the network. Each node is called a test_node, and an array of 16 such nodes connected together forms a typical NoC architecture. The modified router architecture network model is designed in Verilog HDL as a module named router_top. It has the following groups of input and output signals:

Fig. 3: Improved internal architecture of internal logic of tracker

Inputs:
a) clk, and reset of 16 bits.
b) west_data0_in, south_data0_in, south_data1_in, south_data2_in, south_data3_in, east_data3_in, east_data7_in, east_data11_in, east_data15_in, north_data15_in, north_data14_in, north_data13_in, north_data12_in, west_data12_in, west_data8_in, west_data4_in, of 8 bits each.
c) req0_w, req4_w, req8_w, req12_w, req12_n, req13_n, req14_n, req15_n, req15_e, req11_e, req7_e, req3_e, req0_s, req1_s, req2_s, req3_s, of 1 bit each.
d) busy0, busy1, busy2, busy3, busy4, busy5, busy6, busy7, busy8, busy9, busy10, busy11, busy12, busy13, busy14, busy15, of 4 bits each.

Outputs:
west_0_out, south_0_out, south_1_out, south_2_out, south_3_out, east_3_out, east_7_out, east_11_out, east_15_out, north_15_out, north_14_out, north_13_out, north_12_out, west_12_out, west_8_out, west_4_out, of 8 bits each.

Among the inputs, group a) contains clk and the 16 bit reset, one reset bit for each test_node. Group b) consists of the 8 bit data inputs on the outer ports of the test_nodes in the 4 x 4 mesh network model.
Group c) collects the control signals, namely the request signals of neighboring nodes for data transfer. Each request signal is 1 bit wide. Group d) contains the status of the output ports of each test_node in terms of a busy signal.
Each test_node is associated with a 4 bit busy signal, one bit for each of the directions north, south, east and west. A busy bit is driven to 0 when the Tracker algorithm working within the test_node validates that the output path in the corresponding direction is available. Among the outputs, the outward facing ports of the outermost test_nodes are listed, each a data port of 8 bits.

Test_Node: The test node is shown in Fig. 4. It is the basic building block of the NoC architecture network model. It holds the information related to and required by all its surrounding neighbors. A typical test_node has four input data ports, each of 8 bits, named north_data, south_data, east_data and west_data, corresponding to the directions north, south, east and west respectively, along with the associated buffers. It also has a 4 bit request input, nsew_req, coming from the upstream nodes. The 4 bit nsew_busy signal gives the desired direction of propagation of the flit as proposed by the MOE algorithm running within the test_node. The test_node has four output data ports, each of 8 bits, named north_out, south_out, east_out and west_out, corresponding to the directions north, south, east and west respectively. To forward requests downstream there is a 4 bit signal o_nsew_req, and requests are acknowledged through the 4 bit nsew_ack signal. One path is selected and configured with only one receiving side and one output side. For example, if node 5 needs to receive data on its west side and output it to the north, the settings are such that the west request is 1 and the north busy signal is 0.

Fig. 4: Test_node

Inputs: north_data, south_data, east_data, west_data are of size 8 bits and nsew_req, nsew_busy are of size 4 bits.
Outputs: north_out, south_out, east_out, west_out are of size 8 bits and o_nsew_req, nsew_ack are of size 4 bits.
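The request / busy condition in the node 5 example can be modeled with a few lines of Python. This is only a behavioral sketch under stated assumptions: the bit order inside the 4 bit vectors (north as the most significant bit, west as the least significant) and the helper names `bit_of` and `can_forward` are hypothetical, since the text names the four directions but not their bit positions.

```python
# Bit positions inside the 4 bit nsew_* vectors. The order (north = MSB,
# west = LSB) is an assumption; the paper only names the four directions.
BIT = {"north": 3, "south": 2, "east": 1, "west": 0}


def bit_of(vector, direction):
    """Return the single bit of a 4 bit vector that belongs to a direction."""
    return (vector >> BIT[direction]) & 1


def can_forward(nsew_req, nsew_busy, in_dir, out_dir):
    """True when a request is pending on the input side and the output
    side is reported free (busy bit equal to 0)."""
    return bit_of(nsew_req, in_dir) == 1 and bit_of(nsew_busy, out_dir) == 0


# Node 5 receives on its west side and outputs to the north:
nsew_req = 0b0001   # only the west request bit is set
nsew_busy = 0b0111  # north is free; the other outputs are marked busy
print(can_forward(nsew_req, nsew_busy, "west", "north"))
```

Under this encoding the path is enabled only when both conditions from the example hold: the west request bit is 1 and the north busy bit is 0.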
According to the top level bit configuration, each node can receive and send the value.

Test_wrapper: The 4 x 4 network model above is supplied with data inputs, clock and resets, and its data outputs are observed, through a top level module named test_wrapper, which encloses the router_top module. It has clk, a 16 bit reset, an 8 bit data_in and an 8 bit data_out. It is shown in Fig. 5.

Fig. 5: Test_wrapper

The router_top module is instantiated within it. Data is sent into the west port of test_node 4 and routed according to the minimal odd-even routing (MOE) algorithm, applying the improved Tracker algorithm of the architecture shown in Fig. 3. The neighboring nodes are examined and their Cumulative Flit Count (CFC) values are observed on the way to test_node 15. The code is tested for one path (the configuration bits are generated according to the MOE algorithm). As depicted by the dark arrows in Fig. 2, the data is routed from the west port of test_node 4 to test_node 5, then north to test_node 9, then east to test_node 10, then east to test_node 11 and finally north to test_node 15. The data also has the option of diverting east from node 5, but due to the turn model restrictions of the MOE algorithm it could not then turn north from node 6 to node 10; it therefore follows the most conveniently available path, north to node 9. The test_wrapper is functionally verified with a test bench. After reset is applied for the first clock cycle, the desired response appears, as shown in the simulation result in Fig. 6. The test_wrapper code has been synthesized on a Xilinx Virtex-5 FPGA, and the resulting maximum clock frequency of 686.81 MHz is reported in the results section.
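The turn restriction invoked for node 6 can be checked with a short Python sketch of the odd-even turn model [4]: east-to-north and east-to-south turns are forbidden in even columns, and north-to-west and south-to-west turns are forbidden in odd columns. The node numbering (row by row from the bottom left, so that node n sits in column n mod 4) is inferred from Fig. 2 and the described path, and the names `FORBIDDEN`, `step_direction` and `path_is_legal` are hypothetical.

```python
WIDTH = 4  # 4 x 4 mesh, nodes numbered row by row starting at the bottom left

# Turns forbidden by the odd-even turn model, keyed by column parity.
# A turn is (direction of travel before the node, direction after it).
FORBIDDEN = {
    0: {("east", "north"), ("east", "south")},   # even columns
    1: {("north", "west"), ("south", "west")},   # odd columns
}


def column(node):
    return node % WIDTH


def step_direction(src, dst):
    """Direction of the single mesh hop src -> dst."""
    delta = dst - src
    if delta == 1 and column(src) < WIDTH - 1:
        return "east"
    if delta == -1 and column(src) > 0:
        return "west"
    if delta == WIDTH:
        return "north"
    if delta == -WIDTH:
        return "south"
    raise ValueError(f"{src} and {dst} are not mesh neighbors")


def path_is_legal(path):
    """Check every turn along the path against the odd-even rules."""
    hops = [step_direction(a, b) for a, b in zip(path, path[1:])]
    for node, turn in zip(path[1:], zip(hops, hops[1:])):
        if turn in FORBIDDEN[column(node) % 2]:
            return False
    return True


print(path_is_legal([4, 5, 9, 10, 11, 15]))   # path taken in the paper
print(path_is_legal([4, 5, 6, 10, 11, 15]))   # east-to-north turn at node 6
```

The first path only turns north in odd columns (nodes 5 and 11), whereas the second would need an east-to-north turn at node 6, which lies in an even column.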
IV. RESULTS AND DISCUSSIONS
The simulation result is shown in Fig. 6. As per the simulation, it took around 23 clock cycles for the data applied at the west input of source test_node 4 to reach the east port of destination test_node 15. Upon synthesis on Xilinx Virtex-5 (XC5VLX50T-2FF1136), a maximum clock frequency of 686.81 MHz is obtained.

Fig. 6: Functional Simulation of test_wrapper

Fig. 7: Synthesized top level block Test wrapper

V. CONCLUSION
Several output port selection techniques exist, based on the count of free virtual channels, the count of fluid buffers, the buffer occupancy time at reachable downstream neighbors, and flit flow history, the last being used by the algorithm named Tracker. Among these techniques, Tracker has performed significantly better than all the others. A hardware implementation of an improved adaptive NoC router with a flit flow history based load balancing selection strategy has been proposed. The proposed technique is an improved version of the Tracker architecture in which additional buffers are inserted in the existing Tracker internal logic to achieve a better area / performance trade-off for chip multiprocessors. The proposed work implements the improved flit flow history based technique on programmable hardware (Xilinx Virtex-5 FPGA) using Verilog HDL and achieves a maximum clock frequency of 686.81 MHz, as validated by experimental synthesis results.

REFERENCES
[1] W. Dally and B. Towles, Route packets, not wires: On-chip interconnection networks, in DAC, pp. 684-689, 2001.
[2] John Jose, K.V. Mahathi, J. Shiva Shankar and Madhu Mutyam, TRACKER: A Low Overhead Adaptive NoC Router with Load Balancing Selection Strategy, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 5-8, 2012, San Jose, California, USA.
[3] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., USA, 2003.
[4] G. M. Chiu, The odd-even turn model for adaptive routing, IEEE TPDS, vol. 11, no. 7, pp. 729-738, 2000.
[5] W. Dally, Virtual-channel flow control, IEEE TPDS, vol. 3, no. 2, pp. 194-205, 1992.
[6] G. Ascia, et al., Implementation and analysis of a new selection strategy for adaptive routing in NoC, IEEE TOC, vol. 57, no. 6, pp. 809-820, 2008.
[7] E. Nilsson, et al., Load distribution with the proximity congestion awareness in a network-on-chip, in DATE, pp. 1126-1127, 2003.
[8] A. E. Kiasari, et al., A framework for designing congestion-aware deterministic routing, in NoCArc, pp. 45-50, 2010.
[9] M. H. Cho, et al., Path-based, randomized, oblivious, minimal routing, in NoCArc, pp. 23-28, 2009.
[10] Y. C. Lan, et al., Fluidity concept for NoC: A congestion avoidance and relief routing scheme, in SoC, pp. 65-70, 2008.
[11] J. Jose, et al., BOFAR: Buffer occupancy factor based adaptive router for mesh NoCs, in NoCArc, pp. 23-28, 2011.
[12] J. Kim, et al., A low latency router supporting adaptivity for on-chip interconnects, in DAC, pp. 559-564, 2005.