Quality-of-service and error control techniques for mesh-based network-on-chip architectures

INTEGRATION, the VLSI journal 38 (2005)

Praveen Vellanki, Nilanjan Banerjee, Karam S. Chatha
Department of CSE, Arizona State University, P.O. BOX 87546, Tempe, AZ, USA

Received 16 June 2004; received in revised form 19 July 2004; accepted 21 July 2004

Abstract

Network-on-a-chip (NoC) has been proposed as a solution for addressing the design challenges of future high-performance system-on-chip architectures in the nanoscale regime. Many real-time applications require input data that arrives with low delay jitter. Such communication traffic can only be supported by incorporating multiple levels of service in the interconnection network. Further, as technology scales toward deep submicron, on-chip interconnects are becoming increasingly sensitive to noise sources such as power supply noise, crosstalk, and radiation induced effects, which are likely to reduce the reliability of data. Hence, effective error control schemes are required for ensuring data integrity. This paper addresses two important aspects of NoC architectures, quality of service and error control schemes, and makes the following contributions: (i) it presents techniques for supporting guaranteed throughput (for low delay jitter traffic) and best-effort traffic quality levels in a NoC router, (ii) it presents architectures for integrating error control schemes in the NoC router architecture, and (iii) it presents cycle accurate power and performance models of the two architecture enhancements for a mesh based NoC architecture.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Network-on-chip; Quality-of-service; Error-control; Power consumption; Performance

Corresponding author. Department of Computer Science and Engineering, Arizona State University, Brickyard Suite 51, 699 South Mill Avenue, Tempe, AZ 85281, USA.
doi:10.1016/j.vlsi

1. Introduction

The physical characteristics of nanoscale technologies will pose several challenges to system-on-chip (SoC) designers. Global signal delays will span multiple clock cycles [1,2]. Signal integrity will also be compromised due to increased RC effects, inductance, and cross-coupling capacitances [3]. Nanoscale packet switched networks, or networks-on-chip (NoC), have been proposed as an architectural solution for SoC design in the nanoscale regime [4–8]. Packet switching supports asynchronous transfer of information. It provides extremely high bandwidth by distributing the propagation delay across multiple switches, thus pipelining the signal transmission. Packet switching networks also support error detection and correction schemes that can be applied towards improving the signal integrity. Quality of service (QoS) can be ensured by distinguishing between different types of traffic. In this paper, we present techniques for supporting multiple levels of service and error control schemes for a mesh based NoC architecture.

Fig. 1 plots the variation in packet latency for destinations that are uniformly 3 hops away in a 4 × 4 mesh based NoC architecture for a router with 4 virtual channels at an injection rate of .5 packets/cycle/node. The x-axis denotes the latency of the packets, and the y-axis denotes the number of packets. The mean latency lies close to the peak of the plot. However, there are a large number of packets (214, over 5%) that experience a transmission latency that is more than double the average latency. Such a large variance in latency is unacceptable for many NoC implementations, such as traffic between a cache and lower level memory, or between different processing elements of a multimedia application. We present techniques for supporting both low jitter guaranteed throughput and best-effort traffic in a NoC router. Cycle accurate power and performance models for trade-off analysis of the two techniques are also presented.

Fig. 1. Variation in packet latency (x-axis: latency in cycles; y-axis: number of packets; BE traffic).

In the nanoscale regime, crosstalk on long global wires will be a major source of errors. Switching activity on aggressor links can cause errors by either forcing a logic transition on a stable victim link or by delaying the transition on a switching victim link. Both these instances result in capture of an incorrect logic level at the receiver. A number of error control schemes [9] have been proposed for general communication networks. In a NoC architecture, due to the stringent performance and power constraints, low complexity and low power error control schemes are desirable. Hence, we have implemented two low overhead error control schemes: single error detection and retransmission (PAR), and single error correction (SEC). We also present power and performance trade-offs of the two schemes under a variable traffic profile.

The trade-off between performance and power consumption of the interconnection network is a key question. The performance of the nanoscale interconnection network can be specified by the average latency of sending a message through the network, and by the bandwidth of the network. The power consumption of the network consists of the dynamic and leakage power consumption of the various components. This paper also presents results for power versus performance trade-off analysis for different service levels of traffic and error control schemes.

We integrated the QoS and error control schemes into a VHDL based cycle accurate power and performance model of the NoC architecture. The model is a parameterized register transfer level (RTL) design of the NoC architecture elements. The design is parameterized on (i) size of packets, (ii) length and width of physical links, (iii) number and depth of virtual channels, and (iv) switching technique. The model is annotated with delay, dynamic, and leakage energy estimates of the various components. The model can estimate the latency, throughput, and dynamic and leakage power consumption of a NoC architecture. The RTL design for the QoS and error control circuitry was synthesized and the SPICE level netlist was extracted from the layout. The design was then characterized for delay, and for dynamic and leakage power consumption, in 0.18 µm technology. The characterized values were integrated into the VHDL based RTL design to build the cycle accurate performance model.

The paper is organized as follows: Section 2 discusses the previous work, Section 3 gives a quick overview of the NoC architecture and the cycle accurate performance model, Section 4 discusses the QoS schemes, Section 5 discusses error control techniques, Section 6 discusses the packet format and protocol, Section 7 presents the experimental results, and Section 8 concludes the paper.

2. Previous work

In recent years a number of researchers have proposed architectures, performance evaluation techniques and optimization approaches for NoC. This section classifies and presents the existing research under four categories: seminal work, router architectures, performance models, and automated optimization approaches. Our paper discusses innovative router architectures for supporting guaranteed throughput and error control schemes in mesh based on-chip interconnection networks, and presents power and performance evaluation models for the same. The work presented in our paper can be classified under both router architectures and performance models. Hence, in the following section we compare and contrast our work with existing techniques in both categories.

2.1. Seminal work

Guerrier et al. [4] presented a NoC design called SPIN that was based on a fat-tree topology. They also presented the router architecture and a cycle accurate performance model for their NoC design. Sgroi et al. [5] discussed a platform based SoC design methodology that proposed the inclusion of a NoC for supporting on-chip communication. Dally et al. [6] demonstrated the feasibility of the NoC and estimated that the NoC places an area overhead of 6.6%. Benini et al. [7], in their conceptual paper on NoC, predict that packet switched on-chip interconnection networks will be essential to address the complexity of future SoC designs. Kumar et al. [8] presented a conceptual system-level architecture that allowed a mesh-based NoC to accommodate large resources such as memory banks, FPGA areas, or high performance multi-processors. With the exception of Guerrier et al. [4], none of the above works presented detailed architectures or performance models. We will address [4] in more detail when we discuss NoC architectures and performance models.

2.2. NoC architectures

Several researchers have proposed architectures, and related optimizations, for on-chip interconnection networks. We classify the related research on NoC architectures based on the supported levels of traffic service classes, error control schemes, and power optimizations.

2.2.1. Architectures for best effort traffic

In this paragraph we review the NoC architectures that support only the best effort traffic class. SPIN [4,10,11] was one of the seminal works to propose a detailed NoC architecture built with a fat tree topology. Proteo [12,13] is a VSIA-compliant NoC architecture that can be configured for ring, star, and bus topologies. Xpipes [14] is a parameterized router architecture that can be utilized in arbitrary NoC topologies. As shown in Fig. 1, the best effort traffic class is limited by a large deviation in latency, which is not desirable for many real-time applications. In this paper we present a technique for supporting low jitter guaranteed throughput traffic.

2.2.2. Architectures for guaranteed throughput traffic

Nostrum [15,16] is a protocol stack for mesh based NoC architectures that supports both best effort and guaranteed throughput traffic classes. Nostrum ensures bandwidth for guaranteed throughput traffic by reserving time slots, called looped containers, for its transmission on inter-router links. If no guaranteed throughput traffic is injected into the network, the time slots are not utilized. In contrast, we support guaranteed throughput traffic by reserving a certain number of virtual channels (buffers). Hence, if no guaranteed throughput traffic is injected into the network, the best effort traffic can be transported with maximum bandwidth. AEthereal [17,18] is also a mesh based NoC architecture; it supports guaranteed throughput traffic by utilizing a centralized scheduler for allocation of link bandwidth. Our architecture utilizes a distributed scheme where the traffic producer sets up a guaranteed throughput connection by reserving virtual channels, transfers the data, and then tears down the connection by giving up the virtual channels. Finally, neither of these two works presented detailed results for the performance and power consumption of their respective architectures.

2.2.3. Architectures with error control schemes

Bertozzi et al. [19] presented power versus performance results for point-to-point error control in an on-chip bus protocol based on the AMBA bus. Their work did not address NoC architectures, and did not consider the influence of network traffic on the performance of the error control schemes. Zimmer et al. [20] presented a fault model for NoC architectures. They also proposed a QoS scheme that treated control traffic with higher reliability than data traffic. In contrast, our paper presents a QoS scheme for guaranteed throughput and best-effort traffic. The performance and power consumption of the error control schemes in the presence of variable traffic profiles for a mesh-based NoC architecture are also discussed.

2.2.4. Architectural optimizations for low power

Worm et al. [21] proposed an adaptive low power transmission scheme for NoC that minimized the voltage swing and frequency subject to the workload requirement. Chen et al. [22] proposed a power-aware buffer policy that minimized the leakage power consumption in virtual channels. Simunic et al. [23] proposed a system-level power reduction scheme for SoC architectures with on-chip interconnection networks. Their scheme applied dynamic voltage management and dynamic voltage scaling policies based on both local and global workload information. Our work is focused on architecture extensions and performance models for supporting guaranteed throughput and error control schemes.

2.3. Performance evaluation

Innovative performance evaluation models are required to address the design challenges of NoC based interconnection architectures. Although there are a number of models for network performance evaluation [24–27], these models do not consider the power consumption characteristics. Current system level performance evaluation tools [28–30] are targeted towards shared bus architectures and do not consider interconnection networks.
Traditional solutions for on-chip global communication include models for various shared-bus [31–33] and ad hoc point-to-point interconnections. Wassal et al. [34] proposed system-level performance and power models for a shared-memory internet protocol/asynchronous transfer mode switching fabric. Ye et al. [35] analyzed the power consumption in the switch fabrics of network routers and proposed system-level models for the same. Pamunuwa et al. [36] performed a system level analysis and estimated the wiring overhead and the gate count for implementing a mesh-based NoC architecture. They also estimated the power consumption by assuming switching activity on 5% of the gates. Wang et al. [37] proposed a power-performance simulator for interconnection networks called Orion. None of these models incorporates QoS or error control schemes. Bolotin et al. [38] proposed analytical models for system-level performance and cost estimation of NoC architectures. They did not address the power consumption in NoC.

2.4. Automated design techniques

In the recent past, researchers have begun to address the problem of synthesizing custom NoC architectures and mapping communication traffic onto them. Pinto et al. [39] presented a quadratic programming based approach for synthesis of custom NoC architectures. Hu et al. [40] presented an integrated task and communication scheduling approach for mapping applications on mesh-based NoC architectures. Murali et al. [41] presented a technique for bandwidth constrained mapping of cores to mesh based NoC architectures. As opposed to the synthesis techniques, this paper focuses on architectural extensions and performance models.

3. NoC architecture and characterization

In the following paragraphs we describe the architecture of the various NoC elements (physical links, routers), and the techniques applied for their characterization.

3.1. Physical links

The physical links include the data and control wires for communication between two router elements of the interconnection network.

3.1.1. Characterization of physical links

The power and performance of a physical link is determined by its width (number of bits of data and control signals), its length, and the capacitive load of the router. In nanoscale technologies, individual wires are modeled by distributed RLC expressions for accurate description of their physical characteristics [42]. The RLC and cross-coupling capacitances of the interconnection model were obtained from the Berkeley Predictive Technology Model website [43]. We characterized the links in sets of three wires, two wires, and a single wire, respectively, for 0.18 µm technology. The three and two wire sets included the distributed RLC effects and cross-coupling capacitances, while the single wire model only included the distributed RLC effects. We considered three different types of links: local (≤1 mm), intermediate (>1 mm and ≤4 mm), and global (>4 mm) [1]. We obtained energy values for 64 (8 × 8), 16 (4 × 4) and 4 (2 × 2) different switching combinations for the three, two and single wire sets, respectively.
The wire lengths were incremented in steps of 1 mm up to 1 mm, steps of 5 mm up to 4 mm, and steps of 1 mm up to 5 mm. Table 1 summarizes the switching energy consumed in 0.18 µm technology for three wire-set switching for 1, 1 and 5 mm, respectively.

3.1.2. Performance evaluation of physical links

We included the link characterization values as a table in our performance model. The energy consumed by an n-bit wide link can be calculated from the energy consumed by the three, two and single wire sets of similar length. For example, consider the 9-bit (odd) wide link shown on the left-hand side of Fig. 2. Adjacent three wire sets share one wire, so the energy of each shared wire would otherwise be counted twice. The total switching energy consumed by the link can therefore be calculated by adding the switching energy consumed by the three wire sets S0, S1, S2 and S3, and subtracting the energy consumed by the shared single wires A, B, and C. In the case of the 8-bit (even) wide link shown on the right-hand side of Fig. 2, S3 is a two wire set and its energy is included in the calculation in place of a fourth three wire set. The length of the physical link, which is a major factor in determining its power consumption and performance, is specified by the designer.
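The composition rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the callback interface, and the additive toy energy model (`C_SELF`, `C_COUPLE`) are hypothetical stand-ins for the characterized lookup tables, which the paper does not list in usable form.

```python
from typing import Callable, Sequence, Tuple

Transition = Tuple[int, int]  # (old bit, new bit) on one wire
EnergyFn = Callable[[Tuple[Transition, ...]], float]


def link_switching_energy(transitions: Sequence[Transition],
                          e3: EnergyFn, e2: EnergyFn, e1: EnergyFn) -> float:
    """Compose the energy of an n-bit link from 3-, 2- and 1-wire lookups.

    Three-wire sets are laid over wires (0,1,2), (2,3,4), ... so that
    neighbouring sets share one wire. For an even width the last wire is
    covered by a two-wire set. Each shared wire is subtracted once.
    """
    n = len(transitions)
    if n < 3:
        raise ValueError("this sketch assumes a link of at least three wires")
    total = 0.0
    start = 0
    while start + 2 < n:                      # overlapping three-wire sets
        total += e3(tuple(transitions[start:start + 3]))
        start += 2
    if start == n - 2:                        # even width: trailing two-wire set
        total += e2(tuple(transitions[start:start + 2]))
    for shared in range(2, n - 1, 2):         # wires counted twice (A, B, C, ...)
        total -= e1((transitions[shared],))
    return total


# Hypothetical toy model (NOT the characterized values): a self term per
# toggling wire plus a coupling term per adjacent pair of wires.
C_SELF, C_COUPLE = 1.0, 0.5


def toy_set_energy(wires: Tuple[Transition, ...]) -> float:
    deltas = [new - old for old, new in wires]
    self_energy = C_SELF * sum(abs(d) for d in deltas)
    coupling = C_COUPLE * sum((a - b) ** 2 for a, b in zip(deltas, deltas[1:]))
    return self_energy + coupling
```

With an additive toy model the composed value equals the energy of the whole link computed directly, which makes the double-counting correction easy to check. With real characterized tables the composition is an approximation, since coupling beyond the nearest neighbour is not captured.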

Table 1. Three wire-set characterization: switching energy (in fJ) for the three-wire switching combinations at three wire lengths.

Fig. 2. Performance evaluation of links (left: odd number of links; right: even number of links). Total Energy = E(S0) + E(S1) + E(S2) + E(S3) − E(A) − E(B) − E(C).

3.2. The NoC router

A router architecture that can be utilized in a 2D mesh topology is shown in Fig. 3. The router consists of five unit routers to communicate in the X-minus, X-plus, Y-minus, and Y-plus directions, and with the processor. Unit routers inside a single router are connected through a 5 × 5 crossbar. Data is transferred across routers, or between the processor and the corresponding router, by an asynchronous handshaking protocol. A single unit router is highlighted in the lower half of Fig. 3. It consists of input and output link controllers, virtual channels, a header decoder and an arbiter. Data arrives at an input virtual channel of a unit router from either the previous router or the processor connected to the same router. The header decoder decodes the header flit of the packet after receiving data from the input virtual channel, decides the packet's destination direction (X−, X+, Y−, Y+, or processor), and sends a request to the arbiter of the unit router in the corresponding direction. Once the grant is received, the header decoder starts sending data from the input to the output virtual channel through the crossbar. The complete architecture and the detailed implementation can be found in [44].

Fig. 3. Router architecture (top: router with five unit routers, each containing a link controller, header decoder, arbiter and FIFO, connected through the crossbar; bottom: a single unit router with input and output link controllers, error decoder and encoder, GT and BE virtual channels in the input and output FIFOs, header decoder, and arbiter).

We designed RTL models for each of the components separately. The larger components were characterized in terms of unit components such as the unit full adder, 2-bit comparator, 2:1 1-bit multiplexer, D flip-flop, and logic gates. SPICE netlists for 0.18 µm technology were extracted for each component and characterized for energy and performance (shown in Table 2). The power consumption of the entire router architecture is computed by including the characterized energy values as table lookups in the RTL model.

Table 2. Unit components

Unit full adder / 2-bit comparator: 1-bit flip at the output, .96 pJ; output transition, .15 pJ; 2-bit flip at the output, .168 pJ; input change but no output change, .78 pJ; input change but no output change, .552 pJ; leakage, .77 fJ; leakage, .438 fJ.
2:1 multiplexer: output transition, .61 pJ; input change but no output change, [...] fJ; leakage, .13 fJ.
D flip-flop: output transition, .189 pJ; input change but no output change, .14 pJ; leakage, .34 fJ.
Nand gate: output transition, .312 pJ; input change but no output change, .117 fJ; leakage, .26 fJ.
Xor gate: output transition, .675 pJ; input change but no output change, .159 pJ; leakage, .126 fJ.

4. Quality-of-service schemes

In this section we describe the QoS schemes that are supported by our architecture, and their performance and power characterization. The NoC architecture supports two levels of service: best effort (BE) and guaranteed throughput (GT). Each packet is divided into multiple flits. The flit is the unit of transfer between two routers. The packets are routed by a deterministic dimension ordered source routing strategy.
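Dimension ordered routing on a 2D mesh resolves the X offset completely before moving in Y. A minimal Python sketch is given below; the function name `xy_route`, the coordinate convention, and the direction labels are illustrative assumptions and are not taken from the VHDL model.

```python
from typing import List, Tuple


def xy_route(src: Tuple[int, int], dst: Tuple[int, int]) -> List[str]:
    """Return the hop directions from src to dst, X dimension first.

    Coordinates are (x, y) router positions in the mesh. The list is
    computed once at the source, mirroring a source-routed packet.
    """
    x, y = src
    dst_x, dst_y = dst
    hops: List[str] = []
    while x != dst_x:                 # consume the X offset first
        step = 1 if dst_x > x else -1
        x += step
        hops.append("X+" if step > 0 else "X-")
    while y != dst_y:                 # then consume the Y offset
        step = 1 if dst_y > y else -1
        y += step
        hops.append("Y+" if step > 0 else "Y-")
    return hops
```

For example, `xy_route((0, 0), (2, 1))` yields two hops in X+ followed by one hop in Y+. Because a packet never turns from Y back to X, cyclic channel dependencies cannot form.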
This deadlock free strategy first transmits the packet in the X-dimension until the x-offset is zero, and then transmits the packet in the Y-dimension. Both service levels ensure guaranteed and in-order delivery of packets. In the following paragraphs we first describe the BE service level, and then the GT service level.

4.1. Best effort traffic service level

The BE traffic service level packets are injected by the processor from the input queue into the input virtual channel of the router if the channel is not full. The processor checks the full signal before injecting the packet. Inside the network, the same strategy is followed to transmit each flit of the packet from the output virtual channel of one router to the input virtual channel of the neighboring router. Such a transmission strategy acts as an explicit hop-to-hop flow control mechanism and, together with the dimension ordered routing, ensures guaranteed and in-order delivery of packets. There is a round robin priority based scheduling mechanism for each of the following tasks: (a) selection of an input virtual channel by the header decoder; (b) selection of an output virtual channel by the arbiter; (c) grant of the crossbar to the header decoder by the arbiter; (d) selection of the output virtual channel by the link controller. In all the above decision mechanisms the scheduler is invoked (i) if the packet is partially transmitted and blocked, or (ii) after complete transmission of each packet. Since all the packets are of the same size, the BE round robin priority scheme approximates the theoretically optimal, work-conserving generalized processor sharing (GPS). The GPS scheme provides fair allocation of link bandwidth to all the packets.

4.2. Guaranteed throughput traffic service level

Many applications demonstrate bursty traffic behavior that must be transmitted from source to destination with a required throughput and low jitter. Examples are traffic between a cache and lower level memory, or between various processing blocks of a multimedia processing engine. As demonstrated in Fig. 1, the BE traffic service level is unable to support the desired QoS. We support guaranteed throughput traffic by dividing the virtual channels between the GT and BE service levels. The number of virtual channels assigned to each service level is a design parameter that is specified by the designer. In the case of heavy network load the GT traffic can be transmitted on the BE virtual channels, but not vice versa.
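The asymmetric sharing rule between the two channel groups can be sketched as follows. The class `VirtualChannelPool`, its methods, and the choice of a single round robin pointer are hypothetical simplifications for illustration; they do not reproduce the router's scheduler logic.

```python
from typing import List, Optional


class VirtualChannelPool:
    """Virtual channels of one unit router, split between GT and BE."""

    def __init__(self, num_vcs: int = 4, num_gt: int = 2) -> None:
        self.service: List[str] = ["GT"] * num_gt + ["BE"] * (num_vcs - num_gt)
        self.busy: List[bool] = [False] * num_vcs
        self._next = 0  # round robin pointer

    def allocate(self, traffic: str) -> Optional[int]:
        """Reserve a free channel for 'GT' or 'BE' traffic, or return None.

        GT traffic tries its own channels first and may then borrow a BE
        channel; BE traffic may never use a GT channel.
        """
        allowed = ["GT", "BE"] if traffic == "GT" else ["BE"]
        n = len(self.busy)
        for group in allowed:
            for offset in range(n):           # round robin scan of the group
                vc = (self._next + offset) % n
                if self.service[vc] == group and not self.busy[vc]:
                    self.busy[vc] = True
                    self._next = (vc + 1) % n
                    return vc
        return None

    def release(self, vc: int) -> None:
        self.busy[vc] = False
```

In a pool with four channels, two of them GT, a third concurrent GT request is served from a BE channel, while a BE request is refused once both BE channels are taken even if a GT channel is idle.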
The round robin service mechanism is modified to give priority to the GT traffic over the BE traffic. Within each of the two service levels, every virtual channel gets equal priority. The GT traffic is always transmitted as a stream of packets with a designer specified fixed size. At the processor, the GT packets are queued until the stream size is reached. Once the desired stream size is reached, the GT protocol performs the following three steps: connection set-up, transmission, and tear-down. In the connection set-up, the virtual channels are reserved for the stream all the way from the source to the destination. The connection set-up stage might take a variable amount of time depending on the network load. Once the connection is set up, the stream can be transmitted with maximum throughput. After the entire stream has been transmitted, the reserved virtual channels are set free by the tear-down step. Since the GT traffic is always transmitted as a stream with maximum throughput, under-utilization of resources is prevented. Further, since the GT traffic is transmitted in discrete streams of fixed size, starvation of other GT traffic is also prevented. As the GT traffic can utilize virtual channels that are allocated for BE traffic, there is a possibility of starvation of BE traffic at high injection rates. However, as the experimental results will demonstrate, the starvation can be easily avoided by limiting the ratio of GT/BE traffic to around .25 for a router with 4 virtual channels (two virtual channels allocated to GT). This is not unrealistic, as only a small portion of the total network traffic is expected to be supported on the GT traffic class.

4.3. Architecture and characterization for QoS schemes

The basic router [44], which supports only the BE service level, has been enhanced to support multiple levels of service as shown in Fig. 3. The round robin priority based scheduling units present in the header decoder, arbiter and output link controller have been modified to give priority to channels transferring GT traffic. For instance, if there are N virtual channels per node in the router, and K of these N channels have been allocated to transmit GT traffic, then the schedulers assign priority to these K channels to transfer data. If GT traffic is not present, then BE packets are allocated resources in a round robin manner. The energy model for the modified architecture is implemented utilizing the unit components shown in Table 2.

5. Error control schemes

In the nanoscale regime, crosstalk in long global communication wires is expected to be the major source of errors. In this paper, we focus on the crosstalk errors in the links between the routers. The error control schemes are incorporated into the output and input link controllers: the output link controller includes the encoder, and the input link controller includes the corresponding decoder. Due to the strict constraints on latency and power consumption, we have implemented low overhead error control schemes. The two schemes that we implemented are PAR and SEC.

Single error detection and retransmission (PAR): The basic single bit parity check method is used to detect the error, and re-transmission of data is requested in the presence of an error. The main idea behind this scheme is to enable error recovery based on re-transmission.
The hardware overhead is negligible since the scheme requires only one extra bit of information per flit of data transfer. However, the latency per packet increases in case of retransmission.

Single error correction (SEC): The basic (15,11) Hamming code [9] implementation with a single error correction capability is utilized for this scheme. The decoder present in the input virtual channel controller of a router is more complex than the encoder at the output virtual channel controller, because of the correction circuitry. The hop-to-hop transmission of 11 bits of data requires 4 additional check bits.

5.1. Architecture and characterization of error control schemes

In our architecture, we have primarily concentrated on modification of the link controllers to incorporate the error model, as shown in Fig. 3. The data is encoded at the output link controller and is subsequently decoded at the input link controller before progressing through the next router towards its destination. This hop based error detection and correction mechanism allows strong error control. The functionality and characterization of the link controllers are described below.
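The two coding schemes can be illustrated with a short Python sketch. It assumes even parity for PAR and the textbook bit layout for the (15,11) Hamming code (check bits at positions 1, 2, 4 and 8); the paper does not state which parity polarity or bit ordering the hardware uses, and all function names here are hypothetical.

```python
from typing import List, Tuple

PARITY_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = tuple(p for p in range(1, 16) if p not in PARITY_POSITIONS)


def add_parity(flit: List[int]) -> List[int]:
    """PAR encoder: append one even-parity bit to the flit."""
    return flit + [sum(flit) % 2]


def parity_ok(word: List[int]) -> bool:
    """PAR decoder: True if no error is detected (else request retransmit)."""
    return sum(word) % 2 == 0


def syndrome(code: List[int]) -> int:
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s


def hamming_encode(data: List[int]) -> List[int]:
    """SEC encoder: 11 data bits -> 15-bit codeword."""
    if len(data) != 11:
        raise ValueError("expected 11 data bits")
    code = [0] * 15
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos - 1] = bit
    s = syndrome(code)
    for p in PARITY_POSITIONS:        # set check bits so the syndrome becomes 0
        if s & p:
            code[p - 1] = 1
    return code


def hamming_decode(code: List[int]) -> Tuple[List[int], int]:
    """SEC decoder: correct at most one flipped bit, return (data, syndrome)."""
    code = list(code)
    s = syndrome(code)
    if s:                             # non-zero syndrome is the error position
        code[s - 1] ^= 1
    return [code[pos - 1] for pos in DATA_POSITIONS], s
```

The sketch also makes the trade-off visible: parity only flags an odd number of bit errors and relies on retransmission, whereas the Hamming decoder repairs any single bit error in place at the cost of four check bits per 11 data bits.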

Single error detection and retransmission (PAR): This scheme is implemented as shown in Fig. 4. The input link controller has 2 states, S0 and S1. S0 represents the idle state, in which the state machine waits for a req from the output virtual channel of the neighboring router. Once it receives a req from the output virtual channel, it checks the output of the parallel error detection circuitry. In the absence of an error, it goes to S1, raising the ack signal and also the write signal (to its own input FIFO). In state S1, it lowers the write signal and stays in this state as long as the req signal remains high. Once the req signal is lowered, it returns to S0, lowering the ack signal in the transition. In the presence of an error, it shifts to state S1, raising the ack signal to the previous output link controller. However, it keeps the write signal low in this case and waits for the req signal to go low to shift back to S0, while raising the re-transmit signal. The output link controller has a complementary state sequence, as shown in Fig. 5. The characterized energy values for both the link controllers are also shown in Figs. 4 and 5.

Single error correction (SEC): This scheme is similar to the above scheme, with the controllers having 2 states each. The difference lies in the state S0 of the input link controller, where an error detection leads to a subsequent correction of the error before shifting to state S1. The characterized values of both the link controllers are similar to those shown in Figs. 4 and 5. We characterize the PAR and the SEC circuitry in terms of unit xor gates (energy values shown in Table 2).

Fig. 4. Input link controller. State diagram with states S0 and S1. Transition labels (input / output): error = '0', REQ = '1' / ACK = '1', write = '1', or error = '1', REQ = '1' / ACK = '1', write = '0'; REQ = '0' / ACK = '0', write = '0'; REQ = '1' / ACK = '1', write = '0'; REQ = '0' / ACK = '0', write = '0', retransmit = error. Annotated energies: .275 pJ, .24 pJ, .24 pJ, .225 pJ. Leakage energy value for the circuit = .6 fJ.

Fig. 5. Output link controller. State diagram with states S0 and S1 (ivc = input virtual channel, ovc = output virtual channel). Transition labels (input / output): ACK = '0' and ivc != full and ovc != empty / REQ = '0', read = !retransmit; ACK != '0' / REQ = '0', read = '0'; ACK = '0' / REQ = '1', read = '0'; ACK = '1' / REQ = '1', read = '0'. Annotated energies: .261 pJ, .9 pJ, .9 pJ, .219 pJ. Leakage energy value of the circuit = .2 fJ.

5.2. Error generation model

Hegde et al. [45] developed a model that treats noise from various sources in CMOS circuitry as a Gaussian source. The model has been applied towards error estimation in SoC architectures [19,46]. In the model, it is assumed that the gate input is in error when the noise voltage $V_N$ exceeds the gate decision threshold voltage $V_{th}$, which is defined as

$$V_{th} = \frac{V_{dd}}{2}.$$

The model assumes that a signaling waveform has a certain noise $V_N$ added on to it, and that $V_N$ has a normal distribution with a variance of $\sigma_N^2$ and a mean of 0. The probability of error is given by

$$\epsilon = Q\!\left(\frac{V_{dd}}{2\sigma_N}\right), \quad \text{where} \quad Q(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy$$

is the Gaussian pulse. We utilize the above model to generate errors in the individual wires of the NoC links.

6. Packet format and protocol

The message is partitioned into fixed length packets that are in turn broken down into flits for efficient data transfer. A packet consists of three kinds of flits (the header flit, the data flit and the tail flit) that are differentiated by two bits of control information. The header flit contains the destination router (X,Y) of the packet. The header flit also contains one additional bit to indicate whether it is a best effort or a guaranteed throughput packet.

7. Results

We performed design space exploration and performance versus power trade-off analysis for a 4 × 4 mesh topology of a NoC based interconnection network. Each unit router consisted of 4 virtual channels, with 2 channels each allocated to the GT and BE traffic service levels. The physical channels supported unidirectional communication with both data and control bits.
The International Technology Roadmap for Semiconductors (ITRS) predicts that the die size of future high-end SoC architectures will be around 22 mm × 22 mm. Kumar et al. [8] have made similar predictions. Hence, we assume a chip dimension of 20 mm × 20 mm and consider the inter-router links to be 4.5 mm long. In our experiments, the simulator generated two varieties of traffic to random destinations: uniformly distributed traffic and Poisson distributed traffic. The traffic was injected through the 16 processors by utilizing a uniform/Poisson distribution over a designer-specified time interval. In our architecture, due to the asynchronous communication protocol, it takes two clock cycles to transfer each flit. The network was allowed to stabilize for the first 1 cycles, after which it was run for 1, clock cycles. At the end of 1, clock cycles the total number of packets reaching the destination, their acceptance rate, and their latencies were calculated. The acceptance rate is the number of packets received at the destination per cycle per node. The average dynamic and leakage power consumption of the various components was also calculated over 1, clock cycles. The clock width was assumed to be 3 ns.

In the following plots, we distinguish between queue and network latency. The queue latency denotes the amount of time spent by the packet in the source queue after its generation and before its injection into the network. The network latency denotes the time required by the packet to travel from source to destination. The total latency of the packet is the sum of the queue and network latencies. Additionally, for the GT traffic packets, we consider the set-up latency as the time required to reserve the virtual channels from source to destination.

The BE packets were assumed to consist of 5 flits. The GT packets also consisted of 5 flits, and the GT stream was assumed to be 15 packets long. At a particular injection rate, the numbers of GT and BE packets to be generated are specified as a ratio r = GT/BE. The queue latency of the GT traffic is calculated as the difference between the time when the total stream has been generated and the time when the stream is injected into the network.

Evaluation of QoS schemes

Fig. 6 plots the variation in network latencies of GT and BE traffic when the destination is 3 hops away from the source at an injection rate of .5 packets/cycle/node.
While the BE traffic experiences a wide spectrum of network latencies, the GT traffic latency spectrum has a sharp spike. This plot validates that our router is able to provide guaranteed low-jitter latency for GT traffic transmission.

Fig. 6. Spectrum for BE/GT (number of packets versus latency in cycles, for GT and BE traffic).

Figs. 7 and 8 plot the network latency of the BE and GT traffic as the injection rate is varied from .25 to .1, and r is varied from .25 to 1. As can be observed from the plots, for all values of r, the average network latency of the BE traffic increases as the injection rate is increased. The average BE network latency also increases with increasing r, since GT traffic is given priority over BE traffic. The average network latency for the GT traffic, on the other hand, remains almost constant.

Fig. 7. Network latency for BE (latency in cycles).
Fig. 8. Network latency for GT (latency in cycles).
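The queue, network, and total latencies defined for these plots can be computed from per-packet timestamps, as in the following Python sketch. The record fields (`generated`, `injected`, `delivered`, all in clock cycles) and the helper names are hypothetical. Using the spread (maximum minus minimum) of network latency as a jitter indicator is an assumption, since the source does not define jitter numerically. The sample values are invented for illustration and are not taken from the experiments.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PacketRecord:
    generated: int   # cycle at which the source created the packet
    injected: int    # cycle at which the packet left the source queue
    delivered: int   # cycle at which the packet reached its destination

    @property
    def queue_latency(self) -> int:
        return self.injected - self.generated

    @property
    def network_latency(self) -> int:
        return self.delivered - self.injected

    @property
    def total_latency(self) -> int:
        # Total latency is the sum of queue and network latency.
        return self.queue_latency + self.network_latency


def latency_summary(records):
    """Average the three latency terms and report the network-latency spread."""
    net = [r.network_latency for r in records]
    return {
        "avg_queue": mean(r.queue_latency for r in records),
        "avg_network": mean(net),
        "avg_total": mean(r.total_latency for r in records),
        "network_spread": max(net) - min(net),  # assumed jitter indicator
    }


# Invented sample: GT packets share one network latency, BE packets vary.
gt = [PacketRecord(0, 2, 12), PacketRecord(5, 6, 16), PacketRecord(9, 9, 19)]
be = [PacketRecord(0, 1, 9), PacketRecord(3, 10, 31), PacketRecord(4, 20, 55)]
gt_summary = latency_summary(gt)
be_summary = latency_summary(be)
```

With these invented samples the GT spread is zero while the BE spread is not, which is the qualitative shape of the latency spectrum described for Fig. 6.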

Figs. 9 and 10 plot the variation in queue latency of BE and GT traffic, respectively. The queue latency for BE traffic increases dramatically with rising injection rate and r. This observation is supported by the BE acceptance rate plot shown in Fig. 12. It should be noted that the BE queue latency soars to around 3 clock cycles for r = 1 and an injection rate of .1. The queue latency for GT traffic remains negligible at lower injection rates and low values of r, because fewer GT packets are generated by the processors and the resources can serve them without congestion. However, for higher values of r combined with higher injection rates, we observe a considerable increase in queue latency because of heavy network congestion among the GT traffic.

Fig. 9. Queue latency for BE (latency in cycles).
Fig. 10. Queue latency for GT (latency in cycles).

Fig. 11 plots the variation in connection set-up latency for the GT traffic. The set-up latency increases with both the injection rate and the ratio r. The increase has the smallest slope for r = .25. Figs. 12–14 plot the acceptance rates for BE, GT, and combined traffic, respectively. As can be seen from the plots, at a particular r value the BE acceptance rate initially increases with the injection rate. It peaks at around .5 injection rate, and then falls. The acceptance rate for GT traffic, however, increases linearly with the injection rate and r. The priority of GT traffic over BE traffic helps explain the variation in the BE acceptance rate. The combined network acceptance rate rises linearly with the injection rate before the network is congested, and is constant after congestion.

Fig. 11. Setup latency for GT (latency in cycles).
Fig. 12. Acceptance rate BE (packets/cycle/node).
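The acceptance rate plotted in these figures is defined in the experimental setup as the number of packets received at the destination per cycle per node. A minimal Python sketch of that arithmetic follows. The function names and the sample counts are hypothetical and chosen only to make the arithmetic easy to check; they are not measured results.

```python
def acceptance_rate(packets_received: int, cycles: int, nodes: int) -> float:
    """Packets received at their destinations per cycle per node."""
    if cycles <= 0 or nodes <= 0:
        raise ValueError("cycles and nodes must be positive")
    return packets_received / (cycles * nodes)


def combined_acceptance_rate(be_received: int, gt_received: int,
                             cycles: int, nodes: int) -> float:
    """Combined rate: both service levels counted over the same window."""
    return acceptance_rate(be_received + gt_received, cycles, nodes)


# Invented counts for a 16-node mesh observed over 1000 cycles.
be_rate = acceptance_rate(600, cycles=1000, nodes=16)
gt_rate = acceptance_rate(200, cycles=1000, nodes=16)
total_rate = combined_acceptance_rate(600, 200, cycles=1000, nodes=16)
```

Because the BE and GT counts share one observation window, the combined rate is simply the sum of the two per-class rates.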

Fig. 13. Acceptance rate GT (packets/cycle/node).
Fig. 14. Acceptance rate BE and GT (packets/cycle/node).

Figs. 15 and 16 plot the variation in average dynamic and leakage power of the NoC, respectively, for varying injection rates and r. The dynamic power consumption closely follows the combined BE and GT acceptance rate plot shown in Fig. 14: at higher acceptance rates the dynamic power consumption is high, and vice versa. Also, the peaks in the dynamic power consumption plots are mirrored by troughs in leakage power consumption, and vice versa. The virtual channel buffers are the main contributors to both dynamic and leakage power consumption in the NoC.

Fig. 15. Dynamic power BE/GT (power in mW).
Fig. 16. Leakage power BE/GT (power in mW).

Fig. 17 plots the power consumed by the buffers at .5 injection rate. The power consumption of the GT virtual channel buffers increases as the GT/BE ratio increases from .25 to 1, since the utilization of the GT virtual channels increases with r. However, for a GT/BE ratio of .5 the power consumption of the BE virtual channel buffers exceeds that of the GT virtual channel buffers. This is because the BE virtual channels can be used to transfer GT traffic but not vice versa.

Fig. 17. FIFO power BE/GT (power in mW; series Fifo_BE and Fifo_GT).

The power consumption of the individual components of the router network at an injection rate of .5 for the different values of r is shown in Fig. 18. It can be seen from the plots that the virtual channel buffers are the dominant consumers of total power. It can also be observed that the header decoders, arbiter, and link controllers contribute significantly to the total power consumption.

Fig. 18. Component power (power in mW; components: FIFO, header decoder, arbiter, crossbar, virtual controllers, links).

Figs. 19–29 show similar plots for the router network under the Poisson traffic distribution. It should be noted that the latencies, acceptance rates, and power consumption for the Poisson traffic model are very similar to those of the uniform random traffic model. This indicates that our router design can effectively support both kinds of traffic profiles.
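The sharing rule noted in the buffer power discussion (BE virtual channels may carry GT traffic, but GT virtual channels never carry BE traffic), together with the split of 2 GT and 2 BE channels from the experimental setup, can be sketched as a small allocator in Python. The preference order (a GT packet tries a GT channel first), the channel indices, and the names `VirtualChannelPool`, `allocate`, and `release` are assumptions for illustration; the source does not describe the allocator's internals.

```python
class VirtualChannelPool:
    """Four virtual channels per router: 2 reserved for GT, 2 for BE."""

    GT_CHANNELS = (0, 1)   # assumed indices
    BE_CHANNELS = (2, 3)   # assumed indices

    def __init__(self):
        self.busy = {}  # channel index -> traffic class currently using it

    def allocate(self, traffic_class: str):
        """Return a free channel index for 'GT' or 'BE', or None if none fits."""
        if traffic_class == "GT":
            # GT prefers its own channels but may borrow a BE channel.
            candidates = self.GT_CHANNELS + self.BE_CHANNELS
        elif traffic_class == "BE":
            # BE traffic is never placed on a GT channel.
            candidates = self.BE_CHANNELS
        else:
            raise ValueError(f"unknown traffic class: {traffic_class!r}")
        for channel in candidates:
            if channel not in self.busy:
                self.busy[channel] = traffic_class
                return channel
        return None

    def release(self, channel: int) -> None:
        """Free a channel once its packet or stream has passed."""
        self.busy.pop(channel, None)
```

Under this rule a burst of GT traffic can occupy BE buffers while BE traffic can never spill into GT buffers, which is consistent with the observation that BE buffer power can exceed GT buffer power at intermediate ratios.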

Fig. 19. Network latency for BE (Poisson).
Fig. 20. Network latency for GT (Poisson).

The following conclusion can be inferred from the extensive experimentation performed with our router architecture supporting multiple levels of service: for a low value of r = .25, the GT traffic experiences almost zero queue latency and a low set-up latency. The acceptance rate of BE traffic is also high in this case. Hence, a low value of r (around .25) should be utilized when designing a NoC with GT and BE traffic service levels.

Evaluation of error control schemes

We characterized the NoC for 0.18 μm technology and consider $V_{dd} = 1.8$ V. We evaluated the performance of the error control schemes by assigning the noise voltage standard deviation, $\sigma_N$, to .5 V [45] and .36 V, respectively. The corresponding bit error rates are .35 (high, H in plots) and .63 (low, L in plots), respectively. The ratio of GT/BE packets generated, r, was taken to be .25.

Fig. 21. Queue latency for BE (Poisson).
Fig. 22. Queue latency for GT (Poisson).

Fig. 30 plots the overall acceptance rate of the NoC under low and high error rates using both the PAR and SEC error control schemes. The acceptance rate for the PAR scheme is lower than for the SEC scheme at higher injection rates because of the latency involved in retransmission. At lower injection rates, the difference in acceptance rates between the two schemes diminishes because there is less traffic in the network.
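The bit error rates used in this evaluation follow from the Gaussian noise model of the error generation section, in which the error probability is $Q(V_{dd}/(2\sigma_N))$. The Python sketch below evaluates that expression with the standard identity $Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$ and then flips link wires independently at the resulting rate. Several points are assumptions: the transcribed `.5 V` and `.36 V` are read as 0.5 V and 0.36 V, the per-wire independent flip with a seeded generator is one plausible reading of "generate errors in the individual wires", and the 8-wire word and the names `q_function`, `bit_error_rate`, and `inject_errors` are invented for illustration.

```python
import math
import random


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bit_error_rate(v_dd: float, sigma_n: float) -> float:
    """Probability that the noise voltage exceeds the threshold V_dd / 2."""
    return q_function(v_dd / (2.0 * sigma_n))


def inject_errors(bits, error_rate, rng):
    """Flip each wire's bit independently with probability error_rate."""
    return [b ^ 1 if rng.random() < error_rate else b for b in bits]


rng = random.Random(0)                   # fixed seed so runs are repeatable
eps_high = bit_error_rate(v_dd=1.8, sigma_n=0.5)    # assumed reading of .5 V
eps_low = bit_error_rate(v_dd=1.8, sigma_n=0.36)    # assumed reading of .36 V
flit = [1, 0, 1, 1, 0, 0, 1, 0]          # invented 8-wire link word
received = inject_errors(flit, eps_high, rng)
```

A larger $\sigma_N$ shrinks the argument of $Q$ and therefore raises the error rate, which is why the larger of the two noise settings corresponds to the "high" curves in the plots.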

Fig. 23. Setup latency for GT (Poisson).
Fig. 24. Acceptance rate BE (Poisson).

Fig. 31 plots the network latencies under various injection and bit error rates. The network latency is always higher for the PAR scheme due to the retransmission delay. This is reinforced by the overall acceptance plot in Fig. 30. The average latency is higher at high bit error rates because more flits are prone to error and are hence retransmitted.

Fig. 32 shows the network power consumption for low and high error rates using both the PAR and SEC schemes. The SEC power consumption at high injection rates is higher than that of PAR because of the higher acceptance rates for SEC. For low injection and low bit error rates, the power consumption of the SEC scheme is almost equal to that of the PAR scheme. However, the area consumed by the PAR implementation is lower than that of the SEC scheme, making it an attractive technique for error control in this case. For all other cases ({low bit error rate, high injection rate}, {high bit error rate, low injection rate}, {high bit error rate, high injection rate}), SEC is the preferred choice due to its high acceptance rates. Moreover, for the low injection and high bit error case, the power consumed by the retransmission circuitry offsets the power consumed by error correction. The results for the error control schemes are summarized in Table 3, which shows the appropriate error control scheme under different injection and bit error rates.

Fig. 33 shows the leakage power consumption for low and high error rates using both the PAR and SEC schemes. Leakage power consumption is higher in the PAR scheme than in the SEC scheme, since its dynamic power consumption is lower, and vice versa.

Fig. 25. Acceptance rate GT (Poisson).
Fig. 26. Acceptance rate BE and GT (Poisson).
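The recommendation summarized in Table 3 reduces to a single rule: PAR is preferred only when both the injection rate and the bit error rate are low, and SEC is preferred otherwise. The Python sketch below encodes that rule. The boolean inputs and the name `choose_error_control` are hypothetical; the source gives no numeric thresholds separating "low" from "high", so none are assumed here.

```python
def choose_error_control(high_injection: bool, high_bit_error: bool) -> str:
    """Return 'PAR' only when both rates are low; otherwise 'SEC' (Table 3)."""
    if not high_injection and not high_bit_error:
        return "PAR"
    return "SEC"


# Enumerate the four operating regions covered by Table 3.
table3 = {
    (injection, bit_error): choose_error_control(injection == "high",
                                                 bit_error == "high")
    for injection in ("low", "high")
    for bit_error in ("low", "high")
}
```

A designer would still have to map measured injection and bit error rates onto the low/high classes before applying the rule.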

Fig. 27. Dynamic power BE and GT (Poisson).
Fig. 28. Leakage power BE/GT (Poisson).

8. Conclusion

In this paper, we presented a cycle-accurate performance and power evaluation model for BE and GT traffic with error correction/detection on a mesh-based NoC. We presented results for extensive design space exploration and performance versus power trade-off analysis of a 4 × 4 mesh architecture. The experimental results were presented for both uniform and Poisson traffic distributions. The results demonstrated that our architecture is able to provide excellent support for both GT and BE traffic schemes as long as the GT/BE traffic ratio is around .25. On the basis of their performance and power consumption characteristics, it was also shown that the PAR (single error control) scheme is better than the SEC (single error correction) scheme at low injection and low error rates. In all other circumstances the SEC scheme gives better performance.

The current version of the model is limited to mesh-based topologies supporting deterministic routing schemes and synthetically generated traffic. Future work will address developing router architectures and related power and performance models for generic topologies. Adaptive routing schemes will also be explored. Finally, design space exploration will be performed with communication traces of realistic benchmark applications that are mapped to NoC architectures.

Fig. 29. Component power (Poisson) (power in mW; components: FIFO, header decoder, arbiter, crossbar, virtual controllers, links).
Fig. 30. Acceptance rate PAR/SEC (packets/cycle/node versus injection rate; series SEC_H, Parity_H, SEC_L, Parity_L).
Fig. 31. Network latency PAR/SEC (latency in cycles versus injection rate; series SEC_H, Parity_H, SEC_L, Parity_L).
Fig. 32. Dynamic power PAR/SEC (power in mW versus injection rate; series SEC_H, Parity_H, SEC_L, Parity_L).

Table 3
Summary of error control schemes

                         Injection rate (low)    Injection rate (high)
Bit error rate (low)     PAR                     SEC
Bit error rate (high)    SEC                     SEC

Fig. 33. Leakage power PAR/SEC (power in mW versus injection rate in packets/cycle/node; series SEC_H, Parity_H, SEC_L, Parity_L).

References

[1] D. Sylvester, K. Keutzer, A global wiring paradigm for deep submicron design, IEEE Trans. Comput. Aided Design Integrated Circuits Systems (2000).
[2] R. Ho, K. Mai, M. Horowitz, The future of wires, Proc. IEEE (2001).
[3] J. Davis, D. Meindl, Compact distributed RLC interconnect models Part II: coupled line transient expressions and peak crosstalk in multilevel networks, IEEE Trans. Electron Devices 47 (11) (2000).
[4] P. Guerrier, A. Greiner, A generic architecture for on-chip packet-switched interconnections, in: DATE, Paris, France, March 2000.
[5] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, A. Sangiovanni-Vincentelli, Addressing the system-on-a-chip interconnect woes through communication-based design, in: Proceedings of Design Automation Conference, June 2001.
[6] W.J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in: Proceedings of DAC, June 2002.
[7] L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, IEEE Comput. (2002) 70–78.

[8] S. Kumar, A. Jantsch, M. Millberg, J. Oberg, J.P. Soininen, M. Forsell, K.T.A. Hemani, A network on chip architecture and design methodology, in: IEEE Computer Society Annual Symposium on VLSI, Pittsburgh, Pennsylvania, April 2002.
[9] S. Lin, D.J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, Englewood Cliffs, NJ.
[10] A. Andriahantenaina, A. Greiner, Micro-network for SoC: implementation of a 32-port SPIN network, in: DATE, Munich, Germany, March 2003.
[11] A. Andriahantenaina, H. Charlery, A. Greiner, L. Mortiez, C.A. Zeferino, SPIN: a scalable, packet switched, on-chip micro-network, in: DATE, Munich, Germany, March 2003.
[12] D. Siguenza-Tortosa, J. Nurmi, Proteo: a new approach to network-on-chip, in: Proceedings of IASTED International Conference on Communication Systems and Network, Malaga, Spain, 2002.
[13] D. Siguenza-Tortosa, J. Nurmi, VHDL-based simulation environment for Proteo NoC, in: High-Level Design Validation and Test Workshop, Paris, France, October 2002.
[14] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, L. Benini, Xpipes: a latency insensitive parameterized network-on-chip architecture for multi-processor SoCs, in: Proceedings of ICCD, San Jose, CA, October 2003.
[15] M. Millberg, E. Nilsson, R. Thid, S. Kumar, A. Jantsch, The Nostrum backbone - a communication protocol stack for networks on chip, in: VLSI Design Conference, Mumbai, India, January 2004.
[16] M. Millberg, E. Nilsson, R. Thid, A. Jantsch, Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip, in: DATE, February 2004.
[17] J. Dielissen, A. Radulescu, K. Goossens, E. Rijpkema, Concepts and implementation of the Philips network-on-chip, in: IP-Based SOC Design, November 2003.
[18] E. Rijpkema, K.G.W. Goossens, A. Radulescu, Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip, in: DATE, 2004.
[19] D. Bertozzi, L. Benini, G. De Micheli, Low power error resilient encoding for on-chip data buses, in: DATE, 2003.
[20] H. Zimmer, A. Jantsch, A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip, in: ISSS/CODES, 2003.
[21] F. Worm, P. Ienne, P. Thiran, G. De Micheli, An adaptive low-power transmission scheme for on-chip networks, in: Proceedings of ISSS, Kyoto, Japan, 2002.
[22] X. Chen, L.-S. Peh, Leakage power modeling and optimization in interconnection networks, in: Proceedings of ISLPED, Seoul, Korea, 2003.
[23] T. Simunic, S. Boyd, Managing power consumption in networks on chips, in: Proceedings of DATE, Paris, France, 2002.
[24] J. Duato, S. Yalamanchili, L. Ni, Interconnection networks, an engineering approach, IEEE Computer Society.
[25] H.J. Siegel, A model of SIMD machines and a comparison of various interconnection networks, IEEE Trans. Comput. 28 (12) (1979).
[26] W.J. Dally, Performance analysis of k-ary n-cube interconnection network, IEEE Trans. Comput. 39 (6) (1990).
[27] J.F. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multicomputer systems, J. Parallel Distributed Comput. 23 (1994).
[28] D. Brooks, V. Tiwari, M. Martonosi, Wattch: a framework for architectural-level power analysis and optimizations, in: International Symposium on Computer Architecture, 2000.
[29] W. Ye, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, The design and use of SimplePower: a cycle-accurate energy estimation tool, in: Proceedings of Design Automation Conference, June 2000.
[30] T. Givargis, F. Vahid, J. Henkel, Instruction-based system-level power evaluation of system-on-a-chip peripheral cores, IEEE Trans. VLSI 10 (6) (2002).
[31] Arm Inc., AMBA specification.
[32] IBM, The CoreConnect bus architecture.
[33] D. Wingard, MicroNetwork-based integration of SOCs, in: DAC, Las Vegas, Nevada, June 2001.
[34] A.G. Wassal, M.A. Hasan, Low-power system-level design of VLSI packet switching fabrics, IEEE Trans. CAD 20 (2001).
[35] T.T. Ye, L. Benini, G. De Micheli, Analysis of power consumption on switch fabrics in network routers, in: Proceedings of DAC, 2002.
[36] D. Pamunuwa, J. Oberg, L.R. Zheng, M. Millberg, A. Jantsch, H. Tenhunen, Layout, performance and power trade-offs in mesh-based network-on-chip architectures, in: IFIP International Conference on Very Large Scale Integration (VLSI-SOC), Darmstadt, Germany, December 2003.
[37] H.-S. Wang, L.-S. Peh, S. Malik, Orion: a power-performance simulator for interconnection network, in: International Symposium on Microarchitecture, Istanbul, Turkey, November 2002.
[38] E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, Cost considerations in network on chip, in: Integration, the VLSI Journal, November 2003.
[39] A. Pinto, L.P. Carloni, A.L. Sangiovanni-Vincentelli, Efficient synthesis of networks on chip, in: ICCD, 2003.
[40] J. Hu, R. Marculescu, Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints, in: DATE, Paris, France, February 2004.
[41] S. Murali, G. De Micheli, Bandwidth constrained mapping of cores onto NoC architectures, in: DATE, Paris, France, February 2004.
[42] P. Sotiriadis, A. Chandrakasan, A bus energy model for deep sub-micron technology, IEEE Trans. VLSI 10 (3) (2002).
[43] Berkeley predictive technology modeling, technical report.
[44] N. Banerjee, P. Vellanki, K.S. Chatha, A power and performance model for network-on-chip architectures, in: DATE, 2004.
[45] R. Hegde, N. Shanbhag, Towards achieving energy efficiency in presence of deep submicron noise, IEEE Trans. VLSI 8 (4) (2000).
[46] L. Li, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, Adaptive error protection for energy efficiency, in: ICCAD, 2003.