Bare-metal message passing API for many-core systems


Johannes Scheller
Institut Supérieur de l'Aéronautique et de l'Espace, Toulouse, France

Eric Noulard, Claire Pagetti and Wolfgang Puffitsch
ONERA-DTIM, Toulouse, France

This paper describes a bare-metal multiple instruction, multiple data message passing library for the Intel Single Chip Cloud Computer. The library's design, the derived notion of a global time across the cores, and the verification of the send and receive functions are highlighted. Finally, a use-case example implementing a pseudo AFDX network shows that the message passing performance is not limited by the mesh network but by the workloads of the cores.

I. Introduction

A. Context

Safety-critical embedded systems require a timing-predictable implementation in order to prove their correctness. For both cost and time reasons, embedded systems rely heavily on commercial off-the-shelf (COTS) hardware. Ever more complex chip designs and shared caches in commercially available chip multiprocessors (CMP) make it difficult, if not impossible, to ensure predictable timing behavior. For instance, the time to execute a simple memory access may vary greatly due to mechanisms such as pipelining, speculative and out-of-order execution, etc. [1] Many-core technology potentially holds the key to resolving some of these issues. These processors combine numerous cores with a Network-on-Chip (NoC) based communication network. This structure has various advantages. One of them, achieved by explicit message passing, is reducing conflicts and allowing full control over when cores communicate. Furthermore, the large number of cores on these chips allows dedicating a core, or a cluster of cores, to a single application. The Intel Single-chip Cloud Computer (SCC) is an experimental platform designed to evaluate many-core processors and hardware-assisted message passing using an on-die mesh network. [2]

B. Programming solutions for predictability

In addition to adding considerable overhead, the use of an operating system (OS) also introduces additional non-predictable mechanisms such as interrupts, preemption, etc. Bare-metal programming usually means programming directly on the underlying hardware, thereby avoiding this additional overhead. In addition, the programmer has full control over the execution of the application, resource usage, and the allocation on the chip. The user has the choice between a single program, multiple data (SPMD) and a multiple instruction, multiple data (MIMD) programming model. In SPMD programming, the same executable is launched on each core, whereas in the MIMD programming model each core is provided with its individual executable, thereby significantly reducing the memory footprint on each core. In addition, MIMD reduces the complexity of the individual programs run on the cores and increases predictability, as a core's application contains only the parts essential for that specific core.

Young graduate, address: johannes.scheller@gmail.com
Researcher, address: eric.noulard@onera.fr
Researcher, address: claire.pagetti@onera.fr
Post-doctoral research fellow, address: wolfgang.puffitsch@onera.fr

C. Related work

The work presented in this paper is based upon the BareMichael bare-metal framework. [3] BareMichael is a minimalistic bare-metal framework for the Intel SCC. The framework brings the cores from real into protected mode and sets up a flat memory space. Since version 6, BareMichael supports SPMD message passing using RCCEv2. RCCE is the message passing library provided by Intel. [4,5] It is based upon put and get primitives which conduct a memory copy from and to the desired memory address. RCCE was designed to run both under Linux and bare-metal. It uses a static SPMD model; therefore, the user has to specify the cores with which to communicate upon startup. The same executable is then launched on the specified cores. Since the RCCE message passing library only allows SPMD programs, it did not fit our demands. iRCCE, which was developed by RWTH Aachen, extends RCCE with non-blocking send and receive primitives and uses an improved memory copy algorithm. [6] Nevertheless, iRCCE still relies on a static SPMD programming model; hence it is not suitable for our purposes either. On the other hand, both the send and receive functions of our message passing library use the iRCCE put and get functions in order to copy data.

ET International (ETI) also offers a bare-metal framework for the Intel SCC. [7] ETI provides full libc support and also includes a simulation environment for Linux. Unfortunately, the ETI framework is closed source and only supports SPMD programs. The provided streaming API, which allows for inter-core communication, does not use the MPBs to transfer messages. Instead, the API uses the on-tile MPBs only to signal the transfer of a message; the messages themselves are transferred using the external DRAM. Since the ETI framework uses the MPB only for control and not for messages, and it only supports an SPMD programming model, it does not fulfill all our needs.

MetalSVM, [8] which was developed by RWTH Aachen, implements a bare-metal hypervisor on the Intel SCC. Their implementation goes beyond simply exchanging messages between independent applications: they create a shared virtual memory that can be accessed by the applications. Finally, Altreonic ported OpenComRTOS, [9] a network-centric real-time operating system, to the SCC.

D. Contribution

The BareMichael bare-metal framework (BMBMF) provides the ability to run bare-metal programs on the SCC. [3] Building upon BMBMF, this paper proposes a message passing library suitable for multiple instruction, multiple data (MIMD) applications, which enables the user to run independent code on each core. This significantly reduces the memory footprint on the cores. In order to ensure the correctness of the created library, the send and receive functions are modeled using UPPAAL, a tool for modeling and verifying networks of timed automata. [10,11]

The remainder of this paper is organized as follows. Section II takes a closer look at message passing on the Intel SCC. To that end, the SCC's architecture is detailed and the basic functions making up the message passing library are explained. The send and receive process between two communicating cores is illustrated, as is the message passing performance of the designed library. Section III describes the verification of the library using UPPAAL. Furthermore, the derivation of a global clock on all cores is presented. Finally, the dependence of the message passing performance on the mapping of senders and receivers on the NoC is analyzed.
This analysis is done by means of a case study based upon a pseudo Avionics Full DupleX switched Ethernet (AFDX) network.

II. A message passing library for the SCC

A. Brief description of the Intel SCC

The Intel SCC, shown in Figure 1(a), is a research processor consisting of 24 tiles arranged in a 6x4 matrix. The tiles are interconnected by a mesh network. Each tile holds two P54C cores, [12] which are derived from very basic second-generation Pentium processors. These P54C cores each have a private L1 (16 KB data and instruction cache) and L2 (256 KB) cache. In addition, they share a local memory buffer, the message passing buffer (MPB, 16 KB). Data in one tile's MPB can be accessed from any other tile. Requests to access the MPB of another tile are translated by the network and executed using simple (X,Y) routing. [13] Therefore, the MPBs can be seen as distributed shared memory with non-uniform access time. For data written to and read from the MPB, Intel has also introduced a special memory type, the message passing buffer type (MPBT). Data flagged as MPBT is not cached in the L2 cache. Furthermore, cache lines in the L1 cache flagged as MPBT can be invalidated using a special instruction, enabling users to force reads and writes.
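As a hedged illustration of how such a forced read might look in bare-metal C: the invalidation instruction added to the SCC's P54C cores is CL1INVMB, whose opcode bytes (0F 0A) are documented in the SCC EAS [2]; the helper names below are ours, not the library's API.

```c
#include <stdint.h>

/* Invalidate all MPBT-tagged lines in the L1 cache. CL1INVMB is the
 * instruction added to the SCC's P54C cores for this purpose; its
 * opcode bytes (0F 0A) are emitted directly, since compilers do not
 * know the mnemonic. */
static inline void mpbt_invalidate(void)
{
    __asm__ volatile (".byte 0x0f, 0x0a" ::: "memory");
}

/* Force a fresh read of an MPBT-flagged byte: drop any stale L1 copy
 * first (the L2 cache is bypassed for MPBT data anyway). */
static inline uint8_t read_mpbt_flag(volatile const uint8_t *flag)
{
    mpbt_invalidate();
    return *flag;
}
```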

Each tile also contains the mesh interface unit (MIU), which translates between the core's bus interface and the network, and the write combine buffer (WCB). This WCB stores up to a whole cache line of MPBT-flagged data before writing it onto the network. The WCB can be flushed by issuing a write to another memory line flagged as MPBT.

[Figure 1: Intel SCC and MPB. (a) Illustration of the Intel SCC (R = router, VRC = voltage regulator controller, DC = DDR3 controller). (b) Layout of a core's MPB space: fool line, send flags, ack flags, setup space, data space.]

B. Message passing paradigms

Based on the structure of the SCC, two basic message passing methods can be used: either the data is written locally and read remotely (PULL), or the data is written remotely and read locally (PUSH). To implement these methods, we split the MPB into two parts: one part dedicated to overhead data such as flags, and one part for data. The flags are used to inform the receiver of a pending message and to acknowledge its reception to the sender. As illustrated in Figure 1(b), 64 bytes of MPB space are allocated for each of the send and receive flags. Each core has its send and receive flag padded up to 32 bytes, hence a total of 128 bytes. Furthermore, a fool line is allocated in order to flush the WCB. Finally, a single line is allocated for setup information and clock synchronization (see Section III.B). The data space for both PUSH and PULL is statically partitioned. Static partitioning was chosen to avoid locking, reduce complexity, and avoid requiring a priori knowledge of the currently communicating cores. Different cores can employ different message passing methods (PUSH or PULL); communication between a PUSH- and a PULL-based core is not supported for the moment. PUSH-based sending requires a strict partitioning between the possible sending cores in order to avoid message inconsistency. The developed library employs a simple division of the remaining MPB space between all 48 cores. Hence, each core has 160 bytes of dedicated message passing space in each of the other cores' MPBs. Figure 2 illustrates a PUSH-based sending process. In this example, cores 0 and 47 are communicating. The sender first puts the data into its dedicated partition in the receiver's MPB (orange). It then raises its send flag on the receiver side (green). When reading its send flags, the receiver will see the raised flag and can read the data out of its own MPB. The receiver then acknowledges the communication by setting its dedicated receive flag at the sender (blue); in other words, the receiver writes to the memory space allocated to it in the sender's MPB to acknowledge message reception. Following this handshake, the sender is able to send the next data packet to the receiver. The process is similar for PULL-based sending; in this case, the message is stored in the MPB of the sender. Since only the sender writes to its own MPB, no partitioning is needed and the full 8000 bytes can be used for data transfers. In other words, while PUSH-based sending combines a remote write with a local read, PULL-based sending combines a local write with a remote read.
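A minimal C sketch of the PUSH handshake just described, assuming the layout of Figure 1(b); the helper mpb_of, the exact offset values, and the flag encoding are our assumptions, not the library's actual API (read_mpbt_flag is from the sketch in Section II.A).

```c
#include <stdint.h>
#include <string.h>

#define MPB_PARTITION 160             /* per-core data partition (bytes) */
/* Offsets into an MPB, following Figure 1(b); exact values assumed. */
#define FOOL_LINE    0
#define SEND_FLAG(c) (32 + (c))
#define ACK_FLAG(c)  (96 + (c))
#define DATA(c)      (256 + (c) * MPB_PARTITION)

extern volatile uint8_t *mpb_of(int core); /* base of a core's MPB (assumed) */
extern uint8_t read_mpbt_flag(volatile const uint8_t *flag); /* see above */
extern int me;                             /* this core's id */

/* PUSH one packet (len <= MPB_PARTITION) to core dest, as in Figure 2. */
void push_send_packet(int dest, const void *buf, size_t len)
{
    volatile uint8_t *rmpb = mpb_of(dest);
    volatile uint8_t *lmpb = mpb_of(me);

    /* 1. Remote write: copy the packet into our partition of the
     *    receiver's MPB (the library uses the iRCCE memory copy here). */
    memcpy((void *)&rmpb[DATA(me)], buf, len);
    lmpb[FOOL_LINE] = 0;               /* dummy MPBT write flushes the WCB */

    /* 2. Raise our send flag on the receiver's side. */
    rmpb[SEND_FLAG(me)] = 1;
    lmpb[FOOL_LINE] = 0;

    /* 3. Wait for the receiver to set our ack flag in our own MPB. */
    while (read_mpbt_flag(&lmpb[ACK_FLAG(dest)]) == 0)
        ;                              /* spin until acknowledged */
    lmpb[ACK_FLAG(dest)] = 0;          /* clear for the next packet */
}
```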

The choice to use either PUSH or PULL is up to the user, but has to be made at compile time. Each of the mechanisms has certain advantages: while the PUSH-based mechanism is beneficial when sending multiple different messages in short intervals to different receivers, the PULL method is advantageous when issuing multicasts, as the send overhead is spread.

[Figure 2: Illustration of the communication for a PUSH-based send between two cores (MPBs of cores 0 and 47, each with fool line, setup space, and data partitions).]

C. Send and receive functions

Send and receive can be conducted in four different ways, governed by the type. These types are asynchronous (async), blocking (blocking), non-blocking (non_blocking), and non-blocking with timeout (non_blocking_timeout). For a send, the types lead to the following behavior (a sketch of the blocking variant follows the list):

async: The sender checks if it can send to the receiver. If it can, it sends all requested bytes to the receiver, returning only after the whole send operation has finished. If the receiver is not ready to receive, the send function returns immediately.

non_blocking: The sender checks if it can send to the receiver. If this check is successful, the sender transmits the first packet, whose size is governed by the size of the available MPB partition. If the receiver directly acknowledges the successful reception of the packet, the next packet is sent. Otherwise, the sender returns the number of transmitted bytes.

non_blocking_timeout: A non_blocking send implementation which sets a timeout after each send. When called, a send of this type not only checks whether a previous send has occurred and/or whether it has terminated, but also whether the timeout value has been reached for any unterminated send. In that case, the receiver is signalled of the timeout occurrence and the send process continues.

blocking: The sender waits until it can send to the receiver and then transmits the complete message. If the message exceeds the size of the partition available to the sender, the sender splits the message into multiple packets. The function does not return until all bytes have been successfully transmitted.
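A hedged sketch of the packetization just described for the blocking type, building on push_send_packet from the previous sketch; can_send_to is an assumed readiness check, not the library's API.

```c
#include <stddef.h>
#include <stdint.h>

#define MPB_PARTITION 160   /* per-core partition size, as above */

/* From the handshake sketch above. */
extern void push_send_packet(int dest, const void *buf, size_t len);
/* Assumed readiness check (receiver can accept a message). */
extern int can_send_to(int dest);

/* Blocking send: wait until the receiver is ready, then split the
 * message into partition-sized packets, each one acknowledged by the
 * handshake before the next is sent. Returns only when all bytes
 * have been transmitted. */
int send_blocking(int dest, const uint8_t *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        size_t pkt = len - sent;
        if (pkt > MPB_PARTITION)
            pkt = MPB_PARTITION;      /* split into multiple packets */
        while (!can_send_to(dest))
            ;                         /* blocking: wait until possible */
        push_send_packet(dest, buf + sent, pkt);
        sent += pkt;
    }
    return (int)sent;
}
```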

Similarly, the different types also influence the receive function:

async: The receiver checks whether or not data has been received. If data was received, the receiver does not return until all the requested bytes have been successfully received. If no data is present, the receive function returns. The receiver returns the number of bytes that were received (either 0 or the number of requested bytes).

non_blocking: The receiver checks if data was received. If so, it receives as many bytes as the size of the MPB partition available to it. If further data is available, it continues the receive; otherwise it finishes the receive process. If no data was present after the first check, the receiver returns directly. The receiver returns the number of bytes that were received.

non_blocking_timeout: A non_blocking implementation which also checks whether the sender has signalled a timeout by setting the flag accordingly, in which case it returns, indicating the timeout to the calling function.

blocking: A blocking receive causes the receiver to wait until all the requested data has been received before returning.

Both the send and receive functions rely on the memory copy function originally designed for RCCE [4,5] to transfer messages into the destination core's MPB. Furthermore, message transfer can be conducted asynchronously: a non-blocking send does not directly require an acknowledgement, and a PUSH-based send can deliver different messages to different receivers without waiting for the first receiver to acknowledge.

D. Comparison of PUSH and PULL mechanisms

As mentioned previously, two different message passing mechanisms were implemented in the library. Since the PUSH-based mechanism is based upon a remote write and a local read, each core has only 160 bytes of message passing space available in every other core. On the other hand, using the PULL-based mechanism, a core can exploit the entire 8000 bytes available in its own MPB. This difference in message passing space results in a noticeable performance difference, as shown in Figure 3. For every message that exceeds 160 bytes, the PUSH-based mechanism has to wait for the receiver to acknowledge the receipt before continuing the send process. Hence, at 160 bytes, a performance gap can be observed. The decrease in performance for messages larger than 8000 bytes for both the PUSH- and PULL-based mechanisms is due to the nature of the experiment. In this simple PING-PONG example, the sender transmits the message to the receiver and then waits for the receiver to respond using the same message. The receiver stores the received message and the destination buffer in its L1 cache. Therefore, for messages larger than half the size of the L1 cache, conflicts between the message reads and writes lead to a decrease in performance until no more useful data remains in the L1 cache (at about 16 KB). This performance reflects the bandwidth sustainable by the L2 cache. Consequently, when messages exceed the size of the L2 cache, another drop in performance occurs. [5]

In addition to illustrating the performance of the PUSH- and PULL-based mechanisms for MPBT-mapped memory, Figure 3 also compares the newly introduced MPBT memory type to uncached memory (UC). Since UC-mapped data is not combined in the WCB but instead put directly on the network, the number of packets increases significantly and hence the performance decreases, resulting in significantly lower performance for UC memory compared to MPBT memory.

III. Validation of the library

A. Verification of the API with Uppaal

We want to verify that the message passing mechanisms work well if the functions are correctly employed by the user. For instance, if core i makes a blocking send to core j and j conducts a blocking read from i, then the transaction will succeed and the cores can continue their processing. For that purpose, we have defined several communication patterns and their expected behaviors. These were verified using the model checker Uppaal.
For each function, send and receive, we modeled its behavior in a generic timed automaton (see Figure 4) which is independent of the type (blocking, ...) and of the paradigm (PUSH or PULL). The types appear in the guards of the transitions. The blocking send automaton works as follows: the sender (see Figure 4(a)) first checks if it can send to the particular destination(s) (eval_check). If a send is not possible, the sender waits, continuously checking if the send can be conducted. Once a send to the receiver(s) can be conducted, the number of packets to be sent is calculated (send_rdy) and the first packet is sent (send_data).

[Figure 3: Performance comparison of PULL and PUSH using a simple PING-PONG example. MPBT = memory mapped as message passing buffer type; UC = memory mapped as uncached.]

[Figure 4: Automata representations of the (a) send and (b) receive functions.]

The sender then waits for the acknowledgement from the receiver(s). Once the acknowledgement has been received, the sender either sends the next packet (send_data) or, if all packets have been sent, returns (send_bdone). The communication patterns check the exchange between one emitter and one receiver. The expected behavior is expressed as a formula in temporal logic. In the case of a blocking sender and a blocking receiver, we want to verify that the message will be received correctly. In the automata, it means that the sender will always reach the state send_bdone while the receiver will always reach the state rec_done. This can be expressed by the formula:

A<> (send.send_bdone and rec.rec_done)

The property has been verified with Uppaal. We have also proven that in the case of a blocking sender and a non_blocking receiver, the message is received correctly if the sender has issued the send before the receiver has conducted the receive and the message is not larger than the available partition in the MPB. In the automata, this means that if the receiver

is still in the Init state and the sender has already sent its message (check_ack), the sender will terminate in the state send_bdone while the receiver will reach the state rec_done. Similarly, it was verified that if no receiver is present, a blocking send will never terminate; in the send automaton, this means that the sender will never reach send_bdone. The communication using a non_blocking sender and a blocking receiver was also verified. Once again, it has to be assumed that the message size does not exceed the size of the available MPB partition. If that is true, the receiver will always terminate (i.e., the state rec_done will always hold at some point in the future). Independent of this, the sender will always terminate by reaching the state send_nbdone. The communication using a non_blocking sender and a non_blocking receiver was verified by checking that if the send occurred before the receive, both send and receive will terminate correctly: the sender will reach the state send_nbdone and the receiver will reach the state rec_done. Finally, it was verified that even if no receiver is present, a non_blocking send will terminate; in other words, the send automaton will reach the state send_nbdone.

B. Global clock and synchronization library

To validate the real-time execution of the message passing library, we need real-time clock primitives. Each SCC core has a Pentium TSC (TimeStamp Counter) which can be used as a locally accessible high-resolution clock. The only problem is that the different TSCs are not synchronized at startup. In order to start the execution of code on the cores, the cores' reset bits have to be cleared; this causes a core to leave the reset state and start execution. As there is no global reset available, the reset bits of all cores have to be released individually, causing discrepancies between the local TSC values. The BareMichael framework allows access to the global clock on the FPGA which is connected to the SCC. Since the cost of accessing this global clock is large compared to the local TSC, and its accuracy depends on the number of cores simultaneously requesting access to the global clock, a local notion of the global clock had to be derived. To generate this notion of global time, each core can simply poll the global time from the FPGA and calculate the local clock's offset via t_global = t_local + offset_local. To create an average value for the access time of the global clock via the mesh network, the global clock is accessed multiple times. The average round-trip time (RTT) is then subtracted from the local time obtained after the last access of the global clock. The result of this computation is subtracted from the scaled global time, yielding the local core's offset value. The scaling of the global clock is necessary because the local and global clocks run at different frequencies. In order to avoid collisions on the mesh network, the global clock should be polled by each core individually. The thereby obtained notion of a global clock can then be used to synchronize the cores. Since we do not necessarily know how many cores are active and what kind of message passing paradigm they use, the synchronization uses the dedicated setup space in the MPB. A master core, which has to be specified at compile time, sends each core a global start time at which the cores can start execution.
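A hedged C sketch of this offset derivation, following the steps just described; read_tsc, read_fpga_clock, the sample count, and the scale factor are our assumptions, not the library's actual interface.

```c
#include <stdint.h>

#define N_SAMPLES 16                    /* number of global-clock accesses */

extern uint64_t read_tsc(void);         /* local TimeStamp Counter */
extern uint64_t read_fpga_clock(void);  /* global clock on the FPGA */
extern double   global_to_tsc_scale;    /* clocks run at different rates */

static int64_t offset_local;            /* t_global = t_local + offset_local */

void derive_offset(void)
{
    uint64_t rtt_sum = 0, t_global = 0, t_after = 0;

    /* Access the global clock multiple times to average the RTT. */
    for (int i = 0; i < N_SAMPLES; i++) {
        uint64_t t_before = read_tsc();
        t_global = read_fpga_clock();
        t_after  = read_tsc();
        rtt_sum += t_after - t_before;
    }
    uint64_t avg_rtt = rtt_sum / N_SAMPLES;

    /* Subtract the average RTT from the local time taken after the last
     * access, then subtract the result from the scaled global time. */
    uint64_t local_at_access = t_after - avg_rtt;
    offset_local = (int64_t)(t_global * global_to_tsc_scale)
                 - (int64_t)local_at_access;
}

/* Local notion of global time, usable for core synchronization. */
uint64_t global_time_now(void)
{
    return (uint64_t)((int64_t)read_tsc() + offset_local);
}
```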
Using this method, a maximum clock discrepancy of 4 µs can be observed between the cores and the master core, which in this case is core 0. This difference can be attributed to the drift between the SCC's clock and the FPGA's clock: since the sequential polling prevents the cores from accessing the global clock at the same time, the drift causes a discrepancy in the cores' notion of global time.

C. Timing experiments - Case study

In order to evaluate the behavior of the NoC, a case study was constructed emulating an AFDX network. Different virtual links (VLs), each defined by a maximum packet size, a period, and a group of receivers, are mapped to different cores (a hedged sketch of such a descriptor follows below). The goal was to evaluate how the delay between the sending and the reception of a message depends on the mapping of the VLs to cores. To that end, two mappings were created. Both mappings put the same overall workload on the cores. The major difference between the two is that in the first mapping, shown in Figure 5(a), the distances between the sending and receiving nodes are minimized, whereas in the second mapping, shown in Figure 5(b), the distances between sender and receiver are maximized, as is the number of messages passing through the same router. The first mapping should therefore create as little congestion as possible on the network; the second mapping was designed specifically to create network congestion. In addition to evaluating both an uncongested and a congested network, we also compare the PULL and the PUSH mechanisms.
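A hedged sketch of how such a virtual link might be described and driven, reusing global_time_now() from the sketch above; all names and the descriptor layout are illustrative assumptions, not the case study's actual code.

```c
#include <stdint.h>

#define N_CORES 48

/* A virtual link as characterized above: maximum packet size, period,
 * and a group of receiving cores. */
typedef struct {
    int      max_size;            /* maximum packet size (bytes)   */
    uint64_t period;              /* sending period (global ticks) */
    int      n_receivers;
    int      receivers[N_CORES];  /* multicast group               */
} virtual_link;

extern uint64_t global_time_now(void);  /* see the Section III.B sketch */
extern void send_vl_frame(const virtual_link *vl, uint64_t timestamp);

/* Periodically emit frames for one VL, stamping each frame with the
 * global time at which it is sent, so that the receiver can compute
 * the transmission delay by subtraction. */
void drive_vl(const virtual_link *vl, uint64_t start)
{
    uint64_t next = start;            /* common global start time */
    for (;;) {
        while (global_time_now() < next)
            ;                         /* busy-wait until the release */
        send_vl_frame(vl, next);      /* frame encodes its send time */
        next += vl->period;
    }
}
```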

[Figure 5: Network traffic (a) without congestion and (b) with congestion.]

During the experiment, each core sends messages at specific intervals according to the VLs mapped to it. The sent messages are encoded with the global time at which the message was sent. A core sends a message every 100 µs; when it is not sending, it tries to receive. Initially, an implementation of receive-any was used to check for the reception of a packet. This implementation read a whole flag line at once in order to reduce the penalty of consecutively accessing single bytes from memory. As this mechanism restarted the send and receive process after each successful read, it was possible that cores with a higher core number sending to a core with a lower core number were never able to successfully send their message; hence, the sending core was blocked. In order to circumvent this behavior and be as fair as possible, the cores sequentially go through all possible senders and check whether they have received a message from that sender. This still penalizes cores with a larger core number, but they do not risk being blocked entirely. Upon receipt of a message, the receiving core evaluates the message's transmission time by subtracting the send time from the time of reception. The maximum message size is 160 bytes. In order to increase the number of messages traversing the network, all messages are sent at the same global time.

The results for PUSH- and PULL-based sending in the uncongested case (Figure 6(a) for PULL and Figure 6(b) for PUSH) are quite distinct. Whereas in the PUSH case a maximum transmission time mirroring the core workload can be observed, the maximum transmission times for the PULL case are subject to stronger variations. Furthermore, little to no difference can be seen between the maximum delays when comparing the congested case (Figure 7(a) for PULL and Figure 7(b) for PUSH) with the uncongested case. Both cases have transmission time maxima on cores 0, 10, 37, and 47. While those peaks are higher in the congested case, the average delay between the transmission and reception of messages is higher for the uncongested case. This is especially obvious for the sends conducted using PULL. The peaks in Figures 6(b) to 7(a) can be understood when considering the number of frames to be sent and received by each core, which can be regarded as a measure of a core's workload. Cores 0, 10, 37, and 47, the cores exhibiting the peaks, are conducting multicasts. Hence, they have more frames to send and flags to set; as this requires time, receiving experiences an additional delay. These multicast cores are also the main reason for the increased maximum delay in the uncongested case for the PULL mechanism. In the PULL case, a multicast can be conducted by writing one message into the sender's MPB and setting the flags of the receiving cores. As this is done sequentially, some cores get notified later than others and observe a larger delay. When examining the mapping for the uncongested case, one can see that the concerned cores are all targets of a multicast. The higher delay at these cores is caused by the way the multicast is conducted when using PULL: while the message only has to be written once, the cores' flags are set sequentially. Therefore, a core can suffer a higher delay when its flag is set later in the multicast sequence. This effect is increased by the way the reception is conducted.
The receiving cores check sequentially, for each core, whether they have received a message from that particular core. Therefore, messages from cores with a higher core number are disadvantaged. From the results presented above, we conclude that the network is not the dominating factor for the delay. Rather, the delay can be attributed to the workload of the cores; cores with a high workload experience higher message delays. Receiving cores may also be penalized when a core is conducting a multicast. Application design should take this into account in order to achieve the required timing precision.
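To make the scanning scheme concrete, a hedged sketch of one polling round; recv_nonblocking and MAX_MSG are assumed names, not the library's API.

```c
#include <stdint.h>

#define N_CORES 48
#define MAX_MSG 160   /* maximum message size in the case study */

/* Assumed non-blocking per-sender receive; returns bytes received (0 if none). */
extern int recv_nonblocking(int src, uint8_t *buf, int max_len);

/* One fair polling round: scan all possible senders in core order and do
 * NOT restart the scan after a successful read. Higher-numbered senders
 * are still checked later (and thus delayed), but can never be starved. */
int receive_round(uint8_t bufs[N_CORES][MAX_MSG], int lens[N_CORES])
{
    int received = 0;
    for (int src = 0; src < N_CORES; src++) {
        lens[src] = recv_nonblocking(src, bufs[src], MAX_MSG);
        if (lens[src] > 0)
            received++;   /* keep scanning; no restart from core 0 */
    }
    return received;      /* number of senders heard this round */
}
```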

[Figure 6: Delay per core without congestion, (a) PULL and (b) PUSH.]

IV. Conclusion

This paper presented a multiple instruction, multiple data bare-metal message passing library for the Intel Single Chip Cloud Computer. The library's send and receive functions were verified using UPPAAL. In addition, a mechanism to arrive at a global notion of time across the distributed cores was presented; this global notion of time is used to synchronize the cores. Using a use case based upon a pseudo AFDX network implementation, it was shown that the main bottleneck of the message passing performance is not the performance of the network but rather the workload of the cores. Therefore, in order to arrive at an evenly distributed transmission time between the cores, an evenly distributed load is desirable. Future work should try to optimize the workload of the cores in order to arrive at the smallest possible transmission time. Furthermore, an a priori analysis of the inter-core communication and core workloads, similar to a schedulability analysis, could be interesting.

[Figure 7: Delay per core with network congestion, (a) PULL and (b) PUSH.]

References

[1] Wilhelm, R., Engblom, J., Ermedahl, A., Holsti, N., Thesing, S., Whalley, D. B., Bernat, G., Ferdinand, C., Heckmann, R., Mitra, T., Mueller, F., Puaut, I., Puschner, P. P., Staschulat, J., and Stenström, P., "The worst-case execution-time problem - overview of methods and survey of tools," ACM Trans. Embedded Comput. Syst., Vol. 7, No. 3, 2008.

[2] Intel Labs, "SCC External Architecture Specification (EAS)," Tech. rep., Intel Corporation, May.

[3] Ziwisky, M. and Brylow, D., "BareMichael: A Minimalistic Bare-metal Framework for the Intel SCC," in Noulard and Vernhes [14].

[4] Mattson, T., Riepen, M., Lehnig, T., Brett, P., Haas, W., Kennedy, P., Howard, J., Vangal, S., Borkar, N., Ruhl, G., et al., "The 48-core SCC processor: the programmer's view," Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society, 2010.

[5] Mattson, T., Riepen, M., Lehnig, T., Brett, P., Haas, W., Kennedy, P., Howard, J., Vangal, S., Borkar, N., Ruhl, G., et al., "The 48-core SCC processor: the programmer's view," presentation.

[6] Clauss, C., Lankes, S., Galowicz, J., and Bemmerl, T., "iRCCE: a non-blocking communication extension to the RCCE communication library for the Intel Single-chip Cloud Computer," Chair for Operating Systems, RWTH Aachen University, December 17, 2010.

[7] ET International Inc., "ETI SCC Bare Metal OS Development Framework: User's Manual," Tech. rep., ET International Inc., January.

[8] Reble, P., Galowicz, J., Lankes, S., and Bemmerl, T., "Efficient Implementation of the bare-metal Hypervisor MetalSVM for the SCC," in Noulard and Vernhes [14].

[9] Sputh, B. H., Lukin, A., and Verhulst, E., "Transparent Programming of Many/Multi Cores with OpenComRTOS: Comparing Intel 48-core SCC and TI 8-core TMS320C6678," in Noulard and Vernhes [14].

[10] Larsen, K., Pettersson, P., and Yi, W., "UPPAAL in a Nutshell," International Journal on Software Tools for Technology Transfer (STTT), Vol. 1, No. 1, 1997, pp. 134-152.

[11] Behrmann, G., David, A., and Larsen, K., "A tutorial on Uppaal," Formal Methods for the Design of Real-Time Systems, 2004, pp. 200-236.

[12] Kaiser, R. and Wagner, S., "Pentium Processor Family Developer's Manual, Volume 3: Architecture and Programming Manual," Intel Corporation.

[13] Millberg, M., "Architectural Techniques for Improving Performance in Networks on Chip," Ph.D. thesis, KTH Royal Institute of Technology, 2011.

[14] Noulard, E. and Vernhes, S., editors, ONERA, The French Aerospace Lab, July 2012, onera.fr/marconera.


More information

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng Architectural Level Power Consumption of Network Presenter: YUAN Zheng Why Architectural Low Power Design? High-speed and large volume communication among different parts on a chip Problem: Power consumption

More information

D1.2 Network Load Balancing

D1.2 Network Load Balancing D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June ronald.vanderpol@sara.nl,freek.dijkstra@sara.nl,

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

In-Vehicle Networking

In-Vehicle Networking In-Vehicle Networking SAE Network classification Class A networks Low Speed (

More information

AFDX Emulator for an ARINC-based Training Platform. Jesús Fernández Héctor Pérez J. Javier Gutiérrez Michael González Harbour

AFDX Emulator for an ARINC-based Training Platform. Jesús Fernández Héctor Pérez J. Javier Gutiérrez Michael González Harbour AFDX Emulator for an ARINC-based Training Platform Jesús Fernández Héctor Pérez J. Javier Gutiérrez Michael González Harbour 2 2 Motivation Mature standards for safety-critical applications ARINC-653 for

More information

AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK

AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK Abstract AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK Mrs. Amandeep Kaur, Assistant Professor, Department of Computer Application, Apeejay Institute of Management, Ramamandi, Jalandhar-144001, Punjab,

More information

Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors

Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors 2011 International Symposium on Computer Networks and Distributed Systems (CNDS), February 23-24, 2011 Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors Atefeh Khosravi,

More information

C-GEP 100 Monitoring application user manual

C-GEP 100 Monitoring application user manual C-GEP 100 Monitoring application user manual 1 Introduction: C-GEP is a very versatile platform for network monitoring applications. The ever growing need for network bandwith like HD video streaming and

More information

SOC architecture and design

SOC architecture and design SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external

More information

Single-chip Cloud Computer IA Tera-scale Research Processor

Single-chip Cloud Computer IA Tera-scale Research Processor Single-chip Cloud Computer IA Tera-scale esearch Processor Jim Held Intel Fellow & Director Tera-scale Computing esearch Intel Labs August 31, 2010 www.intel.com/info/scc Agenda Tera-scale esearch SCC

More information

ANALYSIS OF LONG DISTANCE 3-WAY CONFERENCE CALLING WITH VOIP

ANALYSIS OF LONG DISTANCE 3-WAY CONFERENCE CALLING WITH VOIP ENSC 427: Communication Networks ANALYSIS OF LONG DISTANCE 3-WAY CONFERENCE CALLING WITH VOIP Spring 2010 Final Project Group #6: Gurpal Singh Sandhu Sasan Naderi Claret Ramos (gss7@sfu.ca) (sna14@sfu.ca)

More information

Effects of Filler Traffic In IP Networks. Adam Feldman April 5, 2001 Master s Project

Effects of Filler Traffic In IP Networks. Adam Feldman April 5, 2001 Master s Project Effects of Filler Traffic In IP Networks Adam Feldman April 5, 2001 Master s Project Abstract On the Internet, there is a well-documented requirement that much more bandwidth be available than is used

More information

Design Issues in a Bare PC Web Server

Design Issues in a Bare PC Web Server Design Issues in a Bare PC Web Server Long He, Ramesh K. Karne, Alexander L. Wijesinha, Sandeep Girumala, and Gholam H. Khaksari Department of Computer & Information Sciences, Towson University, 78 York

More information

Communication Networks. MAP-TELE 2011/12 José Ruela

Communication Networks. MAP-TELE 2011/12 José Ruela Communication Networks MAP-TELE 2011/12 José Ruela Network basic mechanisms Introduction to Communications Networks Communications networks Communications networks are used to transport information (data)

More information

Thingsquare Technology

Thingsquare Technology Thingsquare Technology Thingsquare connects smartphone apps with things such as thermostats, light bulbs, and street lights. The devices have a programmable wireless chip that runs the Thingsquare firmware.

More information

Question: 3 When using Application Intelligence, Server Time may be defined as.

Question: 3 When using Application Intelligence, Server Time may be defined as. 1 Network General - 1T6-521 Application Performance Analysis and Troubleshooting Question: 1 One component in an application turn is. A. Server response time B. Network process time C. Application response

More information

524 Computer Networks

524 Computer Networks 524 Computer Networks Section 1: Introduction to Course Dr. E.C. Kulasekere Sri Lanka Institute of Information Technology - 2005 Course Outline The Aim The course is design to establish the terminology

More information

Synchronization. Todd C. Mowry CS 740 November 24, 1998. Topics. Locks Barriers

Synchronization. Todd C. Mowry CS 740 November 24, 1998. Topics. Locks Barriers Synchronization Todd C. Mowry CS 740 November 24, 1998 Topics Locks Barriers Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based (barriers) Point-to-point tightly

More information

Quality of Service Testing in the VoIP Environment

Quality of Service Testing in the VoIP Environment Whitepaper Quality of Service Testing in the VoIP Environment Carrying voice traffic over the Internet rather than the traditional public telephone network has revolutionized communications. Initially,

More information

Chapter 3. Internet Applications and Network Programming

Chapter 3. Internet Applications and Network Programming Chapter 3 Internet Applications and Network Programming 1 Introduction The Internet offers users a rich diversity of services none of the services is part of the underlying communication infrastructure

More information