Network Architecture Design Exploration and Simulation on High Speed Camera System using SynDEx

Eri Prasetyo W., Antonius Irianto S., Nurul Huda, Djati K., Michel P.

Doctoral Program of Information Technology, Gunadarma University, Indonesia
Faculty of Industrial Engineering, Gunadarma University, Indonesia
KPK, Indonesia
Gunadarma University, Indonesia
LEAD, Burgundy University, France
{eri, irianto}@staff.gunadarma.ac.id

Abstract

Embedded multi-processor development for machine vision, such as camera systems, continues to attract interest. This paper extracts specific multi-processor network interconnection design features from an architectural point of view. Two types of network related to our design, the ring and the coherent interconnection network, are described. SynDEx is used to simulate the candidates and to find the best-fit multiprocessor network architecture. The extracted features will be used to modify our network design, continuing our previous research results on fps 64x64 pixel sensors.

Keywords: multi-processor, Heyrman, ring, tile, SynDEx

1 Introduction

Improvements continue to be made to digital camera systems based on CMOS technology. The main advantage of CMOS over CCD is the ability to integrate processing elements with the sensor at pixel level. CMOS therefore makes it much easier to build a complete system on a single chip, which is why the term System on Chip (SoC) is now common. As mentioned in [1], advances in CMOS technology have enabled multi-processor system-on-chip (MPSoC) devices to be built. MPSoCs provide high computing power in an energy-efficient way, making them ideal for multimedia consumer applications. Camera systems are one such application. An MPSoC consists of Processing Elements (PEs). For scalability reasons we envision that in the near future MPSoCs will include a Network-on-Chip (NoC) for communication between PEs, as in, for example, [2].
Many PE network architecture designs have been introduced by other researchers. Hakduran Koc [3] proposed a new method for fetching data from memory in embedded multiprocessors, and Daewook Kim [4] also addressed shared-memory multiprocessors. Klaus Herrmann [5] proposed a new distributed embedded DRAM within a multi-processor system. Amerijckx [6] introduced the architecture of a new embedded field programmable processor array (E-FPPA). The interconnection network selected for the E-FPPA is a hierarchical ring architecture. A more general architecture is proposed by Baghdadi [7]; its generic character lies in its modularity, flexibility and scalability.

An abstract representation of the multiprocessor system considered in our research is explained in [8]. That design has increased speed but may still have a bottleneck in its network design. This paper exposes this problem and explores alternative network designs to overcome the bottleneck and to bring other advantages.

SynDEx [9] is a free academic system-level CAD tool; its name stands for Synchronized Distributed Executive. SynDEx is developed at INRIA Rocquencourt, France. It supports the AAA methodology (Algorithm Architecture Adequation) for distributed real-time processing. SynDEx provides a timing graph that includes simulation results of the distributed application, which enables SynDEx to be used as a virtual prototyping tool. The extracted features of the solution will be used to modify our network design and to continue our previous research results on fps 64x64 pixel sensors.

2 Pre-considered Network Architecture

Our consideration of network architecture is inspired by Heyrman [8]. This multi-processor design model is shown in figure 1.

Figure 1. Modeled Architecture [8]

The network of PEs chosen in [8] is a partial crossbar, shown in figure 2.

Figure 2. Simple schema of PE [8]

This network is the main concern of our refinement of the modeled architecture. The model has the following disadvantages:

- The estimated silicon area would be fairly large when using AMS 0.35 µm CMOS technology.
- As figure 1 shows, the link between MUX and Network and the link between MUX and Memory are possible bottlenecks, because of the large volume of data that fills the bus.
- The memory scheduling algorithm and the PE selection algorithm would be difficult to implement because they depend on pseudo-assembler code.

A way to overcome the latency problem of a single bus would be to use a crossbar architecture. Unfortunately, crossbars are not scalable and their implementation cost is high.

3 Alternative Network Architecture

The network that interconnects the PEs is an important part of a multiprocessing system. Indeed, if we are not able to provide data efficiently to the processing elements, or if the link between the sensor and the PE array is a bottleneck, the whole system will suffer a significant loss of performance. To overcome the disadvantages of the Heyrman design [8] (figure 1 and figure 2), we explore alternative network architectures to find the best fit for our design. First, we explain the network architecture based on a ring topology proposed in [6]. Second, we summarize the features of the CAKE project, which adopts a coherent interconnect network [10].

3.1 Ring

As mentioned before, Amerijckx [6] introduced a new processor interconnection architecture using a ring topology, shown in figure 3.

Figure 3. Ring Network Architecture [6]

A block (B) is composed of an embedded processor, its data memory and its program memory, and is directly connected to a transfer controller (TC). In this architecture, each block (B) is connected to a ring of level-(i) by a transfer controller (TC), which handles all interfacing between the block and the ring network. Each level-(i) ring is connected to a level-(i-1) ring by an inter-ring transfer controller, which manages the transfers between rings. This kind of network design has many advantages. Some of its main advantages are:

- Ravindran et al. [11] have shown that small hierarchical rings are much more efficient than meshes of higher dimension.
- As mentioned in [12], one of the main advantages of this architecture is its high scalability.
- The short point-to-point connections allow the network to work at a very high frequency.
- Moreover, these networks and their performance are well known [13].

However, the main disadvantage of this ring is that only one block can use the ring at a time. This mechanism leads to low network utilization. That is why Amerijckx compared the performance of token ring, slotted ring and register insertion ring.

3.2 Coherent Interconnect Network

The coherent interconnect network was proposed by the CAKE (Computer Architecture for a Killer Experience) project [10]. The CAKE project suggests implementing a regular structure of communicating tiles (the uniform clusters). Each tile can be configured to execute a set of tasks. A suitable inter-tile communication infrastructure is a two-dimensional torus, see figure 4.

Figure 4. Homogeneous Network of tiles [10]

Figure 5 depicts a typical tile design. The blocks labeled SPF represent the special purpose hardware functions that are key to the computational efficiency. There are multiple memory banks to increase concurrency and improve throughput. All communication with other tiles on the chip is done by the router. The NIC is the network interface controller, responsible for the communication protocol.

Figure 5. Typical Architecture of a Tile [10]

This network has some advantages. The architecture is highly scalable in both processors and memories, and each tile contains a share of the CPUs connected to a share of the memory, which allows the efficiency of memory utilization to be increased. The process network algorithm was tested using YAPI [14].
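As a rough illustration of the two-dimensional torus mentioned above, the following Python sketch counts the minimal number of router hops between two tiles, with and without the wrap-around links that turn a mesh into a torus. The function names, the (x, y) tile coordinates and the 4x4 grid size are assumptions made for this example; they are not taken from the CAKE design [10].

```python
def mesh_hops(src, dst):
    """Minimal hops between two (x, y) tiles on a 2D mesh without wrap-around."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Minimal hops between two (x, y) tiles on a width x height 2D torus.

    In each dimension a packet may travel directly or through the
    wrap-around link, whichever is shorter.
    """
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


def average_hops_from_origin(width, height, wrap_around):
    """Average hop count from tile (0, 0) to every other tile of the grid."""
    total = 0
    count = 0
    for x in range(width):
        for y in range(height):
            if (x, y) == (0, 0):
                continue  # skip the source tile itself
            if wrap_around:
                total += torus_hops((0, 0), (x, y), width, height)
            else:
                total += mesh_hops((0, 0), (x, y))
            count += 1
    return total / count
```

On the assumed 4x4 grid, the distance between opposite corners drops from 6 hops without wrap-around links to 2 hops with them. Hop count is only one indicator; it ignores link frequency, router cost and wire length.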
Our limited exploration has found the following disadvantages:

- The tiles should be small enough that they do not suffer too much from long wiring, yet large enough to host a significant number of hardware functions, so as to achieve high computational efficiency on a wide range of applications.
- A good mechanism and special treatment are needed to spread the traffic across the NICs. The local NIC may be flooded when the number of local CPUs and local memories increases, and it would then become a bottleneck.

4 Simulation Result

Each processor element network architecture is modeled and simulated using SynDEx.

4.1 Heyrman Architecture

The Heyrman multi-processor network [8], shown in figure 6, consists of an input block (input memory and image input), MUX, network, processor elements, RAM, and output memory. After composing the Main Algorithm, the next step is to build the Main Architecture block. This block contains the operators and the communication media, so that they can communicate with each other. The algorithm and the architecture are connected by a software component. The simulation result can be seen in the timing graph shown in figure 7.
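The idea of mapping an algorithm graph onto an architecture graph and reading the result as a timing graph can be illustrated with a simplified Python sketch. The greedy earliest-finish rule used here is an assumption chosen for brevity and is not SynDEx's actual adequation heuristic. The operation names follow the blocks listed in section 4.1, while the durations, the operator names P0 and P1 and the fixed transfer delay are invented for the example.

```python
def schedule(operations, durations, deps, operators, comm_time=1):
    """Place each operation on the operator where it finishes earliest.

    `operations` must be listed in topological order. Returns a dict
    mapping each operation name to (operator, start, end).
    """
    free_at = {operator: 0 for operator in operators}
    placed = {}
    for name in operations:
        best = None
        for operator in operators:
            ready = 0
            for pred in deps.get(name, []):
                pred_operator, _, pred_end = placed[pred]
                # Data produced on another operator pays an assumed
                # fixed transfer delay on the communication medium.
                delay = comm_time if pred_operator != operator else 0
                ready = max(ready, pred_end + delay)
            start = max(ready, free_at[operator])
            end = start + durations[name]
            if best is None or end < best[2]:
                best = (operator, start, end)
        placed[name] = best
        free_at[best[0]] = best[2]
    return placed


# Hypothetical algorithm graph: input -> mux -> two PEs -> output.
OPERATIONS = ["input", "mux", "pe0", "pe1", "output"]
DURATIONS = {"input": 2, "mux": 1, "pe0": 4, "pe1": 4, "output": 1}
DEPS = {"mux": ["input"], "pe0": ["mux"], "pe1": ["mux"],
        "output": ["pe0", "pe1"]}

timing = schedule(OPERATIONS, DURATIONS, DEPS, ["P0", "P1"])
makespan = max(end for _, _, end in timing.values())
```

With these invented values the two PE operations run on different operators and overlap in time, which is the kind of parallelism a timing graph makes visible.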

Figure 6. Main Algorithm Window

Figure 7. Timing Graph of Heyrman Multiprocessors Network Architecture

4.2 Ring Architecture

The ring multi-processor network architecture is modeled as shown in figure 8. This algorithm consists of a mux, an IRTC (Inter-Ring Transfer Controller), nodes that each consist of processor elements and a Transfer Controller (TC) as seen on figure 11, and output memory. Every node is connected to its neighbors and to the IRTC in a ring configuration. Operators and communication media communicate in the Main Architecture. The resulting timing graph is shown in figure 9.

Figure 8. Ring Multi Processor Network Main Algorithm

Figure 9. Ring Multi Processors Network Timing Graph

4.3 Coherent Interconnection Network

This type of architecture is also known as Tile Architecture. The tile architecture is modeled using 4 processor elements that receive the same amount of data, sent by a router, as shown in figure 10. This router works as the data transfer controller for the processor elements. If a processor element is idle, the router sends data from memory to that processor element through a unit delay, which is needed for synchronization. The processed data is sent to a register and to memory. The timing graph of this multi-processor network simulation can be seen in figure 11.

Figure 10. Tile Multi Processor Network Main Algorithm

Figure 11. Multi Processors Network Timing Graph

5 Conclusion

In this article, several multi-processor networks are described, modeled and simulated using the SynDEx software. With this simulation method, the most important and complicated parts of multi-processor network development, such as the distribution of code to the different processors and the synchronization between computation and communication, are all handled by the SynDEx tool, and the code can be generated automatically with the help of the necessary kernels. The code can be used as a program to run on an FPGA.

References

[1] Maarten H. Wiggers, Nikolay Kavaldjiev, Gerard J. M. Smit, Pierre G. Jansen. Architecture design space exploration for streaming applications through timing analysis. Proceedings of Communicating Process Architectures (WoTUG-28).
[2] Nikolay Kavaldjiev, Gerard J. M. Smit, Pierre G. Jansen. A virtual channel router for on-chip networks. Proceedings of IEEE International SOC Conference, September.
[3] Hakduran Koc, Mahmut Kandemir, Ehat Ercanli, Ozcan Ozturk. Reducing off-chip memory access costs using data recomputation in embedded chip multiprocessors. ACM, DAC, 48, June.
[4] Daewook Kim, Manho Kim and Gerald E. Sobelman. DCOS: Cache embedded switch architecture for distributed shared memory multiprocessor SoCs.
[5] Klaus Herrmann, Sören Moch, Jörg Hilgenstock, Peter Pirsch. Implementation of a multiprocessor system with distributed embedded DRAM on a large area integrated circuit. Proceedings IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), October.
[6] C. Amerijckx, J.-D. Legat. A low-power multiprocessor architecture for embedded reconfigurable systems.
[7] A. Baghdadi, N.-E. Zergainoh, D. Lyonnard, A. A. Jerraya. Generic architecture platform for multiprocessor system-on-chip design.
[8] Barthelemy Heyrman, Michel Paindavoine, Renaud Schmit, Laurent Letellier, Thierry Collette. Smart camera design for intensive embedded computing. Real-Time Imaging, 11:282-289.
[9] T. Grandpierre, C. Lavarenne and Y. Sorel. Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors. CODES 99, pages 74-78.
[10] Paul Stravers and Jan Hoogerbrugge. Single-Chip Multiprocessing for Consumer Electronics. Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation.
[11] G. Ravindran, M. Stumm. A performance comparison of hierarchical ring- and mesh-connected multiprocessor networks. In Proceedings of HPCA 97, pages 58-69.
[12] L. M. Ni, P. K. McKinley. A survey of wormhole routing techniques in direct networks. IEEE Computer, pages 62-76, February.
[13] W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6), June.
[14] E. A. de Kock, G. Essink, W. J. M. Smits, P. van der Wolf, J.-Y. Brunel, W. M. Kruijtzer, P. Lieverse, K. A. Vissers. YAPI: Application modeling for signal processing systems. Proceedings of the 37th Design Automation Conference, 2000.