A Complete Multi-Processor System-on-Chip FPGA-Based Emulation Framework
|
|
|
- Rafe Bruce
- 10 years ago
- Views:
Transcription
1 A Complete Multi-Processor System-on-Chip FPGA-Based Emulation Framework Pablo G. Del Valle, David Atienza, Ivan Magan, Javier G. Flores, Esther A. Perez, Jose M. Mendias, Luca Benini, Giovanni De Micheli DACYA/UCM, Juan del Rosal 8, Madrid, Spain. {datienza, LSI/EPFL, EPFL-IC-ISIM-LSI Station 14, 1015 Lausanne, Switzerland. {david.atienza, DEIS/Bologna, Viale Risorgimento 2, Bologna, Italy. Abstract With the growing complexity in consumer embedded products and the improvements in process technology, Multi-Processor System-On-Chip (MPSoC) architectures have become widespread. These new systems are very complex to design as they must execute multiple complex real-time applications (e.g. video processing, or videogames), while meeting several additional design constraints (e.g. energy consumption or time-to-market). Therefore, mechanisms to efficiently explore the different possible HW-SW design interactions in complete MPSoC systems are in great need. In this paper, we present a new FPGA-based emulation framework that allows designers to rapidly explore a large range of MPSoC design alternatives at the cycle-accurate level. Our results show that the proposed framework is able to extract a number of critical statistics from processing cores, memory and interconnection systems, with a speed-up of three orders of magnitude compared to cycleaccurate MPSoC simulators. I. INTRODUCTION New applications ported to embedded systems (e.g. scalable video rendering or multi-band wireless protocols) demand complex singlechip multi-processor designs to meet their real-time processing requirements while respecting other critical embedded design constraints, such as low energy consumption or reduced implementation size. Furthermore, the consumer market is reducing more and more the time-to-market and price [15]. Thus, not making feasible any longer complete redesigns of such multi-core systems from scratch. In this context, Multi-Processor Systems-on-Chips (MPSoC) have been proposed as a promising solution for all these previous problems, since they are single-chip architectures consisting of complex integrated components communicating with each other at very high speeds [15]. Nevertheless, one of their main design challenges is the fast exploration of multiple hardware (HW) and software (SW) implementation alternatives with accurate estimations of performance, energy and power to tune the MPSoC architecture in an early stage of the design process. Several MPSoC simulators have been proposed, both at transaction and cycle-accurate levels using HDL languages and SystemC. Although they achieve accurate estimations, they are limited in performance (circa Khz) due to signal management overhead. Thus, such environments cannot be used to analyze MPSoC solutions with complex embedded applications and large inputs to cover the variations in data loads at run-time. Moreover, higher abstraction levels simulators attain faster simulation speeds, but at the cost of a significant loss of accuracy. Hence, they are not suitable for finegrained architectural tuning. One solution for the speed problems of cycle-accurate simulators is HW emulation. Various MPSoC emulation frameworks have been proposed [20], [9], [1]. However, they are usually very expensive for embedded design (between $100K and $1M). Moreover, they are not flexible enough for MPSoC architecture exploration since they mainly aim at large MPSoCs prototyping or SW debugging. Typically, the baseline architectures (e.g. processing cores or interconnections) are proprietary, not permitting internal changes. In this paper, we present a new FPGA-based emulation framework that allows to explore a large range of design alternatives of complete MPSoC systems at the cycle-accurate level. Its modular architecture can extract a large range of critical statistics from three key architectural levels of MPSoC systems (i.e. processing cores, memory subsystem and interconnection mechanisms), while real-life applications are executed. Our experiments show that this framework achieves detailed cycle-accurate reports of these three levels with a speed-up of three orders of magnitude compared to state-of-the-art cycle-accurate MPSoC simulators. The remainder of the paper is organized as follows. In Section II, we summarize related work of MPSoC analysis and testing. In Section III, we present the flexible architecture of the MPSoC emulation platform and the statistics extraction mechanism. In Section IV, we present the advanced mechanism incorporated to decouple the emulation performed from the physical characteristics of the present hardware. In Section V, we detail how the HW/SW emulation process of MPSoC architectures is performed. In Section VI, we illustrate the speed and versatility of our emulation framework for MPSoC design. Finally, in Section VII, we draw our conclusions. II. RELATED WORK It is widely accepted that MPSoCs represent a promising solution for forthcoming complex embedded systems [15]. This has spurred research on modelling and prototyping MPSoC designs, both using HW and SW. From the SW viewpoint, different solutions have been suggested at different abstraction levels, enabling tradeoffs between simulation speed and accuracy. First, fast analytical models have been proposed to prune very distinct design options using high level languages (e.g. C or C++) [17], [5]. Second, transaction-level modelling in SystemC, both at the academic [18], [21] and industrial level [16], [6], have enabled more accuracy in system-level simulation at the cost of sacrificing simulation speed (circa Khz). Thus, not enabling extensive testing with large and various systems due to the too long simulation times, conversely to our emulation framework. Moreover, in most of the cases, they are only limited to a range of propietary interfaces (e.g. AMBA or LISA). Finally, important research has been done to obtain cycle-accurate frameworks in SystemC or HDL languages. In this context, companies have developed cycle-accurate simulators using post-synthesis libraries from HW vendors [12], [19]. However, their simulation speeds (10 to 50 Khz) are unsuitable for complex MPSoC exploration and validation. In the academic context, the MPARM SystemC framework presented in [3] is one of the most complete simulators for system-exploration since it includes cycle-accurate cores, complex memory hierarchies (e.g IFIP 140
2 caches, scratchpads) and interconnection mechanisms (e.g. AMBA or STBus). It can extract reliable energy and performance figures, but its major shortcoming is its simulation speed (120 Khz in a Pentium IV at 2.8 Ghz, see Section VI for further details). From the HW viewpoint, an important alternative for MPSoC prototyping and validation is HW emulation. In industry, one of the most complete sets of statistics is provided by Palladium II [20], which can accomodate very complex systems (i.e. up to 256 Mgate). However, its main disadvantages are its operation frequency (circa 1.6 Mhz) and cost (around $1 million). Then, ASIC Integrator [2] is much faster for MPSoC architectural exploration. However, its major drawback is the limitation to up to five ARM-based cores and only AMBA interconnection mechanisms. The same limitation of proprietary cores usage for exploration occurs with Heron [9]. Finally, other relevant industrial emulation approaches are System Explore [1] and Zebu-XL [8], both based in multi-fpga emulation in the order of Mhz. They can be used to validate IPs, but are not flexible enough for fast MPSoC design exploration or detailed statistics extraction. In the academic world, the most complete emulation platform up-to-date for exploring MPSoC alternatives is TC4SOC [7]. It uses a propietary 32-bit VLIW core and enables the exploration of the interconnection mechanisms and different protocols by using an FPGA to reconfigure the network interfaces. Also, [4] proposes a HW/SW emulation framework that enables the exploration of different Network-on- Chip (NoC) interconnection mechanisms. However, they do not allow designers to exhaustively extract statistics and explore the other two architectural levels we propose, namely memory hierarchy and processing cores. III. MPSOC EMULATION ARCHITECTURE The proposed MPSoC framework uses FPGA emulation as the key element to introduce SW to real HW, execute that SW on the considered MPSoC platform capable of multi-megahertz speeds, and extract detailed system statistics. An overview of the baseline HW architecture of our MPSoC emulation platform is depicted in Figure 1. It consists of four main elements from where designers can get critical statistics about MPSoC designs: 1) The evaluation (e.g. performance, generated traffic or locality of references) of different MPSoC processing cores, such as, Power PC, ARM or VLIW cores. 2) The definition of the first level of the on-chip memory hierarchy (i.e. I-cache, D-cache and scratchpad memories), as well as main memories, namely private and shared memories between processors. 3) The evaluation of different interconnection mechanisms between the first level of the memory hierarchy and the main memory (i.e. buses and NoC interconnections). 4) The obtained statistics are automatically sent to a host PC via an standard Ethernet connection thanks to our own developed statistics extraction subsystem. It includes a graphical interface running onto the host PC to analyze the information received and provide detailed reports. These elements are designed in standard and parameterizable VHDL and mapped onto a Xilinx Virtex 2 Pro vp30 board (or V2VP30) with 3M gates, which costs $2000 approximately in the market, and that includes two embedded Power PCs, various types of memories (i.e. SRAM, SDRAM and DDR) and an Ethernet port. However, any other FPGA could be used instead. The only requirements are the availability of an Ethernet core to download the statistics, a compiler for the included cores and a method to upload both the FPGA synthesis of our framework and the compiled code of Fig. 1. Overview of the HW architecture of the emulated MPSoC the application under study. In our case, Xilinx provides all these tools in its Embedded Development Kit (EDK) framework for FPGAs. In addition, remark that the purpose of our emulator is not the prototyping of final HW components in MPSoC systems, but the emulation and fast exploration for designers of desired characteristics of the eventual system. Therefore, our framework includes mechanisms to configure the exploration and hide the physical characteristics of the underlying HW that do not match the selected values, conversely to traditional prototyping. A detailed explanation of the included mechanisms is given in Section IV. In the following subsections we describe in detail the architecture and advanced emulation mechanisms of the different elements included in our emulation platform. Also, it is outlined the synthesis figures for each component. A. Processing Elements In our framework, any type of processor or coprocessor core can be included, both propietary and public ones. The accepted input forms are netlist mapping onto the underlying FPGA and HDL languages (i.e. Verilog, VHDL or Synthesizable SystemC). This addition of cores is possible since the memory controller that receives the memory requests in our system includes an external pinout interface and protocol that can be easily modified to match the respective ones of the studied processor (see Subsection III-B). Moreover, only the instruction set processing part of the core is required because its memory hierarchy (e.g. caches or scratchpad) are replaced by our framework to explore different memory configurations. In the current version of the system, we have ported a hard-core (PowerPC 405) and a soft-core (Microblaze) provided by Xilinx. None includes HDL sources, only netlist mapping, and the inclusion process for their pinout interfaces and protocols required one week. Regarding platform s scalability, it is worth to mention that a complete Microblaze requires only 4% of the total resources of our V2VP30 FPGA (574 out of slices). B. Memory Hierarchy As Figure 1 indicates, in the basic emulated architecture two memory levels presently exist: L1 cache memories and main memories. However, it is a matter of minutes to add additional cache memory levels or private memories to each processing element, either on a processor basis or by processor groups. The main element 141
3 in the memory hierarchy that enables this easy integration of new memory devices and protocols is the memory controller. One memory controller is connected to each processing core to capture all memory requests of the respective processors. Then, it forwards them to the necessary element of the memory subsystem according to the demanded memory address. In the current implementation, it takes 2% of the total available resources of our V2VP30 FPGA (270 of slices), and includes interfaces and protocols for four memory components and three different memory address ranges: 1) Private scratchpad memory, addressable by SW at a configurable memory position and with unrestricted size, as long as enough RAM resources exist. It also allows configuring latency and can be used to place instructions or data. Its synthesis takes 1% of the V2VP30 (181 of slices), apart from used RAM resources. 2) Private main memory, cacheable or non-cacheable, addressable in a configurable memory range of each processor. It is also possible to configure its size and latency. Its synthesis takes an insignificant amount of the total V2VP30 FPGA resources (apart again from the used RAM). 3) Shared main memory, cacheable or non-cacheable according to user s configuration. It is possible to configure its size and latency. Its inclusion does not take any part in the FPGA since it uses real memories (i.e. SRAM, SDRAM or DDR) available on the board. 4) Private HW-controlled data and instruction caches, transparent to the processors, and embedded before the cacheable address range of the two types of available main memories. It is possible to define independently for each of them their total sizes, line sizes and latencies to explore different design alternatives. In our experiments, both caches are direct-mapped. However, their modular designs include in different concurrent processes the replacement policy and associativity, making easy to change this configuration with additional algorithms to test. Finally, each memory controller is able to observe and keep synchronization of different clock domains, due to its multiple external interfaces. Then, the memory controller informs the Virtual Platform Clock Manager (see Section IV for more details) to stop the clock of the processor during the emulation each time one physical memory device is not able to fulfil the defined latency. Hence, the stopped processor preserves its current internal state until it is resumed by the clock manager, when the memory controller informs that the information requested is available. This abstraction layer in the platform allows us to implement the corresponding memory resources either in internal FPGA memory (optimal performance) or with external memories (bigger size), while still preserving the intended emulation parameters and balancing emulation performance and use of resources. Currently, our memory controller monitors two clock domains: one is used for the microprocessor and another one is used for the memories and the memory controller itself. C. Interconnection Mechanisms The third configurable element in our MPSoC emulation framework is the interconnection mechanism between the memory controller and the main memory (i.e. SRAM, SDRAM or DDR memories). At this level, we have included both buses and NoCs. To enable this variety of choices, apart from the flexibility in the type of interfaces of the memory controller, we have also included a configurable main memory bridge in the device side. It includes two different public pinout interfaces: one relates to the memory and another one to the instantiated interconnection. Similarly as with the memory controller, this enables us to extend the current list of available interconnection mechanisms by modifying the required pinout and protocol. In the current version, the two available buses on Xilinx FPGAs are included [10], i.e. On-Chip Peripheral Bus (OPB) for generalpurpose devices and Processor Local Bus (PLB) for fast memories and processors. Also, we have created our own 32-bit data/address bus for exploration purposes, where the bandwidth and arbitration policies can be configured. For our experiments (see Section VI) the arbitration latency is one cycle, and is connected to all processing cores through an OPB interface and to an external SRAM memory through a custom SRAM controller. However, any other algorithm and external pinout can be included. Its synthesis (including the SRAM controller) represents 1% of the V2P30 FPGA (210 of slices). In addition, we have included the possibility to explore custommade NoC solutions. The synthesizable NoC code is generated using the Xpipes NoC Compiler [13]. It allows to study topologies with any number of switches, interconnections between them with bandwidth constraints and Network Interfaces (NIs) to connect external cores to the NoC. We have modified the memory controller and the main memory bridges to be able to generate Open Core Protocol (OCP) transactions as the Xpipes NIs require [13]. Regarding FPGA utilization, a complex NoC-based system with 6 switches of 4 input/output channels and 3 output buffers uses 70% of the V2P30 FPGA (9659 of slices). Finally, to illustrate the complexity of adding new interconnection mechanisms to our framework, remark that the inclusion of each of these buses and NoC interconnection interfaces required us only one week of work. Furthermore, as with processing cores, any other highperformance proprietary bus (e.g. AMBA, STBus, etc.) can be added to our emulation framework as blackbox, since the integration process only requires to know the used protocol and external bus pinout. D. Statistics Extraction Subsystem The main feature pursued in our design of the statistic extraction subsystem is its transparent inclusion in the basic MPSoC architecture to be evaluated, and with minimum penalty in performance in the overall emulation process. For this purpose, as it is depicted in Figure 2 with labels SNIFFER 1..n, we have implemented HW sniffers that monitor certain signals of the memory controller and the external pinout of each of the devices included in the MPSoC. These sniffers are connected to a statistics manager through a dedicated bus, that processes the received information and stores the statistics in a buffer created in dedicated Block RAM memory inside the FPGA. Finally, the buffers are concurrently processed by our network dispatcher to generate MAC packets, in our own format, and send them by an Ethernet port connected to the host PC (see Figure 2). From a design point of view all sniffers in our platform share a common structure. They have a dedicated interface to capture internal signals from the module they are monitoring and a conection to our custom statistics bus. There is an skeleton available to ease the creation of new sniffers. This skeleton requires the designer to indicate separately the set of ports, signals or internal registers to be monitored in the component, and include them in a specific section of a predefined Finite State Machine (FSM). This FSM automatically creates the packets and sends the statistics of the HW sniffer to the statistics manager according to our own custom communication protocol and dedicated bus intercommunication system. Currently, we provide two diferent types of sniffers. The first one, called event-logging, exhaustively logs all interesting events that occur in 142
4 the platform. The second type of sniffers, called count-logging, are designed only to count events, such as cache misses, bus transactions, memory accesses, etc.; Thus, generating more concise results, and what tipically designers demand from cycle-accurate simulators to test their systems. Our experimental results with real-life MPSoC designs (see Section VI) indicate that, practically an unlimited number of event-counting sniffers can be added to the design without deteriorating at all the emulation speed. This establishes one of the main differences with SW cycle-accurate simulation systems: the inclusion of additional cores or analysis sniffers to the evaluated MPSoC architecture does not slow down the emulation process. Finally, as an example to see how much overhead in FPGA area the statistics extraction subsystem represents, remark that the amount of resources used by one event-logging sniffer is 0.1% (14 slices) while for an event-counting sniffer is about 0.2% (31 out of slices). IV. VIRTUAL PLATFORM CLOCK MANAGER (VPCM) Apart from the flexibility of the emulated MPSoC architecture, in order to be able to effectively validate future manufactured versions of MPSoC platforms working at various final frequencies and speeds, our emulation framework includes an additional hardware element, namely the Virtual Platform Clock Manager (VPCM), shown in Figure 2. It is the HW element used in our framework to provide multiple virtual clock domains. This module generates as output the clock signals used in the emulated MPSoC subsystems (VIRTUAL CLK signals in Figure 2). It receives two different types of input signals. First, the physical clock generated in the oscilator of the FPGA (not shown in Figure 2 for simplification purposes), which in the current implementation is set to 150 MHz. Second, one signal from each memory controller of the emulated MPSoC subsystems (VIRTUAL CLK SUPRESSION 1..N in Figure 2) used to request a virtual clock inhibition period if any attached memory device of the emulated hierarchy is not able to return the requested value at this moment respecting its set user-defined latency (see Section III-B). The use of the virtual clock domains generated by the VPCM module is two-fold in our emulation framework: First, this mechanism can be employed to perform the emulation of final MPSoC designs with different physical features than the available HW components. Once the respective VIRTUAL CLK SUPRESSION 1..N signal is risen, the corresponding VIRTUAL CLK signal of that sub-system (or the related set of subsystems) is inhibited (no clock pulses are generated). Hence, the stopped processor preserves its current internal state until its clock (VIRTUAL CLK i) is resumed by the VPCM. This occurs when the memory controller informs that the information requested is available in the accessed memory. This abstraction layer in the platform allows us to implement the corresponding memory resources either in internal FPGA memory (optimal performance) or with external memories (bigger size), while still preserving the intended emulation parameters and balancing emulation performance and use of resources. For instance, if the desired latency of the main memories in the final system is 10 cycles, but it is not feasible with the type of memory modules available for its emulation in the used FPGA (e.g. use of a large SDRAM instead of multiple SRAMs), the VPCM unit will stop the clock of the processors involved at run-time, thus hiding the additional clock cycles required by the memory. Currently, our VPCM includes two clock domains: one is used for the microprocessor, memories and interconnection mechanisms. The other one is for the memory controllers. Fig. 2. HW architecture of the statistics extraction subsystem in the proposed MPSoC emulation framework Second, the virtual clock generated by the VPCM unit for each of the components in the emulated MPSoC can be transparently stopped at run-time in case of saturation of the Ethernet connection. Then, when the congestion has disappeared (i.e. all the extracted statistics at that execution moment have been downloaded), the virtual clock of all the components can be resumed, without losing any vital information of the behaviour of the emulated MPSoC. Moreover, the combination of these two mechanisms enables the execution and cycle-accurate modeling of the emulated system at a different speed than the allowed clocked speed of the available HW components for a certain final configuration of the emulated MPSoC. In fact, it is similar to the mechanism used in SW simulations, but at a much higher frequency (see Section VI). For instance, it is possible to explore the behavior of a final system clocked at 500 MHz, even if the present cores of the FPGA can only work at 100 MHz. To this end, the designer can define the latencies of each final component, which can be easily extracted from a post-synthesis analysis before the cores are combined in the final system to be manufactured. Then, our FPGAbased framework takes care of synchronizing all the components for the correct sampling according to the defined frequency. For instance, in case of using a MPSoC with a desired virtual clock for the main memory of 500 MHz, divided by a real working frequency of 100 MHz on the FPGA, our framework will be able to stop the clock of the real execution as many cycles as needed for the slower devices to respond like if they were operating at the desired speed. V. HW AND SW MPSOC EMULATION FLOWS One key advantage of our approach for a realistic exploration of MPSoC designs at high speed is its double integration of HW and SW flows in one overall framework. An overview of these two separated emulation flows and how they are combined in a final step to create the actual emulated MPSoC architecture is shown in Figure 3. Concerning the HW flow, in the first phase it is defined the HW architecture. In this phase the user specifies one concrete architecture for each of the three main architectural levels that constitute the final 143
5 TABLE I TIMING COMPARISONS BETWEEN OUR MPSOC EMULATION FRAMEWORK AND THE CYCLE-ACCURATE MPARM SIMULATOR MPARM MPSoC Emulation Matrix (one core) 106 sec 1.2 sec Matrix (4 cores) 5 23 sec 1.2 sec Matrix (8 cores) sec 1.2 sec Dithering (4 cores-bus) 2 35 sec 0.18 Dithering (4 cores-noc) 3 15 sec 0.17 Fig. 3. HW and SW flows included in the MPSoC Emulation Framework obtained onto the screen. In case the designer wants to modify the executed application, no re-synthesis is required and in few minutes another application can be tested in the proposed emulation framework. VI. EXPERIMENTAL RESULTS MPSoC system: processing cores, memory subsystem and interconnection to the main memories. This is done by instantiating, in a plug-and-play fashion, the predefined HDL modules available in our repository for each of the previous three levels (see Section III). For a complex MPSoC architecture with 8 processors and 20 additional HW modules, this step requires few hours. In the second phase, the range of parameters to be explored at each level and included HW instance is defined. This process can be done in few minutes thanks to our use of parameters in the instantiated modules. In the third phase, the designer specifies which HW components are the targets of the statistics extraction and connects adequate sniffers to each of them. This phase requires less than one hour in real-life MPSoC designs. Finally, in the fourth phase, the whole MPSoC HW platform is synthesized with the standard tool provided by the concrete FPGA supplier used. In our case we use the Integrated Software Environment (ISE) from the EDK Xilinx toolflow [11]. The synthesis process of our framework for a complex MPSoC system (i.e. 30 elements) takes approximately half an hour for the first time it is performed, but modifications in the current configurations of some cores take no longer than some minutes to be resynthesized. Overall, the whole process of creating the HW component of a complex MPSoC architecture requires no more than few hours within our emulation framework. Related to the SW flow, the first phase is the programming of the application to be used in the final MPSoC system. In our case, the Xilinx EDK tool used for the SW phase includes GNU C (gcc) and C++ (g++) compilers/linkers for the Power PC and Microblaze cores available in our repository. Thus, if the application to be tested is already written in any of these languages, no effort (and time) is required for the designer since the memory hierarchy and the utilization of the interconnection mechanism (e.g. generation of OCP transaction for the NIs of the NoC) are transparently generated by the underlying emulated HW architecture. In the second phase, the SW is compiled and binaries are generated by the EDK framework for each processor. In case of run-time errors, the system includes ported versions of the GNU gdb debugger. The compilation of the SW part of an 8-processor emulation system only takes minutes. Finally, in the last phase of the construction of the MPSoC emulation framework, the HW and SW components are uploaded onto the Xilinx FPGA-based platform using a JTAG device. Once the configuration of the overall MPSoC platform is finished, the system runs autonomously while the statistics are concurrently extracted and sent to the host PC. Then, our provided graphical interface running onto the host captures the packets sent and displays the statistics We have assessed the performance and flexibility of the proposed emulation framework in comparison with the MPARM framework [3] by running several examples of multimedia applications in MPSoC architectures. In our experiments MPARM is executed on a Pentium IV at 3.0 Ghz with 1 GByte SDRAM and running GNU/Linux 2.6. In the first set of experiments we have evaluated the speed-ups that can be obtained by the emulation framework in comparison to cycle-accurate simulators. To this end, we have defined a simple SoC architecture composed by one Power PC core and our own bus using the protocol and latency of the AMBA bus. The memory hierarchy is composed by a L1 D-cache, I-cache and scratchpad memories, and an SRAM main memory of 32 MB. L1 memories sizes are evaluated in the range 2-16 KB, without relevant variations in the required time for the simulations/emulations performed. We have quantified the performance gains on a simple application, typical of multimedia processing, a matrix multiply processing core. Our results indicate (see Table I) that the emulation speed of our cycle-accurate emulation framework running at a real frequency of 100 MHz achieves more than 10 MHz (1.2 sec), including statistics extraction with 6 HW sniffers to cover the L1 memories, the memory controller and the traffic on the bus. However, the simulation speed of the cycle-accurate MPARM framework running on a 3 GHz processor is barely 125 KHz (106 sec). Thus, our emulation framework obtains an overall speedup of 88 while obtaining similar statistics as MPARM, e.g. total cycles spent by the processor, memory/scratchpad accesses, blockread/writes, simple read/writes, etc. (see Table II for a summary). Similar results were obtained with the soft-core Microblaze instead of the Power PC, showing that the inclusion of netlist softcores does not affect the performance of our emulation system, since they are not in the critical path of the virtual platform clocks. The second set of experiments has allowed us to evaluate how the emulated system scales when complex MPSoC architectures are explored. In this case, we have tested various configurations of interconnection mechanisms. We have instantiated a system with 4 cores (1 PowerPC and 3 Microblazes) and a complex L1 hierarchy for each core with 4 KB for D-cache and I-cache, 4 KB scratchpad and 16 KB of private memory. Then, two global main memories are shared between all processors of 1 MB using the OPB, OCP and our own bus. This whole system consumes 66% of the available space in the V2VP30 and runs at a platform frequency of well beyond 100 MHz. Next, we have explored the use of NoCs via the Xpipes Compiler [13] to replace the shared bus. The tested NoC configurations were composed of 2 32-bit switches with 4 inputs/outputs and 3-package buffers. The use of such a NoC topology with a sniffer in the 144
6 TABLE II EXCERPT OF STATISTICS EXTRACTED WITH THE EMULATION FRAMEWORK Metric Matrix (1 core) Processor cycles Memory reads 0 Memory block-reads Memory writes Scratchpad reads Scratchpad writes 0 Cache read hits Cache read misses Fig. 4. Execution speedups of our MPSoC Emulation framework compared to the execution time of the cycle-accurate MPARM simulator emulation system required 30%, using in the end 80% of our board for the whole MPSoC system. As SW driver, we have used a simple dithering filtering (around 300 lines of C code) using the Floyd algorithm [14] in two 128x128 grey images, divided in 4 segments and stored in the shared memories. This application is highly parallel and imposes almost the same workload in each processor, only slightly more in the last processor, which is in charge of merging the obtained results. This second set of experiments showed that the emulation system scales significantly better than SW simulation (see Table I). In fact, the exploration of MPSoC solutions with 4 cores (more than 30 HW components in total) and multiple bus-based interconnections, took 0.18 seconds (at 100 MHz) in the emulation platform, but 155 seconds in MPARM (at 125 KHz), resulting in a final speedup of approximately 860 (see Figure 4 and Table I). Moreover, the exploration of NoC interconnections imposes more overhead in cycleaccurate simulators (see Table I) than in our emulation platform due to the additional signals to manage, which enables even more speedups. In this case, the evaluation of each NoC configuration to try to find the design that optimally uses the available bandwitdh to the two main shared memories avoiding the congestion present in the bus-based solutions, required even less time in the emulated system (0.17 seconds), while more than three minutes in MPARM (due to the overhead of cycle-accurate signal management). As a result, our HW- SW emulation framework achieved an overall speed-up of more than three orders of magnitude (1140 ), illustrating its clear benefits for the exploration of the design space of complex MPSoC architectures compared to cycle-accurate simulators. VII. CONCLUSIONS MPSoC architectures have been proposed as a possible solution to tackle the complexity of forthcoming embedded systems. These new systems are very complex to design since they have to be able to run various complex applications (e.g. video processing, or videogames), while meeting several additional design constraints, such as real-time performance or energy consumption. Hence, new frameworks that enable designers to validate and explore the behavior of their MPSoC design solutions executing the aforementioned software applications in a fast and cycle-accurate way are required. In this paper we have presented a new FPGA-based emulation framework that enables the rapid extraction of a large range of statistics at three different architectural levels of MPSoC designs, i.e. processing cores, memory subsystem and interconnection mechanisms. The experimental results have shown that our proposed framework obtains detailed reports with an speed-up of three orders of magnitude compared to cycle-accurate MPSoC simulators. Moreover, the addition of more processing cores and more complex memory architectures in our emulation framework suitably scales. Thus, almost no loss in emulation speed occurs, conversely to cycle-accurate simulators. VIII. ACKNOWLEDGMENTS The authors would like to thank Federico Angiolini for his help with XPIPESCOMPILER. This work is partially supported by the Spanish Government Research Grant TIN and a Mobility Post-Doc Grant from UCM for David Atienza. REFERENCES [1] Aptix System explore, [2] ARM integrator AP, [3] L. Benini, et al. Mparm: Exploring the MPSoC design space with SystemC. Journal of VLSI, September [4] N. Genko, et al. A Complete Network-On-Chip Emulation Framework. In Proc. of DATE, [5] G. Braun, et al. Processor/memory co-exploration on multiple abstraction levels. In Proc. of DATE, [6] CoWare. Convergensc and LisaTek product lines, coware.com. [7] M. Diaz Nava, et al. An open platform for developing MPSoCs. IEEE Computer, pp , July [8] Emulation and Verification Engineering. Zebu Xl and ZV models, [9] H. Engineering. Heron mpsoc emulation, hunteng.co.uk. [10] Xilinx Enterprise. Xilinx Virtex-II Pro FPGA and IP components descriptions, products/v2pro/xc v2pro43.htm. [11] Xilinx Enterprise. Xilinx Embedded Development Kit (EDK), docs.htm. [12] M. Graphics. Platform express and primecell, mentor.com/. [13] A. Jalabert, et al. xpipescompiler: A tool for instantiating application specific NoC. In Proc. DATE, [14] R. W. Floyd, L. Steinberg. An adaptive algorithm for spatial gray scale. In Proc. of ISDT, [15] A. Jerraya and W. Wolf. Multiprocessor Systems-on-Chips. Morgan Kaufmann, Elsevier, [16] ARM. PrimeXSys platform architecture and methodologies, white paper. Technical report, [17] P. Mishra, et al. Proc.-mem. co-exploration driven by an architectural description language. In Proc. ICVLSI, [18] P. G. Paulin, et al. Stepnp: A system-level exploration platform for network procs. IEEE D & T of Computers, [19] Synopsys. Realview Maxsim ESL environment, synopsys.com/. [20] Cadence Palladium II, [21] A. Wieferink, et al. A generic toolset for SoC multiprocessor debugging and synchronization. In Proc. ASAP,
7a. System-on-chip design and prototyping platforms
7a. System-on-chip design and prototyping platforms Labros Bisdounis, Ph.D. Department of Computer and Communication Engineering 1 What is System-on-Chip (SoC)? System-on-chip is an integrated circuit
Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai 2007. Jens Onno Krah
(DSF) Soft Core Prozessor NIOS II Stand Mai 2007 Jens Onno Krah Cologne University of Applied Sciences www.fh-koeln.de [email protected] NIOS II 1 1 What is Nios II? Altera s Second Generation
Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip
Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip Ms Lavanya Thunuguntla 1, Saritha Sapa 2 1 Associate Professor, Department of ECE, HITAM, Telangana
Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip
Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC
Architectures and Platforms
Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation
Multiprocessor System-on-Chip
http://www.artistembedded.org/fp6/ ARTIST Workshop at DATE 06 W4: Design Issues in Distributed, CommunicationCentric Systems Modelling Networked Embedded Systems: From MPSoC to Sensor Networks Jan Madsen
Hardware Implementation of Improved Adaptive NoC Router with Flit Flow History based Load Balancing Selection Strategy
Hardware Implementation of Improved Adaptive NoC Rer with Flit Flow History based Load Balancing Selection Strategy Parag Parandkar 1, Sumant Katiyal 2, Geetesh Kwatra 3 1,3 Research Scholar, School of
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada [email protected] Micaela Serra
Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com
Best Practises for LabVIEW FPGA Design Flow 1 Agenda Overall Application Design Flow Host, Real-Time and FPGA LabVIEW FPGA Architecture Development FPGA Design Flow Common FPGA Architectures Testing and
MPSoC Virtual Platforms
CASTNESS 2007 Workshop MPSoC Virtual Platforms Rainer Leupers Software for Systems on Silicon (SSS) RWTH Aachen University Institute for Integrated Signal Processing Systems Why focus on virtual platforms?
Networking Virtualization Using FPGAs
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,
Introduction to System-on-Chip
Introduction to System-on-Chip COE838: Systems-on-Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University
Kirchhoff Institute for Physics Heidelberg
Kirchhoff Institute for Physics Heidelberg Norbert Abel FPGA: (re-)configuration and embedded Linux 1 Linux Front-end electronics based on ADC and digital signal processing Slow control implemented as
What is a System on a Chip?
What is a System on a Chip? Integration of a complete system, that until recently consisted of multiple ICs, onto a single IC. CPU PCI DSP SRAM ROM MPEG SoC DRAM System Chips Why? Characteristics: Complex
Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001
Agenda Introduzione Il mercato Dal circuito integrato al System on a Chip (SoC) La progettazione di un SoC La tecnologia Una fabbrica di circuiti integrati 28 How to handle complexity G The engineering
Model-based system-on-chip design on Altera and Xilinx platforms
CO-DEVELOPMENT MANUFACTURING INNOVATION & SUPPORT Model-based system-on-chip design on Altera and Xilinx platforms Ronald Grootelaar, System Architect [email protected] Agenda 3T Company profile Technology
Simplifying Embedded Hardware and Software Development with Targeted Reference Designs
White Paper: Spartan-6 and Virtex-6 FPGAs WP358 (v1.0) December 8, 2009 Simplifying Embedded Hardware and Software Development with Targeted Reference Designs By: Navanee Sundaramoorthy FPGAs are becoming
ESE566 REPORT3. Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU
ESE566 REPORT3 Design Methodologies for Core-based System-on-Chip HUA TANG OVIDIU CARNU Nov 19th, 2002 ABSTRACT: In this report, we discuss several recent published papers on design methodologies of core-based
Optimizing Configuration and Application Mapping for MPSoC Architectures
Optimizing Configuration and Application Mapping for MPSoC Architectures École Polytechnique de Montréal, Canada Email : [email protected] 1 Multi-Processor Systems on Chip (MPSoC) Design Trends
Von der Hardware zur Software in FPGAs mit Embedded Prozessoren. Alexander Hahn Senior Field Application Engineer Lattice Semiconductor
Von der Hardware zur Software in FPGAs mit Embedded Prozessoren Alexander Hahn Senior Field Application Engineer Lattice Semiconductor AGENDA Overview Mico32 Embedded Processor Development Tool Chain HW/SW
LogiCORE IP AXI Performance Monitor v2.00.a
LogiCORE IP AXI Performance Monitor v2.00.a Product Guide Table of Contents IP Facts Chapter 1: Overview Target Technology................................................................. 9 Applications......................................................................
Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip
Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip Cristina SILVANO [email protected] Politecnico di Milano, Milano (Italy) Talk Outline
WiSER: Dynamic Spectrum Access Platform and Infrastructure
WiSER: Dynamic Spectrum Access Platform and Infrastructure I. Seskar, D. Grunwald, K. Le, P. Maddala, D. Sicker, D. Raychaudhuri Rutgers, The State University of New Jersey University of Colorado, Boulder
From Bus and Crossbar to Network-On-Chip. Arteris S.A.
From Bus and Crossbar to Network-On-Chip Arteris S.A. Copyright 2009 Arteris S.A. All rights reserved. Contact information Corporate Headquarters Arteris, Inc. 1741 Technology Drive, Suite 250 San Jose,
A Generic Network Interface Architecture for a Networked Processor Array (NePA)
A Generic Network Interface Architecture for a Networked Processor Array (NePA) Seung Eun Lee, Jun Ho Bahn, Yoon Seok Yang, and Nader Bagherzadeh EECS @ University of California, Irvine Outline Introduction
Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin
BUS ARCHITECTURES Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin Keywords: Bus standards, PCI bus, ISA bus, Bus protocols, Serial Buses, USB, IEEE 1394
SOC architecture and design
SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external
AN FPGA FRAMEWORK SUPPORTING SOFTWARE PROGRAMMABLE RECONFIGURATION AND RAPID DEVELOPMENT OF SDR APPLICATIONS
AN FPGA FRAMEWORK SUPPORTING SOFTWARE PROGRAMMABLE RECONFIGURATION AND RAPID DEVELOPMENT OF SDR APPLICATIONS David Rupe (BittWare, Concord, NH, USA; [email protected]) ABSTRACT The role of FPGAs in Software
Seeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
Open Flow Controller and Switch Datasheet
Open Flow Controller and Switch Datasheet California State University Chico Alan Braithwaite Spring 2013 Block Diagram Figure 1. High Level Block Diagram The project will consist of a network development
OpenSoC Fabric: On-Chip Network Generator
OpenSoC Fabric: On-Chip Network Generator Using Chisel to Generate a Parameterizable On-Chip Interconnect Fabric Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf MODSIM 2014 Presentation
Switched Interconnect for System-on-a-Chip Designs
witched Interconnect for ystem-on-a-chip Designs Abstract Daniel iklund and Dake Liu Dept. of Physics and Measurement Technology Linköping University -581 83 Linköping {danwi,dake}@ifm.liu.se ith the increased
Getting Started with Embedded System Development using MicroBlaze processor & Spartan-3A FPGAs. MicroBlaze
Getting Started with Embedded System Development using MicroBlaze processor & Spartan-3A FPGAs This tutorial is an introduction to Embedded System development with the MicroBlaze soft processor and low
Computer System Design. System-on-Chip
Brochure More information from http://www.researchandmarkets.com/reports/2171000/ Computer System Design. System-on-Chip Description: The next generation of computer system designers will be less concerned
Network connectivity controllers
Network connectivity controllers High performance connectivity solutions Factory Automation The hostile environment of many factories can have a significant impact on the life expectancy of PCs, and industrially
Reconfigurable System-on-Chip Design
Reconfigurable System-on-Chip Design MITCHELL MYJAK Senior Research Engineer Pacific Northwest National Laboratory PNNL-SA-93202 31 January 2013 1 About Me Biography BSEE, University of Portland, 2002
Applying the Benefits of Network on a Chip Architecture to FPGA System Design
Applying the Benefits of on a Chip Architecture to FPGA System Design WP-01149-1.1 White Paper This document describes the advantages of network on a chip (NoC) architecture in Altera FPGA system design.
Embedded System Tools Reference Manual
Embedded System Tools Reference Manual EDK [optional] Xilinx is disclosing this user guide, manual, release note, and/or specification (the Documentation ) to you solely for use in the development of designs
A Framework for Automatic Generation of Configuration Files for a Custom Hardware/Software RTOS
A Framework for Automatic Generation of Configuration Files for a Custom Hardware/Software Jaehwan Lee, Kyeong Keol Ryu and Vincent John Mooney III School of Electrical and Computer Engineering Georgia
White Paper. S2C Inc. 1735 Technology Drive, Suite 620 San Jose, CA 95110, USA Tel: +1 408 213 8818 Fax: +1 408 213 8821 www.s2cinc.com.
White Paper FPGA Prototyping of System-on-Chip Designs The Need for a Complete Prototyping Platform for Any Design Size, Any Design Stage with Enterprise-Wide Access, Anytime, Anywhere S2C Inc. 1735 Technology
Embedded Development Tools
Embedded Development Tools Software Development Tools by ARM ARM tools enable developers to get the best from their ARM technology-based systems. Whether implementing an ARM processor-based SoC, writing
Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:
55 Topic 3 Computer Performance Contents 3.1 Introduction...................................... 56 3.2 Measuring performance............................... 56 3.2.1 Clock Speed.................................
Notes and terms of conditions. Vendor shall note the following terms and conditions/ information before they submit their quote.
Specifications for ARINC 653 compliant RTOS & Development Environment Notes and terms of conditions Vendor shall note the following terms and conditions/ information before they submit their quote. 1.
Chapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
2. TEACHING ENVIRONMENT AND MOTIVATION
A WEB-BASED ENVIRONMENT PROVIDING REMOTE ACCESS TO FPGA PLATFORMS FOR TEACHING DIGITAL HARDWARE DESIGN Angel Fernández Herrero Ignacio Elguezábal Marisa López Vallejo Departamento de Ingeniería Electrónica,
Pre-tested System-on-Chip Design. Accelerates PLD Development
Pre-tested System-on-Chip Design Accelerates PLD Development March 2010 Lattice Semiconductor 5555 Northeast Moore Ct. Hillsboro, Oregon 97124 USA Telephone: (503) 268-8000 www.latticesemi.com 1 Pre-tested
Offline HW/SW Authentication for Reconfigurable Platforms
Offline HW/SW Authentication for Reconfigurable Platforms Eric Simpson Virginia Tech [email protected] Patrick Schaumont Virginia Tech [email protected] Abstract Many Field-Programmable Gate Array (FPGA) based
Implementation Details
LEON3-FT Processor System Scan-I/F FT FT Add-on Add-on 2 2 kbyte kbyte I- I- Cache Cache Scan Scan Test Test UART UART 0 0 UART UART 1 1 Serial 0 Serial 1 EJTAG LEON_3FT LEON_3FT Core Core 8 Reg. Windows
Extending Platform-Based Design to Network on Chip Systems
Extending Platform-Based Design to Network on Chip Systems Juha-Pekka Soininen 1, Axel Jantsch 2, Martti Forsell 1, Antti Pelkonen 1, Jari Kreku 1, and Shashi Kumar 2 1 VTT Electronics (Technical Research
Seamless Hardware/Software Performance Co-Monitoring in a Codesign Simulation Environment with RTOS Support
Seamless Hardware/Software Performance Co-Monitoring in a Codesign Simulation Environment with RTOS Support L. Moss 1, M. de Nanclas 1, L. Filion 1, S. Fontaine 1, G. Bois 1 and M. Aboulhamid 2 1 Department
Design and Verification of Nine port Network Router
Design and Verification of Nine port Network Router G. Sri Lakshmi 1, A Ganga Mani 2 1 Assistant Professor, Department of Electronics and Communication Engineering, Pragathi Engineering College, Andhra
HowHow to Get Rid of Unwanted Money
On-Chip Evolution Using a Soft Processor Core Applied to Image Recognition Kyrre Glette and Jim Torresen University of Oslo Department of Informatics PO Box 1080 Blindern, 0316 Oslo, Norway {kyrrehg,jimtoer}@ifiuiono
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
Extending the Power of FPGAs. Salil Raje, Xilinx
Extending the Power of FPGAs Salil Raje, Xilinx Extending the Power of FPGAs The Journey has Begun Salil Raje Xilinx Corporate Vice President Software and IP Products Development Agenda The Evolution of
CFD Implementation with In-Socket FPGA Accelerators
CFD Implementation with In-Socket FPGA Accelerators Ivan Gonzalez UAM Team at DOVRES FuSim-E Programme Symposium: CFD on Future Architectures C 2 A 2 S 2 E DLR Braunschweig 14 th -15 th October 2009 Outline
A General Framework for Tracking Objects in a Multi-Camera Environment
A General Framework for Tracking Objects in a Multi-Camera Environment Karlene Nguyen, Gavin Yeung, Soheil Ghiasi, Majid Sarrafzadeh {karlene, gavin, soheil, majid}@cs.ucla.edu Abstract We present a framework
RAPID PROTOTYPING OF DIGITAL SYSTEMS Second Edition
RAPID PROTOTYPING OF DIGITAL SYSTEMS Second Edition A Tutorial Approach James O. Hamblen Georgia Institute of Technology Michael D. Furman Georgia Institute of Technology KLUWER ACADEMIC PUBLISHERS Boston
FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab
FPGA Accelerator Virtualization in an OpenPOWER cloud Fei Chen, Yonghua Lin IBM China Research Lab Trend of Acceleration Technology Acceleration in Cloud is Taking Off Used FPGA to accelerate Bing search
Electronic system-level development: Finding the right mix of solutions for the right mix of engineers.
Electronic system-level development: Finding the right mix of solutions for the right mix of engineers. Nowadays, System Engineers are placed in the centre of two antagonist flows: microelectronic systems
How Router Technology Shapes Inter-Cloud Computing Service Architecture for The Future Internet
How Router Technology Shapes Inter-Cloud Computing Service Architecture for The Future Internet Professor Jiann-Liang Chen Friday, September 23, 2011 Wireless Networks and Evolutional Communications Laboratory
Digital Systems Design! Lecture 1 - Introduction!!
ECE 3401! Digital Systems Design! Lecture 1 - Introduction!! Course Basics Classes: Tu/Th 11-12:15, ITE 127 Instructor Mohammad Tehranipoor Office hours: T 1-2pm, or upon appointments @ ITE 441 Email:
Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education
Lesson 7: SYSTEM-ON ON-CHIP (SoC( SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY 1 VLSI chip Integration of high-level components Possess gate-level sophistication in circuits above that of the counter,
Testing of Digital System-on- Chip (SoC)
Testing of Digital System-on- Chip (SoC) 1 Outline of the Talk Introduction to system-on-chip (SoC) design Approaches to SoC design SoC test requirements and challenges Core test wrapper P1500 core test
White Paper Increase Flexibility in Layer 2 Switches by Integrating Ethernet ASSP Functions Into FPGAs
White Paper Increase Flexibility in Layer 2 es by Integrating Ethernet ASSP Functions Into FPGAs Introduction A Layer 2 Ethernet switch connects multiple Ethernet LAN segments. Because each port on the
Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng
Architectural Level Power Consumption of Network Presenter: YUAN Zheng Why Architectural Low Power Design? High-speed and large volume communication among different parts on a chip Problem: Power consumption
Eingebettete Systeme. 4: Entwurfsmethodik, HW/SW Co-Design. Technische Informatik T T T
Eingebettete Systeme 4: Entwurfsmethodik, HW/SW Co-Design echnische Informatik System Level Design: ools and Flow Refinement of HW/SW Systems ools for HW/SW Co-Design C-based design of HW/SW Systems echnische
FPGA-based MapReduce Framework for Machine Learning
FPGA-based MapReduce Framework for Machine Learning Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huangzhong YANG 1 1 Department of Electronic Engineering Tsinghua University, Beijing, China
Going Linux on Massive Multicore
Embedded Linux Conference Europe 2013 Going Linux on Massive Multicore Marta Rybczyńska 24th October, 2013 Agenda Architecture Linux Port Core Peripherals Debugging Summary and Future Plans 2 Agenda Architecture
Accelerate Cloud Computing with the Xilinx Zynq SoC
X C E L L E N C E I N N E W A P P L I C AT I O N S Accelerate Cloud Computing with the Xilinx Zynq SoC A novel reconfigurable hardware accelerator speeds the processing of applications based on the MapReduce
FPGA area allocation for parallel C applications
1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University
OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect
OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE Guillène Ribière, CEO, System Architect Problem Statement Low Performances on Hardware Accelerated Encryption: Max Measured 10MBps Expectations: 90 MBps
10/100/1000Mbps Ethernet MAC with Protocol Acceleration MAC-NET Core with Avalon Interface
1 Introduction Ethernet is available in different speeds (10/100/1000 and 10000Mbps) and provides connectivity to meet a wide range of needs from desktop to switches. MorethanIP IP solutions provide a
A Mixed-Signal System-on-Chip Audio Decoder Design for Education
A Mixed-Signal System-on-Chip Audio Decoder Design for Education R. Koenig, A. Thomas, M. Kuehnle, J. Becker, E.Crocoll, M. Siegel @itiv.uni-karlsruhe.de @ims.uni-karlsruhe.de
Eli Levi Eli Levi holds B.Sc.EE from the Technion.Working as field application engineer for Systematics, Specializing in HDL design with MATLAB and
Eli Levi Eli Levi holds B.Sc.EE from the Technion.Working as field application engineer for Systematics, Specializing in HDL design with MATLAB and Simulink targeting ASIC/FGPA. Previously Worked as logic
Concurrent Hardware/Software Development Platforms Speed System Integration and Bring-Up
Concurrent Hardware/Software Development Platforms Speed System Integration and Bring-Up Author: Ran Avinun, Cadence Design Systems, Inc. Hardware/software development platforms such as virtual prototyping,
Serial port interface for microcontroller embedded into integrated power meter
Serial port interface for microcontroller embedded into integrated power meter Mr. Borisav Jovanović, Prof. dr. Predrag Petković, Prof. dr. Milunka Damnjanović, Faculty of Electronic Engineering Nis, Serbia
Codesign: The World Of Practice
Codesign: The World Of Practice D. Sreenivasa Rao Senior Manager, System Level Integration Group Analog Devices Inc. May 2007 Analog Devices Inc. ADI is focused on high-end signal processing chips and
System on Chip Platform Based on OpenCores for Telecommunication Applications
System on Chip Platform Based on OpenCores for Telecommunication Applications N. Izeboudjen, K. Kaci, S. Titri, L. Sahli, D. Lazib, F. Louiz, M. Bengherabi, *N. Idirene Centre de Développement des Technologies
MPSoC Designs: Driving Memory and Storage Management IP to Critical Importance
MPSoC Designs: Driving Storage Management IP to Critical Importance Design IP has become an essential part of SoC realization it is a powerful resource multiplier that allows SoC design teams to focus
DDS. 16-bit Direct Digital Synthesizer / Periodic waveform generator Rev. 1.4. Key Design Features. Block Diagram. Generic Parameters.
Key Design Features Block Diagram Synthesizable, technology independent VHDL IP Core 16-bit signed output samples 32-bit phase accumulator (tuning word) 32-bit phase shift feature Phase resolution of 2π/2
Real-time Processor Interconnection Network for FPGA-based Multiprocessor System-on-Chip (MPSoC)
Real-time Processor Interconnection Network for FPGA-based Multiprocessor System-on-Chip (MPSoC) Stefan Aust, Harald Richter Department of Computer Science Clausthal University of Technology Julius-Albert-Str.
All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule
All Programmable Logic Hans-Joachim Gelke Institute of Embedded Systems Institute of Embedded Systems 31 Assistants 10 Professors 7 Technical Employees 2 Secretaries www.ines.zhaw.ch Research: Education:
Tensilica Software Development Toolkit (SDK)
Tensilica Datasheet Tensilica Software Development Toolkit (SDK) Quickly develop application code Features Cadence Tensilica Xtensa Xplorer Integrated Development Environment (IDE) with full graphical
Designing Systems-on-Chip Using Cores
Designing Systems-on-Chip Using Cores Reinaldo A. Bergamaschi 1, William R. Lee 2 1 IBM T. J. Watson Research Center, Yorktown Heights, NY, 2 IBM Microelectronics, Raleigh, NC [email protected], [email protected]
Rapid System Prototyping with FPGAs
Rapid System Prototyping with FPGAs By R.C. Coferand Benjamin F. Harding AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Newnes is an imprint of
Multistage Interconnection Network for MPSoC: Performances study and prototyping on FPGA
Multistage Interconnection Network for MPSoC: Performances study and prototyping on FPGA B. Neji 1, Y. Aydi 2, R. Ben-atitallah 3,S. Meftaly 4, M. Abid 5, J-L. Dykeyser 6 1 CES, National engineering School
Chapter 2 Heterogeneous Multicore Architecture
Chapter 2 Heterogeneous Multicore Architecture 2.1 Architecture Model In order to satisfy the high-performance and low-power requirements for advanced embedded systems with greater fl exibility, it is
Product Development Flow Including Model- Based Design and System-Level Functional Verification
Product Development Flow Including Model- Based Design and System-Level Functional Verification 2006 The MathWorks, Inc. Ascension Vizinho-Coutry, [email protected] Agenda Introduction to Model-Based-Design
FPGA Prototyping Primer
FPGA Prototyping Primer S2C Inc. 1735 Technology Drive, Suite 620 San Jose, CA 95110, USA Tel: +1 408 213 8818 Fax: +1 408 213 8821 www.s2cinc.com What is FPGA prototyping? FPGA prototyping is the methodology
International Journal of Scientific & Engineering Research, Volume 4, Issue 6, June-2013 2922 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 6, June-2013 2922 Design and Verification of a Software Defined radio platform using Modelsim and Altera FPGA. Barun Sharma,P.Nagaraju,Krishnamurthy
Video Conference System
CSEE 4840: Embedded Systems Spring 2009 Video Conference System Manish Sinha Srikanth Vemula Project Overview Top frame of screen will contain the local video Bottom frame will contain the network video
