Issues with designing a dual core processor with a shared L2 cache on a Xilinx FPGA board


V Bhanu Chandra (Y3383), Varun Sharma (Y3393)
{vbhanu,varunsh}@cse.iitk.ac.in
Project Supervisors: Dr. Mainak Chaudhuri, Dr. Rajat Moona
{mainakc,moona}@cse.iitk.ac.in
Department of Computer Science and Engineering
Indian Institute of Technology Kanpur

Abstract

Dual core processors provide better performance than single core processors because they support thread level parallelism. We discuss the issues involved in designing a dual core processor with private L1 caches and a shared L2 cache on an FPGA board.

1. Problem Statement

We discuss the issues involved in designing a MicroBlaze based dual core processor on a Xilinx FPGA board. The first level cache is already implemented and is private to each core, so we assess the problems involved in designing a second level cache that is shared by the two processors.

2. Introduction

As observed by Moore's law, transistor density is increasing very rapidly, while the performance that can be extracted from a single core has neared its limit. In this situation, designers are looking at multi-core processor designs to improve the overall performance of the system. A multi-core microprocessor combines two or more independent processors into a single package, often a single integrated circuit (IC); a dual-core device contains exactly two independent microprocessors. Applications very often exhibit thread level parallelism (TLP), and multi-core microprocessors can enhance overall system performance by exploiting this TLP without requiring multiple microprocessors in separate physical packages. This form of TLP is often known as chip-level multiprocessing, or CMP. International Business Machines (IBM)'s POWER4, released in 2001, was the first dual core processor. A more recent advancement in this field is Intel's Centrino Duo Mobile Technology [2], which is used in Centrino notebooks.

3. Simulation Environment Tools

3.1. Xilinx Platform Studio (XPS) and Embedded Development Kit (EDK)

EDK [1] is an integrated software solution for designing embedded processing systems and implementing them on a Xilinx FPGA device. The components of the Xilinx EDK are:

- Hardware IP (Intellectual Property) for the Xilinx embedded processors and their peripherals.

- Drivers, libraries and a microkernel for embedded software development.
- Software Development Kit (SDK), an Eclipse based IDE.
- GNU compiler and debugger for C development for MicroBlaze and PowerPC.

XPS provides an integrated environment for creating the software and hardware specification flows for embedded processor systems based on the MicroBlaze and PowerPC processors. It provides a graphical system editor for connecting processors, peripherals and buses. The salient features of XPS are:

- Ability to add cores, edit core parameters, and make bus and signal connections to generate an MHS (Microprocessor Hardware Specification) file.
- Ability to generate and modify the MSS (Microprocessor Software Specification) file.
- Support for the following tools: Xilinx SDK, the Base System Builder (BSB) wizard, the Create and Import IP wizard, Platform Generator, Library Generator, Xilinx Microprocessor Debugger (XMD), and the Platform Specification Utility.
- Ability to generate and view system design reports.
- Process and tool flow dependency management.

3.2. Virtex 4 FPGA board

We used the Virtex 4 FPGA board and the ML403 Evaluation Platform [4] for implementing our designs. Some of the salient features of the board are:

- 64 MB DDR SDRAM with a 32 bit interface running up to a 266 MHz data rate.
- General purpose LEDs and push buttons.
- RS-232 serial port.
- One 4 Kb IIC EEPROM.
- PS/2 mouse and keyboard connectors.
- JTAG configuration port for use with the Parallel Cable III and Parallel Cable IV cables.
- System ACE and CompactFlash connector.

3.3. Integrated Software Environment (ISE)

ISE is a design software suite. ISE can be used to design from different source types such as HDL, schematic design files, state machines and IP cores. HDL sources may be synthesized using Xilinx Synthesis Technology (XST). The Project Navigator interface provides a visual design manager for the entire process.

3.4. Hardware Used

We are using a Virtex 4 ML403 Evaluation Platform. The other hardware components include:

- An RS232 cable for I/O, connected to the RS232 port on the board on one end and to a serial port on the other.
- A JTAG adapter for hardware debugging and for downloading the bitstream onto the board.

4. Design

4.1. Design of the processor

The processor that we use is a highly configurable 32-bit MicroBlaze [3] soft core. The MicroBlaze processor uses a big endian, bit-reversed format to represent data. The L1 cache is already implemented and is private to the processor; if an L1 cache is needed, it can simply be enabled. The size of the L1 cache is configurable, with a minimum size of 2 KB.

Figure 1: MicroBlaze architecture

The processor is a master on an OPB [5] (on-chip peripheral bus) and also on the LMB (local memory bus), each of which serves its own address range. When the OPB bus is added to the system, the OPB arbiter is generated by XPS. When the processor does not find certain data in the first level cache (which is private to the processor), a request for that data is sent onto the OPB bus, which has access to that address range. The local memory bus is connected to the local memory, a small memory made of BRAM blocks. The LMB is a high speed bus and BRAM is also very efficient, so this memory can be used to host small programs, which improves their performance.

4.2. System Design

The fact that the OPB is a multi-master bus allows one to connect both cores to the bus and to place the cache controller on the same bus. Optionally, one can enable the L1 cache in each processor; this cache is private to the processor. If the L1 cache is enabled in the processor, then our cache mechanism acts as the second level cache; otherwise it acts as the first level cache.

The L2 cache could be implemented in BRAM blocks, with the controller being a peripheral connected to the OPB bus. A processor generates a request for some action at an address by setting the signals on the OPB. The device on the OPB bus responsible for that address identifies the request and processes it as needed. When the request is for data, the L1 cache controller, private to each processor, checks whether the data is in the L1 cache; otherwise a request is sent onto the OPB bus. This request is then observed by the L2 cache controller: if the data is present in the L2 cache, it is returned; otherwise the data is fetched from the DDR SDRAM and sent on the OPB bus. The data is then delivered to the processor, being cached in L1 on the way, and the same data is cached in the L2 cache by the cache controller.

Figure 2: Final design

The processor puts the address from which the data has to be fetched on the Bus2IP_Addr port. At every clock cycle this port is examined, and if the appropriate read signals are set and the data is available in the cache, it is put on the bus through the controller's IP2Bus_Data port. If the data is not available in the cache, it is fetched from the DDR SDRAM by setting DDRRAM_IP2Bus_Addr to the address from which the data is to be fetched; the data can then be obtained by reading the DDRRAM_Bus2IP_Data port. A sketch of this hit/miss decision is given below.
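A minimal Verilog sketch of this decision in the controller's user logic follows. Bus2IP_Addr and IP2Bus_Data are the IPIC ports named above; Bus2IP_RdCE and IP2Bus_RdAck are the OPB IPIF read chip-enable and read-acknowledge signals; tag_match, cache_data_out and miss are illustrative internal names, not ports of our actual controller.

```verilog
// Hit path of the L2 controller's user logic (sketch, not the full design).
module l2_read_path (
    input  wire        Bus2IP_Clk,
    input  wire        Bus2IP_RdCE,     // read chip-enable from the IPIF
    input  wire [31:0] Bus2IP_Addr,     // address driven by the processor
    input  wire [31:0] cache_data_out,  // word read from the data BRAMs
    input  wire        tag_match,       // comparator over the tag BRAM
    output reg  [31:0] IP2Bus_Data,
    output reg         IP2Bus_RdAck,
    output reg         miss             // raised to start a DDR SDRAM fetch
);
    always @(posedge Bus2IP_Clk) begin
        IP2Bus_RdAck <= 1'b0;
        miss         <= 1'b0;
        if (Bus2IP_RdCE) begin
            if (tag_match) begin
                IP2Bus_Data  <= cache_data_out; // hit: answer from BRAM
                IP2Bus_RdAck <= 1'b1;
            end else begin
                miss <= 1'b1; // miss: fetch the line from DDR SDRAM
            end
        end
    end
endmodule
```

On a miss, a separate state machine would drive DDRRAM_IP2Bus_Addr and collect the line from DDRRAM_Bus2IP_Data before acknowledging the processor.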

5. Implementation

We have created a design with two processors, each with its own local memory, and two separate programs, each of which prints a different string. Each program is loaded into the local memory corresponding to one of the cores and is executed on that core. Both the expected and the actual output are a jumbled interleaving of the two strings.

Due to a bug in the 8.2i version of the software it was not possible to create a peripheral coded in Verilog. We have found a workaround for this; it requires one to change the MPD and PAO files generated by the software.

5.1. Cache Implementation

The focus of our work so far has been to implement a second level cache for a single processor. The idea is to implement the cache using BRAM blocks, with the controller written in Verilog. The IPIC (intellectual property interconnect) makes it simpler to address the signals needed to access data in the BRAM blocks, and also makes the information on the OPB more easily accessible. We use separate BRAM blocks for the cache data and for the tags. The address sent onto the OPB by the processor is received by the cache controller and looked up in the tag BRAM to see whether the corresponding data is available. If it is, the data is sent to the processor; otherwise the data is obtained from the DDR memory and written to the cache before being sent to the processor. Data is read from the DDR SDRAM in units of eight words, and the cache is write-through. The cache that we have discussed and implemented has the following structure:

1. Six tag bits are associated with each cache line.
2. Ten bits are allocated for the index.
3. Five bits are taken for the block offset.

The 32-bit address is therefore laid out as:

| first 11 bits are 0 | 6 tag bits | 10 index bits | 5 block offset bits |

With this structure we need 32 KB of space for the cache data and 2 KB for the tags, half of which holds the state bits. The tags are stored in one BRAM block and the cache data is stored in 16 BRAM blocks, so we use 17 BRAM blocks in total. In the tag BRAM block, the first line is assigned to a tag and the second to the corresponding state bits; this pattern continues, so half of the BRAM block is taken by tags while the other half holds state bits. The sketch below shows how a request address decomposes into these fields.
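Under the assumption of conventional [31:0] bit numbering ([31] being the MSB; the MicroBlaze documentation itself numbers bits in the reverse order), the decomposition can be sketched in Verilog as follows. stored_tag and valid are illustrative stand-ins for the values read back from the tag BRAM and its state bits:

```verilog
// Address decomposition for the 32 KB, direct-mapped, 32-byte-line cache
// described above: 1024 lines (10 index bits) x 32 bytes (5 offset bits).
module l2_addr_decode (
    input  wire [31:0] Bus2IP_Addr,
    input  wire [5:0]  stored_tag,   // tag read from the tag BRAM
    input  wire        valid,        // valid bit from the state bits
    output wire [5:0]  tag,
    output wire [9:0]  index,
    output wire [4:0]  offset,
    output wire        tag_match
);
    // Bus2IP_Addr[31:21] are assumed zero, as stated in the text.
    assign tag    = Bus2IP_Addr[20:15]; // 6 tag bits
    assign index  = Bus2IP_Addr[14:5];  // 10 index bits: selects 1 of 1024 lines
    assign offset = Bus2IP_Addr[4:0];   // 5 bits: byte within a 32-byte line
    // Hit when the line is valid and the stored tag matches the request.
    assign tag_match = valid && (stored_tag == tag);
endmodule
```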

6. Issues with the design

As soon as our cache controller compiled, we were stuck with a problem in our initial design. The address generated by the processor for fetching the data belonged to the DDRRAM. Since the cache controller was a peripheral attached to the same bus, it had a different address range and hence could not capture the signals directed at the DDRRAM. We also cannot have overlapping address ranges for the cache controller and the DDRRAM. So there was no way we could run the system with a fully functioning L2 cache. We had overlooked this problem in its initial stages, hoping that we could make do with disjoint address spaces for the cache controller and the DDRRAM: the idea was to configure the processor to generate addresses in the range of the controller and then translate each address arriving at the controller before sending it to the DDRRAM. Later we discovered that it is not possible to change the whole set of addresses generated by the processor, since the processor will always have instructions that refer to an absolute address and not just an offset.

7. Alternate Designs

We discussed the following designs after we realised that the earlier design does not work:

1. This design has two OPB buses. The MicroBlaze processor and all peripherals except the DDRRAM are attached to the first bus, and the DDRRAM is attached to the second OPB bus. The L2 cache controller is a slave on the first bus and a master on the second. The idea is that all signals directed towards the DDRRAM have to pass through the cache controller, so the cache can come into the picture whenever it has the requested data.

Figure: Design with multiple OPB buses

2. In the next design, we remove the DDRRAM from the system and instead use the BRAMs in the L2 cache controller to emulate it. The BRAMs attached to the L2 cache controller are broken into two parts, one acting as the DDRRAM and the other acting as the L2 cache. Access to the first part can be delayed using delay signals, while the other part remains as fast as BRAMs are, so that the cache looks like the faster memory it should be; a sketch of this delay mechanism follows the figure.

Figure: Design with DDR emulated in BRAM
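The delay idea can be sketched as a simple wait-state counter: a BRAM read completes in one cycle, so the acknowledge for the emulated-DDR region is held back for a fixed number of cycles. The module and parameter names, and the latency value, are illustrative assumptions:

```verilog
// Makes one BRAM region look like slower DDR SDRAM by stalling the
// acknowledge for DDR_LATENCY cycles after a request (sketch only).
module slow_bram_ack #(
    parameter DDR_LATENCY = 8   // assumed number of stall cycles
) (
    input  wire clk,
    input  wire req,            // read request to the emulated-DDR region
    output reg  ack             // delayed acknowledge back to the bus
);
    reg [3:0] count = 0;
    always @(posedge clk) begin
        ack <= 1'b0;
        if (req) begin
            if (count == DDR_LATENCY - 1) begin
                count <= 0;
                ack   <= 1'b1;  // data treated as valid only now
            end else begin
                count <= count + 1;
            end
        end else begin
            count <= 0;
        end
    end
endmodule
```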

Each of the above designs met with problems because of which it cannot be implemented either. The first design runs into the same problem of two different peripherals sharing an address space, while the second does not work due to the limited number of BRAM blocks on the system. The system has a total of 36 BRAM blocks, 17 of which are used by the cache; assuming we use one BRAM block for the first level cache of each processor, we would be left with only 17 BRAM blocks to emulate the DDRRAM. Due to this limited number of BRAM blocks, this design also has to be discarded.

8. Suggested Solutions

As we discussed these problems, we came to believe that the following approaches would help solve them:

1. The first proposed design change is to use a user defined processor. The advantage of using our own processor is that we can generate addresses as we require. We can add a translator to the processor so that addresses referring to the DDR memory are appropriately mapped into the address range of the L2 cache controller. We can then even do away with the two buses, and all the IPs can be attached to the same bus.

2. The second proposal is to add the functionality of a standard IBM bridge protocol to the L2 cache controller, connected to both buses. It will be a slave on the bus connected to the processor and a master on the second bus. The bridge protocol requires only an address range that is mapped onto the second bus; it has no address range of its own. If the L2 cache controller has this protocol built in, it won't overlap with the DDR, and it can respond to a data request whenever the data is present in the cache in a transparent manner, which is exactly the functionality of a cache.

A sketch of the translation step from the first suggestion is given below.
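This is a minimal sketch of that translation, with placeholder base addresses; the real ranges would come from the system's address map:

```verilog
// Remaps CPU addresses that fall in the DDR range into the (disjoint)
// address range of the L2 cache controller before they reach the OPB.
module addr_translate #(
    parameter DDR_BASE = 32'h0000_0000, // assumed DDR SDRAM base
    parameter DDR_HIGH = 32'h03FF_FFFF, // assumed DDR SDRAM high address
    parameter L2C_BASE = 32'h8000_0000  // assumed L2 controller base
) (
    input  wire [31:0] cpu_addr,  // address generated by the core
    output wire [31:0] bus_addr   // address actually driven on the OPB
);
    wire in_ddr = (cpu_addr >= DDR_BASE) && (cpu_addr <= DDR_HIGH);
    // Redirect DDR accesses to the controller; pass others through.
    assign bus_addr = in_ddr ? (L2C_BASE + (cpu_addr - DDR_BASE))
                             : cpu_addr;
endmodule
```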

9. Conclusion

The implementation of an L2 cache is not possible in the current scenario because of software restrictions: the L2 cache and the DDR cannot have the same address range, because both are IPs attached to the bus and the address ranges of such IPs cannot overlap. Therefore a more flexible system is required. This may be achieved using a user defined processor which generates addresses in the range of the L2 cache, or by using an OPB2OPB bridge. Using a bridge helps in communicating across two buses, and the bridge can also act as an L2 cache controller connected to BRAMs. It can capture the signals and reply on its own depending on whether it has the data; if it does not have the data, it can request the DDR to furnish it. To test the correctness of the current user logic of the L2 cache controller, either of the designs needs to be implemented. Various issues have to be kept in mind while designing, some of which are:

1. Failure of the software to make proper entries into the system.
2. The behavior of the processor or the bridge should conform to the OPB bus protocol.
3. While using the bridge, the cache controller should behave in a transparent manner.

We have listed a number of issues that we faced while designing the system and methods to overcome them. These issues should be considered while implementing a fully functional dual core system with a shared L2 cache.

References

[1] Embedded System Tools Reference Manual. http://www.xilinx.com/ise/embedded/est_rm.pdf
[2] Intel Technology Journal, "Intel Centrino Duo Mobile Technology", Volume 10, Issue 02, May 15, 2006. http://www.intel.com/technology/itj/2006/volume10issue02/art01_intro_to_core_duo/p01_abstract.htm
[3] MicroBlaze Processor Reference Guide. http://www.xilinx.com/ise/embedded/mb_ref_guide.pdf
[4] ML40x EDK Processor Reference Design. http://www.xilinx.com/bvdocs/userguides/ug082.pdf
[5] OPB IPIF Architecture. http://www.xilinx.com/ipcenter/catalog/logicore/docs/opb_ipif.pdf