MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Using low cost FPGAs for realtime video processing

MASTER'S THESIS

Filip Roth

Brno, spring 2011

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Advisor: prof. Ing. Václav Přenosil, CSc.

Acknowledgement

I would like to thank IBSmm Engineering for allowing parts of my work to be published as this thesis and also for being understanding and supportive during my studies. I would like to thank my advisor, prof. Ing. Václav Přenosil, CSc., for his helpfulness and guidance during the writing of this thesis. Last but not least, I would like to thank my family and friends for their support during my studies.

Abstract

This thesis describes the use of current-day low cost Field Programmable Gate Arrays (FPGAs) for realtime broadcast video processing. The capabilities of the selected device family (Altera Cyclone IV) are discussed with regard to video processing. Example IP cores (deinterlacer, alpha blender and frame rate converter) are designed in Verilog HDL and the design flow is described. The IP cores are implemented in a real hardware system. The overall hardware system is described, together with the individual FPGA components providing video input/output and other I/O functions.

Keywords

FPGA, video processing, deinterlacing, alpha blending, frame rate conversion, Verilog, HDL, hardware design flow

Contents

1 Preface
2 Introduction
   2.1 History of Field Programmable Gate Arrays
   2.2 Present day FPGAs
       Programmable logic
       Routing resources
       Embedded memory
       Embedded multipliers
       Development software
   2.3 Future possibilities of FPGAs
   2.4 Video processing on an FPGA
3 Broadcast video transport standards
   3.1 Parallel digital data
   3.2 Serial digital interface (SDI)
   3.3 Digital Visual Interface (DVI)
4 Project requirements
   4.1 Video deinterlacing
   4.2 Low latency
   4.3 On-screen display generation
   4.4 Video stream switching
   4.5 Image capture
5 Device family selection
   5.1 Design requirements
   5.2 Altera Cyclone family
   5.3 Xilinx Spartan family
   5.4 Lattice Semiconductor Corporation
   5.5 Final selection
6 Evaluation of commercial IP cores from Altera
   6.1 Video and Image Processing Suite (VIP)
   6.2 DDR2 High Performance Controller II
   6.3 NIOS II soft processor
7 Selected system structure
   7.1 Block diagram
   7.2 Camera video input
   7.3 Frame buffer
   7.4 USB link
   7.5 Deinterlacer
   7.6 PC video input
   7.7 Alpha blender
8 Example video processing cores
   8.1 Deinterlacer
       Algorithm overview (line duplication, line interpolation, weave, motion adaptive)
       Algorithm selection
       Principle of operation
       Implementation
   8.2 Alpha blender
       Principle of operation
       Implementation
   8.3 Frame buffer
       Principle of operation
       Implementation
FPGA design flow
   Separate projects for custom components
   Use standard interfaces
   Optimize the memory access pattern
   SignalTap II logic analyzer
   Horizontal and vertical device migration
   Physical I/O pin mapping
Resulting hardware
   Verification of the hardware
Conclusion
Bibliography
A Pin placement
B Device floorplan
C Slack histogram
D Contents of the enclosed CD

Chapter 1

Preface

This work originates in the author's work as a hardware developer at IBSmm Engineering [1], a hardware design house located in Brno, Czech Republic. The video processing IP (intellectual property) cores presented in this work are part of a video processing device developed at IBSmm. The subject of this thesis is therefore a commercial product into which significant time and effort was invested. As is common in the industry, the IBSmm Engineering management is not willing to release the entire product documentation, including board design files, firmware sources, intellectual property sources for the programmable logic or schematic documentation, to the public domain. Therefore, a decision was made to make public only selected parts of the design, demonstrating the approaches and algorithms used to accomplish the required functions, but not the entire source codes or project files.

For this reason, this work describes the overall system only briefly and a full description is given only to the IP cores developed by the author for providing the video processing functions of the system. The IP core source codes are available on the enclosed CD and in the online archive at Masaryk University, each as a separate Quartus II 9.1SP2 project. The entire FPGA design is the original work of the author of this thesis, together with the FPGA pin assignments, timing constraints and the major part of the resulting hardware board schematic (some blocks in the board schematic were reused from earlier projects and were not done by the author). The complete project documentation is available upon request, provided that the requestor signs an NDA with IBSmm Engineering.

The author hopes that, despite these limitations, this work will give useful information to readers interested in video processing on an FPGA and also provide a real world demonstration of the development of a product using these technologies.

Chapter 2

Introduction

Nowadays, as the requirements for the processing power of embedded systems grow, many systems are starting to use FPGAs to offload the processing functions traditionally done by an embedded CPU or ASIC. This was made possible by the advancements in chip manufacturing technology described by Moore's law[2], where programmable logic device parameters such as density, processing power, power consumption and cost improved enough to become a viable alternative to the traditional approaches. Additionally, a design using programmable logic offers specific advantages over other approaches, mainly the possibility to alter the configuration of the hardware in the field (hence the name), which is a very useful feature considering problems like bug fixes and the frequent need to modify the design after the product is finished. Of course, this flexibility comes at a premium compared to a dedicated hardened CPU or ASIC, usually both in terms of power consumption and unit price. However, especially for small production series, the flexibility of programmable logic may more than balance the additional cost of the device; the CPU may not be exactly suited to the application and the ASIC development costs may be well out of bounds of the estimated product volumes.

With the gradual transition of video signal representation from analog signals like VGA and SCART to the digital domain, programmable logic started to provide the processing functions where required. With their inherently parallel nature, these devices are well suited for algorithms requiring high bandwidth and the calculation of many operations in parallel on the video data.

2.1 History of Field Programmable Gate Arrays

The history of Field Programmable Gate Arrays (FPGAs) began with the advent of programmable logic arrays (PLAs)[3]. From today's point of view, these devices were relatively simple and were used mainly as glue logic, merging several discrete combinational logic ICs into one chip. Programmable logic devices progressed hand in hand with the advancements in IC manufacturing technology and architecture theory, and in 1985 the first commercially viable Field Programmable Gate Array was developed by Xilinx, Inc.[4].

From that point, FPGAs grew both in density and capabilities, other vendors emerged and the devices started to be used in more market segments than the initial networking and telecommunications areas. For a more in-depth overview of FPGA history, please see [4].

2.2 Present day FPGAs

Nowadays, FPGAs are standard off-the-shelf components, ranging in size and capabilities. Usually, an FPGA is composed of configurable logic, routing resources, embedded memory, multipliers and a range of hardened peripheral interfaces. Not physically present in the FPGA, but from the design standpoint an integral part of the device design flow, is the FPGA development software.

Programmable logic

The programmable logic is composed of LUTs (look-up tables, sometimes also called LEs - logic elements), which are SRAM based cells performing a user defined function given by the FPGA configuration bitstream. The exact LUT structure varies by manufacturer and device family; for illustration, shown here is the LUT structure of the Altera Cyclone IV device family.

Figure 2.1: Altera Cyclone IV LUT structure[5]
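To make the relationship between HDL code and LUTs concrete, the following is a minimal illustrative Verilog fragment (not taken from the thesis design): the synthesis tool evaluates the Boolean expression and maps it onto the LUTs of the device, with a single Cyclone IV LUT covering any function of up to four inputs.

    // Hypothetical example: a small combinational function.
    // Synthesis maps the expression onto LUTs; on Cyclone IV each LUT
    // can implement an arbitrary function of up to 4 inputs.
    module lut_example (
        input  wire a, b, c, d,
        output wire y
    );
        assign y = (a & b) | (~c & d);   // one 4-input function -> one LUT
    endmodule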

Routing resources

To enable connections between the logic elements themselves and between the logic elements and any other parts of the chip, the FPGA contains the interconnect. These are hardened connection paths inside the chip, either general purpose for the user design or with a specific function, e.g. clock distribution networks. The clock distribution paths are designed in such a way as to provide uniform clock distribution with minimal skew over all parts of the chip. This is an important part of the interconnect fabric, since most FPGA designs are synchronous and the quality of clock distribution directly affects the maximum frequency at which the user design can properly function (this maximum frequency is usually called fmax). The interconnect is also the part of an FPGA occupying the most silicon resources of the chip; some estimates state that up to 90% of the silicon die is dedicated to routing[7].

Embedded memory

Many FPGA designs require some kind of fast memory for temporary storage of intermediate results, data buffers and similar structures. For this reason, the chip contains embedded memory blocks. These are hardened SRAM memory units, usually configurable for different memory sizes, data widths or single/dual port access.

Embedded multipliers

Since FPGAs are well suited for digital signal processing (DSP), most device families contain hardened multipliers. This provides the designer with optimized blocks with higher performance (fmax) than soft (in logic) implementations and also frees up logic resources, which would otherwise be needed to implement the multiplier function. The DSP blocks are usually fixed point; newer and high end FPGA families implement hardened floating-point-optimized components[6].

Development software

An important part of FPGA development is the design software. This software package provides the designer with an interface to all FPGA design stages, from design entry to programming of the configuration memory. The software is responsible for transferring the user design onto a selected physical device and its structure while meeting the user requirements for design timing (timing constraints). Contrary to the software world, where compilation times are relatively small and the iterative development cycle is short, a larger FPGA design can take several hours to compile. The compiler must analyze the design, convert the algorithms into device-specific blocks and fit the resulting netlist into the selected device fabric.
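As a brief illustration of how designs use the embedded memory and multiplier blocks described above, the following hedged Verilog sketch (not part of the thesis design; all names are illustrative) shows coding patterns that synthesis tools commonly infer as an embedded RAM block and a hardened multiplier; the exact inference rules depend on the vendor toolchain.

    // Illustrative only: patterns that tools commonly map to hard blocks.
    module hard_block_inference (
        input  wire        clk,
        input  wire [7:0]  wr_addr, rd_addr,
        input  wire [15:0] wr_data,
        input  wire        wr_en,
        output reg  [15:0] rd_data,
        input  wire [17:0] mul_a, mul_b,
        output reg  [35:0] mul_y
    );
        // Simple dual-port memory, usually inferred as an embedded RAM block
        // (M9K blocks on Cyclone IV).
        reg [15:0] ram [0:255];
        always @(posedge clk) begin
            if (wr_en) ram[wr_addr] <= wr_data;
            rd_data <= ram[rd_addr];
        end

        // Registered multiply, usually mapped to a hardened 18x18 multiplier.
        always @(posedge clk)
            mul_y <= mul_a * mul_b;
    endmodule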

When the design uses a large portion of the device resources or has high requirements for maximum frequency, this is a computing challenge even for modern processors (an Intel Sandy Bridge CPU, the i7-2600K@3.4GHz, compiles the design described in this work in 9 minutes, although on a single core; and this is a relatively small design). For this reason, appropriate hardware is necessary for the development.

Figure 2.2: Example FPGA development software[20]

2.3 Future possibilities of FPGAs

Currently, the fastest performing FPGA is probably one of the Speedster22 device family from Achronix[8]. Since the major performance limiting factor in current FPGAs is the interconnect, the Achronix device avoids this bottleneck by time multiplexing the routing resources. By doing this, the Speedster22i device is capable of providing 1.5GHz peak processing performance. Since today's high end is tomorrow's low end in the semiconductor industry, we may see a rapid increase in the processing power of even low cost FPGAs in the coming years.

The discovery of the memristor[9] may be an important step towards developing new generations of FPGAs. HP is currently developing a memristor based FPGA[10].

The standard PC architecture may also include elements of FPGA fabric in the future or be entirely replaced by programmable logic. This is signified by the Intel Stellarton[11] CPU, which includes an Intel Atom processor together with an Altera Arria II FPGA die in a single package. The FPGA is currently used as an H.264 encoding accelerator.

2.4 Video processing on an FPGA

Processing a video stream usually involves operations on either the video signal timing or on the raw bitmap data of individual frames or fields. The FPGA architecture is well suited for video processing for the following reasons:

Video timing generation is relatively straightforward with an FPGA. Even the logic fabric of low cost FPGA families is usually capable of supporting 150+ MHz IP components, therefore allowing the generation of HD resolutions (a minimal timing generator sketch is shown after this section).

Processing the raw frame data can take advantage of the hardened DSP blocks to ease the timing requirements for the logic fabric itself. Together with pipelining the individual algorithm operations, this allows the design of complex video processing paths even with HD resolutions.

By being close to the metal, the algorithms on an FPGA can be more power efficient than systems using a CPU core to perform the processing functions.

Due to the FPGA flexibility, the video processing path can be tailored to specific project requirements.

The flexibility of the FPGA architecture may prove useful for small production series, where the development costs of an ASIC solution may be prohibitive.

For these reasons, the processing functions required for the project described in this work were implemented on an FPGA.
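As an illustration of the first point, video timing generation, the following minimal Verilog sketch generates the synchronization signals of a progressive format. The counter limits shown are those of the 1080p60 format (2200 x 1125 total, 1920 x 1080 active); it is a simplified example, not the timing logic used in the thesis design.

    // Minimal video timing generator sketch (1080p60 parameters assumed).
    module timing_gen_1080p (
        input  wire        clk_148_5,   // pixel clock, 148.5 MHz for 1080p60
        input  wire        reset,
        output reg  [11:0] hcount,
        output reg  [10:0] vcount,
        output wire        de, hsync, vsync
    );
        localparam H_ACTIVE = 1920, H_FP = 88, H_SYNC = 44, H_TOTAL = 2200;
        localparam V_ACTIVE = 1080, V_FP = 4,  V_SYNC = 5,  V_TOTAL = 1125;

        always @(posedge clk_148_5) begin
            if (reset) begin
                hcount <= 0;
                vcount <= 0;
            end else if (hcount == H_TOTAL - 1) begin
                hcount <= 0;
                vcount <= (vcount == V_TOTAL - 1) ? 11'd0 : vcount + 11'd1;
            end else
                hcount <= hcount + 12'd1;
        end

        // Active picture indicator and synchronization pulses.
        assign de    = (hcount < H_ACTIVE) && (vcount < V_ACTIVE);
        assign hsync = (hcount >= H_ACTIVE + H_FP) && (hcount < H_ACTIVE + H_FP + H_SYNC);
        assign vsync = (vcount >= V_ACTIVE + V_FP) && (vcount < V_ACTIVE + V_FP + V_SYNC);
    endmodule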

Chapter 3

Broadcast video transport standards

Today, with a few exceptions (e.g. the VGA interface), video signal representation has transitioned from the analog to the digital domain. The most obvious advantage of digital representation over analog is that the video data is not altered in any way by the transmission. With analog representation, this was not possible due to effects like noise and line losses, which in most cases corrupted the transmitted information.

Regardless of the selected video interface standard, video data is divided into discrete images called frames. A frame is a bitmap image, transferred over the transport interface from top to bottom line by line, with each image line being transmitted from left to right. Therefore, the transmission of a frame starts with the top left pixel and ends with the bottom right pixel. The rate at which the video frames are transferred is called the frame rate.

Figure 3.1: Progressive frame structure[12]

The video format can be either progressive or interlaced. In a progressive video stream, a frame is transferred in whole, meaning it is a complete representation of the video image at one point in time. With an interlaced stream, frames are divided into halves called fields. Fields can be either odd or even, where the odd field contains the odd lines of the frame and the even field contains the even lines. When the stream is transferred as interlaced video, the motion appears smoother because this format effectively doubles the temporal resolution of the stream (compared to a progressive stream with the same resolution and bandwidth).

Figure 3.2: Interlaced frame structure[12]

The video data represent the scene in some predefined color space. The most commonly used color spaces are RGB and YCbCr. With the RGB color space, the pixel has red, green and blue components to identify its color. The RGB standard is widely used in the PC industry for video data representation and as the graphics card output format. When using the YCbCr color space, the pixel has luminance (brightness) and chrominance (color) coordinates to identify the color. Conversion between these color spaces ranges from straightforward to fairly complex, depending on the requested conversion quality (a simple conversion sketch is shown below).

The horizontal and vertical resolution of the frame, the frame rate, the color space and the progressive/interlaced identifier together form a video format. Video formats are standardized by organizations such as VESA[16] or SMPTE[17].

This chapter gives an overview of the video transport standards used for video input and output of the presented video processing system.
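As a simple example of such a conversion (and not the conversion used in the presented system), the following Verilog sketch implements a common fixed point approximation of the BT.601 YCbCr to RGB transform in two pipeline stages; the coefficients are the usual x256 integer approximations.

    // Hedged sketch of a YCbCr (BT.601, studio swing) to RGB conversion stage.
    module ycbcr_to_rgb (
        input  wire       clk,
        input  wire [7:0] y, cb, cr,
        output reg  [7:0] r, g, b
    );
        // Stage 1: remove the luma/chroma offsets.
        reg signed [9:0] c, d, e;
        always @(posedge clk) begin
            c <= $signed({2'b00, y})  - 10'sd16;
            d <= $signed({2'b00, cb}) - 10'sd128;
            e <= $signed({2'b00, cr}) - 10'sd128;
        end

        // Stage 2: multiply-accumulate, round, shift and clamp to 0..255.
        reg signed [21:0] r_t, g_t, b_t;
        always @(posedge clk) begin
            r_t = (298 * c + 409 * e + 128) >>> 8;
            g_t = (298 * c - 100 * d - 208 * e + 128) >>> 8;
            b_t = (298 * c + 516 * d + 128) >>> 8;
            r <= (r_t < 0) ? 8'd0 : (r_t > 255) ? 8'd255 : r_t[7:0];
            g <= (g_t < 0) ? 8'd0 : (g_t > 255) ? 8'd255 : g_t[7:0];
            b <= (b_t < 0) ? 8'd0 : (b_t > 255) ? 8'd255 : b_t[7:0];
        end
    endmodule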

Figure 3.3: YCbCr color space at 0.5 luminance[13]

3.1 Parallel digital data

The representation of video data as a parallel clocked bus is most common when connecting different integrated circuits on a printed circuit board. The bus contains a master clock signal, horizontal and vertical synchronization signals, an active picture indicator (data valid signal), a field identifier for interlaced formats and the video data itself. This format with separate horizontal and vertical synchronization is the most commonly used, probably for its universality. Although embedded synchronization can be used (the synchronization signals are not separate wires but are embedded as special sequences directly in the video data), this may cause design complications when using video processing ICs which each expect differing embedded synchronization sequences because of differing standards (e.g. BT.656 vs BT.1120).

The parallel transmission format requires that the individual bit wires have their lengths closely matched to each other to ensure that the pixel wavefront is properly aligned at the receiver side. With today's high resolutions and therefore high pixel clock rates, this data format may also cause problems with signal crosstalk or reflections from impedance discontinuities, therefore it is good practice to use some kind of termination at both the transmitter and receiver sides.
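To illustrate how a parallel video bus with separate synchronization is typically handled inside an FPGA, the following hedged Verilog sketch forwards only the DE-qualified active pixels, turning the exact-timing bus into a simple data stream (for example feeding a FIFO). The signal names are illustrative and do not come from the thesis design.

    // Hedged sketch: exact-timing parallel video bus -> simple pixel stream.
    module parallel_to_stream (
        input  wire        pix_clk,
        input  wire        de,            // active picture indicator (data valid)
        input  wire        vsync,         // vertical synchronization
        input  wire [23:0] pix_data,
        output reg         stream_valid,  // e.g. write enable of a dual clock FIFO
        output reg  [23:0] stream_data,
        output reg         stream_sof     // flags the first active pixel of a frame
    );
        reg in_frame;
        always @(posedge pix_clk) begin
            stream_valid <= de;
            stream_data  <= pix_data;
            if (vsync) in_frame <= 1'b0;      // vertical blanking between frames
            else if (de) in_frame <= 1'b1;
            stream_sof <= de && !in_frame;    // first active pixel after vsync
        end
    endmodule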

3.2 Serial digital interface (SDI)

Serial digital interface[18] is a video transport standard used mainly in the broadcast and medical industries. It uses shielded coaxial cable as a medium and allows for transfer rates from 270Mbit/s to 3Gbit/s. It can be thought of as a serial encapsulation of parallel digital data. On the transmitting side, the data is serialized to a high speed serial form and on the receiving side the data is deserialized back to the parallel format.

Figure 3.4: An example of an SDI connector[14]

SDI uses the NRZI encoding scheme to encode data and a linear feedback shift register to scramble the data to control bit disparity. The video stream can also include CRC (Cyclic Redundancy Check) checksums to verify that the transmission occurred without an error.

3.3 Digital Visual Interface (DVI)

DVI is an interface for transferring digital video and is used frequently in the PC industry. The interface uses TMDS (Transition Minimized Differential Signaling) to transfer data over four twisted pairs of wires (three for data and one for clock). Because this interface is frequently used to connect the graphics port of a computer to a display, DVI also includes support data channels to allow the computer to identify the device being connected. This interface is called EDID and is basically a serial EEPROM with information about the display vendor and supported resolutions.

This interface can also be thought of as a serial encapsulation of parallel data, but compared to SDI it uses three serial data channels to transport the data. This reduces the bandwidth requirements for a single serial channel and therefore reduces the quality requirements for the cabling used.

Chapter 4

Project requirements

This chapter describes the various requirements for the processing hardware. The device using the FPGA video processor is to be used in a medical environment for displaying live video from endoscopic cameras during surgeries. The system also has to be able to record the video and store the feed either locally or via a network, but these functions are handled by a standard x86 system embedded in the device and as such are not the topic of this work.

4.1 Video deinterlacing

Based on customer requirements, the video processor must handle two input video formats, one progressive and one interlaced video feed. This requirement comes from the fact that the system will usually be delivered with an HD camera which has two settings for the output video resolution, 720p and 1080i. Since the customer wants to be able to use a standard monitor (most of which do not handle interlaced video timings very well), the 1080i interlaced video must be internally converted to 1080p. This video format can be displayed on a standard monitor with no timing problems.

4.2 Low latency

The system is to be used for live video display during surgical operations. The device processes the video signal from the endoscope, which is then output on a monitor. The surgeon navigates by the displayed video image, so the processing delay must be as small as possible. If the delay were too large, the surgeon would see the position of the operating tool later than it actually occurs and might perform a critical intervention on the patient based on outdated information, which would be hazardous.

4.3 On-screen display generation

When displaying live video from the endoscopic camera, the system also has to mix some additional information into the picture. This information includes the patient name, system settings, buttons for touch controls if the attached monitor has a touch panel and an indicator of the free space available for the recorded video.

4.4 Video stream switching

One of the features that the customer requested was the ability to display both the live video feed and an administrative GUI application running on the system on a single monitor. From this stems the requirement to switch between two video streams seamlessly, so as not to cause the attached display monitor to resynchronize to a new timing, as would happen if the transition were made by a simple switch.

4.5 Image capture

The system must be able to take snapshots of the displayed video feed. Although this could be handled by the embedded x86 system in a similar way as the video recording, because of another customer request that the captured image be frozen on screen for a few seconds so the surgeon can see what the picture shows, it was decided that this function will be handled by the hardware.

Chapter 5

Device family selection

This chapter discusses the selection of the FPGA device family to realize the required functions of the system. After preliminary tests of video processing components on a separate board developed for said testing, it was concluded that even low cost FPGA families from the major manufacturers were sufficient to implement Full HD video processing. Based on this conclusion, the family selection was limited to low cost field programmable gate arrays. FPGA families are usually divided into several generations, each generation contains devices of varying sizes and features and each device is manufactured in various packages and speed grades.

5.1 Design requirements

The project requirements described in the previous chapter were translated into design requirements for the FPGA chip performance and the required peripheral functions. Since the design seemed most likely to require a frame buffer component, some form of large temporary memory was needed to store the incoming video frames. It was decided that the system will use DDR2 memory for its relatively low cost and sufficient performance.

Based on the incoming video formats specified by the customer, the required memory bandwidth for the frame buffer was estimated (in bytes):

    1920 (width) x 540 (height) x 60 (fps) x 4 (Bpp) x 2 (read + write) = 474 MB/s

Including a margin for read/write bank switching and memory refresh cycles, it was concluded that a single DDR2 x16 chip fulfills this bandwidth requirement, since (in bytes):

    2 (data width) x 2 (transfers per clock) x (memory clock frequency) = 800 MB/s

Therefore, the target device must be able to instantiate a DDR2 x16 memory controller core to interface to the external DDR2 x16 memory chip. The total number of pins required was estimated to be in the range of 150 to 180. This included two video inputs, the USB link connection, the DDR2 memory interface and the support I/O functions of the FPGA.

The maximum frequency required for any part of the design was estimated to be 150MHz-180MHz for the most demanding components, namely the DDR2 memory interface and the deinterlacer module. The selection of the FPGA device family was based on these requirements together with a preference for wide availability and good online support.

5.2 Altera Cyclone family

Altera manufactures low cost FPGA chips under the Cyclone family name. This family includes devices ranging from about 3,000 logic elements (LEs) up to more than one hundred thousand LEs. The FPGA chips of this family also contain up to several megabits of embedded memory blocks and multipliers for DSP processing, and are offered in a range of package sizes and pin counts. The Cyclone family supports the instantiation of DRAM memory device controllers.

Figure 5.1: Altera Cyclone IV family logo[5]

The Cyclone family is currently divided into four generations, Cyclone I to Cyclone IV (as of the time of writing of this work, the Cyclone V family has been announced by Altera with samples available, but mass production of this family is planned for 2012). These generations differ in power consumption, densities, supported peripheral features and the maximum frequency the logic fabric of the device is able to support for a given HDL design.

The family generation selection was reduced to Cyclone III and Cyclone IV. These families are more advanced than the I/II generations and, due to advances in lithographic processes, are cheaper and have better availability. Also, since the Cyclone IV is basically a shrink of the Cyclone III, converting a given design between these families is a relatively simple task.

5.3 Xilinx Spartan family

The other major manufacturer of Field Programmable Gate Arrays, Xilinx Inc., offers device families with features similar to Altera's. The Xilinx counterpart is branded under the name Xilinx Spartan. The Spartan devices are also divided into device generations based on advancements in FPGA design. The device families considered were Spartan-6 and Spartan-3, due to the relatively large community support for designs based on these devices. The FPGA chips from the Spartan-6 device family include hardened memory controller blocks for interfacing an external DRAM memory chip.

5.4 Lattice Semiconductor Corporation

Lattice Semiconductor is the third largest FPGA manufacturer and although it was also taken into consideration, due to a perceived lack of good online support the devices from Lattice Semiconductor were not given any further evaluation.

5.5 Final selection

The device family selected to implement the requested functions of the system was Altera Cyclone III/IV. This decision was influenced by several factors. The low cost FPGA devices from Altera are on par with the low cost devices offered by Xilinx when comparing features like price, performance, capabilities and package options. Since the selected manufacturer will probably also be used in future projects requiring some form of FPGA processing, the availability of IP cores was taken into account. Since the company is trying to enter the medical video processing market, it is necessary to have video processing cores available. Although many exist for the Xilinx devices, Altera offers a complete package for video processing, the Altera Video and Image Processing Suite (VIP)[12].

Both manufacturers' FPGA development environments were evaluated, the Altera Quartus II and the Xilinx ISE Design Suite. It was concluded that Altera Quartus II is the better solution, because it integrates all required functions (design entry, compilation, simulation, programming) into one package. Also taken into account was the wide availability of cores adhering to the Altera Avalon Interconnect Fabric standard, which together with the SOPC Builder software simplifies system design.

To provide a complete and realistic overview of the reasons influencing this decision, it must also be noted that one of the reasons tipping the selection in Altera's favor was the author's familiarity with devices from this manufacturer from lectures at FI MU.

Chapter 6

Evaluation of commercial IP cores from Altera

Before designing the final hardware board to be used in the device, we designed an evaluation platform to test the video processing functions inside the FPGA and the interface chips used to convert the different video transmission standards to and from the FPGA input/output format. The evaluation board included a Cyclone III FPGA with 40k logic elements in the fastest speed grade available (EP3C40F484C6N). The FPGA had two DDR2 memory channels available, each consisting of two 16-bit DDR2 memory chips. The FPGA was connected to SDI input interface chips, a DVI (TMDS) receiver, an output DVI transmitter, a USB communication bridge to allow for a PC connection and other support ICs. On this board we evaluated the relevant IP cores to be used throughout the project and later developed our own intellectual property components.

6.1 Video and Image Processing Suite (VIP)

The first components to evaluate were those from the Altera Video and Image Processing Suite. We were mainly interested in the Deinterlacer and Switch IP cores. The VIP cores can be instantiated either standalone or inside an SOPC system. With both approaches, the user is offered a MegaWizard configuration interface to select the required core functionality.

We used the core in the Bob - Scanline Interpolation mode. This deinterlacing method adds lines to each half field by calculating the missing odd/even lines of the field. We selected this mode because this interpolation method produces a relatively clean image with no visible artifacts from merging two fields (such as the Weave method produces) and because this algorithm buffers only a few lines of the image, so it introduces very little delay.

The visual quality of the processed video feed was found acceptable for the project, but the stability of the IP generation was found unsatisfactory. For visual quality testing, we used the IP core version integrated in the Quartus II package version 9.0. This version performed with no problems. When we switched to Quartus II version 9.1, we could not compile any design containing the core. When the core was configured and the system was being generated (the parametrization of the core was under way by the configurator), the Quartus IDE crashed and was not able to recover.

Since we had to use the 9.1 version (because it included a Switch component which we needed and which was not included in the 9.0 version IP library), we had to abandon the provided deinterlacer component from Altera and develop our own solution. At the time of writing this thesis, the compilation of the deinterlacer core runs without any problems. This incident illustrates that the FPGA development toolchain is a rather complex piece of software and should be thoroughly evaluated before being used in a design.

Figure 6.1: Example of the MegaWizard configuration interface[20]

6.2 DDR2 High Performance Controller II

We needed some form of large temporary storage memory to store the video data when synchronizing two video streams (frame buffer) and to store the captured image for the still image capture function of the system. We decided to use DDR2 memory because it is the newest DDRx standard electrically supported on the Cyclone III/IV device family architecture. On the evaluation board we had two DDR2 memory channels integrated, each consisting of two 16-bit DDR2 chips. This meant a 64-bit effective transfer size per clock and a 128-bit smallest transfer size when considering the DDR2 minimal memory side burst size of 4 beats according to the JEDEC specification.

Regarding the memory access pattern, we needed to read and write sequential areas of memory and therefore did not need the short memory side burst lengths of DDR1 memory, which could be more appropriate for other algorithms such as realtime image rotation. We tested the memory controller core by running the memtest example included in the Nios II Embedded Design Suite. The tests passed with no problems and we therefore decided to use this core.

6.3 NIOS II soft processor

A soft core processor was needed to control the FPGA hardware. Altera provides the Nios II 32-bit embedded processor for use on its devices. The processor core can be configured in three versions: Economy, Standard and Fast. Since we did not need any video processing functions done on the CPU, we could use the Economy version of the core. The JTAG debugging and communication feature of the Nios II EDS development software proved very handy when debugging the system later in the project.

Chapter 7

Selected system structure

This chapter describes the resulting internal FPGA video processor structure. This setup emerged after several design iterations. The structure took shape after considering the project requirements described above. It was necessary to display both the video feed from the endoscopic camera and the administrative GUI application running on the internal x86 system, therefore the FPGA has two video inputs. It is necessary to display the output video, so the FPGA has a video output. We needed some way to communicate with the PC for system control and captured image transfer; for this reason the FPGA is connected to a USB FIFO bridge. To provide storage space for triple buffering and captured image storage, the FPGA has a DDR2 memory attached.

The design was created using the Quartus II IDE. A schematic file was selected as the top level entity to provide a clear way of showing the system structure inside the Quartus project. Compared to an HDL top entity such as a Verilog or VHDL file, the schematic quickly shows how the individual blocks are connected and communicates this information to the hardware designer.

The block diagram and individual components of the system are discussed below. The components are covered only in brief detail; the three components comprising the core of this work are described in detail in a separate chapter.

7.1 Block diagram

The system block diagram in figure 7.1 shows only the components relevant to the video processing paths. Supporting components like clock domain crossing, external support signals for the PCIe grabber, video resolution detectors, color space conversion cores etc. are not shown to maintain clarity.

The system takes two video streams as input, processes them and outputs a single video feed. The video inputs are the camera input and the PC video input. The timing of the PC video input is taken as a reference timing, onto which the camera video feed is synchronized and blended using the frame buffer component. The frame buffer is also connected to the USB link, providing a way to dump the contents of a memory location containing a captured image to a PC.

One video processing path is the camera feed, the other is the PC video feed. The camera video has to be synchronized to the PC video signal timing and, if in an interlaced format, it also must be deinterlaced.

Placing the deinterlacer after the frame buffer component saves memory bandwidth, since it allows for buffering of the half fields only; the final full frame is calculated by the deinterlacer after the synchronization phase. This also means that the images transferred to the host x86 system are half fields (for the 1080i interlaced input video format) and have to be stretched to the original aspect ratio. It was found that this solution is perfectly acceptable, since there is no visible reduction in the quality of the captured image.

Figure 7.1: Final video processing system block diagram (camera video input, PC video input, frame buffer, deinterlacer, alpha blender, video output, USB link)

7.2 Camera video input

The component providing the video input to the system is designed as an Avalon Interconnect Fabric compatible component.

The input parallel video data from the external SDI receiver chip are converted from the YCbCr color space into the RGB color space using a simple pipelined calculation and then the data are fed into a dual clock FIFO (a standard component provided by Altera). This effectively transitions the data from exact-time-formatted data for display into a data stream suitable for internal processing. The remainder of the component is an Avalon Memory Mapped Master externally controlled by the Nios II CPU. The master component can be thought of as a DMA engine, which transfers a video frame into the DDR2 storage memory using long Avalon side bursts. To relax the frequency requirements for the core logic fabric, the width of the bus from the FIFO to the memory controller IP is set to 64 bits. This effectively halves the frequency at which the bus must run to transfer the data and therefore eases the fitter's effort to reach timing closure. This component is displayed as standalone in the block diagram but is effectively part of the frame buffer subsystem.

7.3 Frame buffer

The frame buffer subsystem provides the means to synchronize the camera video feed to that of the PC. Although this introduces a delay (of at most one half field), which could be avoided by synchronizing the PC video feed to that of the camera instead, it was assumed that since the PC video feed comes from inside the device, from the host x86 system, it is more reliable and more under control than the unknown camera signal from outside the system and is therefore more usable as a reference timing signal.

The frame buffer uses a standard triple buffering scheduling algorithm, where one buffer is always available to store an incoming video frame (a generic sketch of such a scheduler is shown below). This provides the means to synchronize the two video streams, since the frame buffer scheduler can either drop or repeat a field to match the required timing. Together with writing the raw image data to the DDR2 memory, the scheduler also registers whether the currently transferred field is odd or even. The scheduler has the field signal available from the external SDI receiver chip. This information is later used to properly configure the deinterlacer block at the output of the frame buffer. The frame buffer then includes an output component which reads a stored field from memory and outputs the data into the input dual clock FIFO of the deinterlacer component. The frame buffer component is described in detail in a separate chapter.

7.4 USB link

The frame buffer subsystem also contains a USB link component on the Avalon Interconnect Fabric. This provides the capability to transfer the stored image data to the PC using a USB FIFO bridge (FT2232H from FTDI[21]). The size of a single half field in the 1080i video format is about 4 megabytes and is transferred in under two seconds.
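To illustrate the triple buffering idea described in section 7.3, the following is a generic Verilog sketch of the buffer bookkeeping (illustrative only; the thesis frame buffer scheduler is described in its own chapter and its signal names differ). The writer always owns one buffer, the reader always owns one, and the third holds the most recently completed field, which lets the input side drop fields and the output side repeat them.

    // Generic triple-buffer bookkeeping sketch, not the thesis implementation.
    module triple_buffer_ctrl (
        input  wire       clk, reset,
        input  wire       field_written,   // pulses when a full field has been stored
        input  wire       field_read,      // pulses when the output side needs a new field
        output reg  [1:0] wr_buf,          // buffer index currently being written
        output reg  [1:0] rd_buf           // buffer index currently being displayed
    );
        reg [1:0] latest;                  // most recently completed buffer

        // Pick the buffer index (0..2) that is neither a nor b.
        function [1:0] free_buf(input [1:0] a, input [1:0] b);
            free_buf = (a != 0 && b != 0) ? 2'd0 :
                       (a != 1 && b != 1) ? 2'd1 : 2'd2;
        endfunction

        always @(posedge clk) begin
            if (reset) begin
                wr_buf <= 2'd0;
                rd_buf <= 2'd1;
                latest <= 2'd1;
            end else begin
                if (field_read)
                    rd_buf <= latest;      // repeat 'latest' if no new field has arrived
                if (field_written) begin
                    latest <= wr_buf;      // the field just finished becomes the newest
                    // continue writing into a buffer not owned by the reader
                    wr_buf <= free_buf(wr_buf, field_read ? latest : rd_buf);
                end
            end
        end
    endmodule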

The USB interface IC has two channels: one is configured for the RS232 standard and is used for FPGA system control, the other channel is a one way communication link to the PC for captured still image transfers. The control channel is connected to a UART component of a controlling SOPC system with the Nios II soft core processor.

7.5 Deinterlacer

The deinterlacer component is fed video data by the frame buffer component; this video data is deinterlaced (if requested) into a full frame and output to the alpha blender component. The deinterlacer core is described in a separate chapter.

7.6 PC video input

The video data from the host x86 system is fed into the FPGA using an external DVI receiver chip. The data passes into the alpha blender, where it is mixed with the video feed from the camera and output from the FPGA into an external DVI transmitter IC.

7.7 Alpha blender

The alpha blender IP core takes the two video streams and mixes them together using an alpha value provided by the Nios II controller system. The alpha value is controllable from the PC over the USB link and therefore allows for video stream switching.

The alpha blender includes a transparent color definition. When the alpha blender encounters a pixel with this color in the video data of the host x86 system, it displays the original pixel color from the camera feed regardless of the alpha setting. This basically provides the overlay function known from the PC world. It was implemented to allow the system not only to blend the two streams together, but also to enable on screen display (OSD) generation. The transparent color definition allows displaying the original camera video data with a non-transparent OSD mixed on top of this feed. The alpha blender component is also described in more detail in a separate chapter.

Chapter 8

Example video processing cores

This chapter describes the IP cores developed to provide the video processing functions of the system, as required by the project requirements. All the cores were written in the Verilog HDL language. Compared to VHDL, the Verilog hardware description language was perceived as more readable and developer friendly. The cores process the video data in a stream format - the input components of the processing chain convert the video data from the exact timing format to a stream format, stripping the video data of the synchronization information and forwarding only the active picture data.

8.1 Deinterlacer

Deinterlacing is used to convert an interlaced video format to a progressive one. In an interlaced video stream, each complete frame is transferred as two half fields, odd and even. The odd field contains the odd picture lines and the even field contains the even picture lines. By splitting the complete frame into two half fields, the temporal resolution of the video feed is doubled and the motion appears smoother. A progressive video format transfers frames as complete units, each frame containing all (odd and even) lines of the picture. Progressive video does not have the same temporal resolution as an interlaced video with the same bandwidth; on the other hand, it offers better vertical resolution and therefore a more detailed image. The interlaced format is commonly used in broadcast applications and the TV industry, whereas the progressive format is more common in the PC industry.

Algorithm overview

A system converting an interlaced video signal to progressive is called a deinterlacer. There are several methods to accomplish the conversion:

Bob - line duplication
Bob - line interpolation
Weave - merging of lines of odd and even fields
Motion adaptive algorithms

Bob is a name given to algorithms needing only one half field to produce a complete progressive frame. The individual approaches are described below.

Line duplication

The line duplication algorithm simply takes the input line and produces two lines on the output, each the same as the image line on the deinterlacer input. This is the simplest deinterlacing algorithm, but also the one with the lowest output image quality. Since the half field lines are duplicated, the output progressive image appears pixelated in the vertical direction. This is especially visible on sharp, highly contrasting edges in the image.

Figure 8.1: Output of the line duplication algorithm[23]

Line interpolation

The line interpolation algorithm does not replicate the missing lines; instead it calculates each missing line from the lines above and below it. This produces a complete frame from a single half field, with the quality of the output image better than that of the line duplication algorithm. The most visible improvement is that sharp contrasting edges appear smoother thanks to the interpolated lines.

Weave algorithm

The weave algorithm uses two half fields to produce a progressive output frame. The method works by merging the odd and even fields directly into one frame.

Figure 8.2: Output of the line interpolation algorithm[23]

Compared to the Bob algorithms, this method needs a storage memory to temporarily store the half field data. It also introduces a half field delay into the processing chain, since the deinterlacer must wait for a complete field to produce an output progressive frame. The output quality of this algorithm is compromised by artifacts on edges in the resulting progressive image; since the fields used to produce the output originate at different points in time, when the video feed contains scenes with fast movement the edges appear distorted, because each field captures the moving object in a different position.

Motion adaptive algorithm

Motion adaptive algorithms try to predict the areas of the image with movement and compensate for the motion by calculating the final progressive frame from several preceding half fields of the interlaced input. In addition to the requirement to store the preceding half fields, this algorithm also introduces a delay into the video processing chain. This delay depends on the specific motion adaptive algorithm used.

Algorithm selection

After testing the above mentioned algorithms on a development board, we selected the line interpolation algorithm. The quality of the output image was found acceptable, since the edges appear smooth and there are no saw tooth artifacts as is the case with the weave algorithm. Also, since this method does not need any preceding half fields to produce an output frame, the latency introduced into the processing chain is very small - typically the duration of a single image line.

Figure 8.3: Output of the weave algorithm[23]

Principle of operation

The deinterlacer core uses the line interpolation algorithm to convert the input interlaced video to the progressive output format. The input data in stream format are fed to the core by the frame buffer component. The output of the core is connected to the alpha blender core, where it is mixed with a second video feed and output to the external DVI transmitter chip.

The core is reset at the beginning of each input field. After the reset signal is deasserted, the core detects whether the current field is odd or even and also registers the video format resolution as detected by the preceding components of the video processing chain. The core uses two counters, x and y, to store the actual position within the video frame. The core has three options as to what to do with each processed line:

A - store the incoming line into the temporary buffer and, at the same time, output the line
B - store the incoming line into the temporary buffer and, at the same time, output the average (interpolation) of the line currently being stored in the temporary buffer and the line already saved in the temporary buffer
C - do not store anything, just output the line already stored in the temporary buffer

The decision between performing action A, B or C is made by the core scheduler. This part of the deinterlacer keeps track of the current position in the video image and configures the remaining parts of the component at the beginning of each image line.

Implementation

The core is implemented as a schematic file instantiating the subentities designed in the Verilog hardware description language. The core also uses Altera specific components included in the Quartus IP library. The core has three main groups of virtual I/O pins exported to the higher level design file - video stream input, video stream output and control signals.

The video stream input pin group is composed of the signals fifo_data_input[63..0], fifo_rdempty_input and fifo_rdreq_output. These signals form an interface to the FIFO data buffer of the frame buffer component. The video stream output pin group is composed of the pins out_rdreq, out_rdclk and out_data[23..0]. These signals provide the interface to the alpha blender component mixing the two streams to form the output video signal. The remaining signals form the deinterlacer core control signals. The main signals in this group are clock, reset, video resolution and the deinterlacing enable signal, used to optionally disable the deinterlacing and let a progressive video format pass through unchanged for the 720p progressive input camera video format.

The core scheduler is located in the deinterlacer controller submodule. This module controls the state transitions at the beginning of each image line as described in the previous chapter. The core scheduler algorithm is the following:

    if (can_advance == 1) begin
        x = x + 1;
        if (x == x_size) begin
            x = 0;
            y = y + 1;
        end
        if (field == 0) begin
            if (y == 0) master_state = 2;
            if ((y >= 1) & (y < (y_size_minus_one))) master_state[1:0] = {1'b0, y[0]};
            if (y == (y_size_minus_one)) master_state = 0;
        end else begin
            if (y == 0) master_state = 2;
            if (y == 1) master_state = 0;
            if ((y >= 2) & (y < y_size)) master_state[1:0] = {1'b0, y[0]};
        end
    end

The variables x and y contain the actual position within the video image data, field is the even/odd field indicator, master_state[1:0] is a variable indicating which of the actions A, B or C the deinterlacer should perform on the actual line and can_advance is a signal indicating that the remaining core components are ready for the next data item.

The deinterlacer_ram_buffer module is an Altera-specific instantiation of an embedded memory block forming a RAM memory to store the image line. The address of this embedded RAM memory block is controlled by the scheduler; the deinterlacer_mem_addr_delay module delays the address signals for the line operation B. Operation B means that the deinterlacer must store the incoming line to the RAM buffer and at the same time load the data from the very same memory buffer. Therefore, it is necessary that the data from the buffer can be read out before the new image line data are saved to the buffer.

The deinterlacer_line_switch module provides the switching between operations A, B and C as requested by the scheduler module. Operation A (master_state = 2) means that the data received from the frame buffer component is stored to the RAM buffer and at the same time routed through the deinterlacer_line_switch to the output FIFO. Operation B (master_state = 1) means that the incoming data is stored to the RAM buffer and at the same time the previous line data stored in the RAM buffer are read out and sent to the deinterlacer_line_switch, where the pixel data is averaged (interpolated) with the actual line data and sent to the output FIFO. Operation C (master_state = 0) does not read the incoming pixel data but instead simply outputs the stored line from the RAM buffer to the output FIFO.

The remaining components of the deinterlacer core are mainly support functions to properly align the individual data and control signals to compensate for the latency of the respective communicating components. To relax the requirements for the maximum frequency of the device logic fabric, the deinterlacer core processes two pixels at a time. This doubles the used data bus width, but at the same time allows the operating frequency to be halved while maintaining the required bus bandwidth.

The deinterlacer core expects the field data in a standard RGB color space with every color component having an 8 bit value range (0-255). The interpolation (vertical averaging) of the neighboring half field image lines is done by adding the individual red, green and blue components of the pixel color (the two pixels in the RAM buffer from the previous image line and the two pixels currently being received and stored to the RAM buffer) together and then doing a one bit position shift right, thereby calculating an arithmetic average of the two values.
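The add-and-shift averaging described above can be sketched as follows (a minimal single-pixel illustration with hypothetical port names; the actual line switch module processes two pixels per clock and includes the surrounding handshaking).

    // Sketch of the interpolation step: per-component add followed by a right shift.
    module pixel_average (
        input  wire [23:0] pixel_above,   // {R,G,B} from the buffered previous line
        input  wire [23:0] pixel_below,   // {R,G,B} from the line currently received
        output wire [23:0] pixel_out      // interpolated output pixel
    );
        // One extra bit per component holds the carry before the divide-by-two shift.
        wire [8:0] r_sum = pixel_above[23:16] + pixel_below[23:16];
        wire [8:0] g_sum = pixel_above[15:8]  + pixel_below[15:8];
        wire [8:0] b_sum = pixel_above[7:0]   + pixel_below[7:0];

        assign pixel_out = {r_sum[8:1], g_sum[8:1], b_sum[8:1]};   // >> 1 on each component
    endmodule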

Figure 8.4: Top level entity of the deinterlacer component[20]

8.2 Alpha blender

Alpha blending is an image processing algorithm for mixing two images into one, with the option to select the transparency of individual picture elements. In video stream processing, the input images are formed by the active picture data of the individual video frames. The transparency is selected by the alpha channel, which defines a transparency value for each pixel. The transparency range 0.0 to 1.0 can be translated to an integer representation, for example with 8-bit resolution the range is 0-255. The value 0 means that the first image is fully visible with no visual input from the second one and vice versa. The value of the final pixel is usually calculated by computing the individual elements of the pixel color for each coordinate in the pixel's color space. For example, with the RGB color space, the calculation can be described by the following equations:

    out_R = layerA_R * layerA_alpha + layerB_R * (1 - layerA_alpha)
    out_G = layerA_G * layerA_alpha + layerB_G * (1 - layerA_alpha)          (8.1)
    out_B = layerA_B * layerA_alpha + layerB_B * (1 - layerA_alpha)

The alpha value for each pixel can either be fixed for the entire image or delivered to the blender core as a separate value for each individual pixel, for example as the unused 8 bits within a 32-bit pixel memory window for 24-bit pixel colors. In this work, the blender core has a fixed value of the alpha channel for the entire active picture window. Although a per-pixel alpha channel was initially considered, the fixed alpha solution was preferred to provide a simple way of generating the OSD menu. The main reason for this decision was that the PC video feed is used as the source for the OSD menu and it would be problematic to transmit the alpha channel through the standard 24-bit per pixel DVI interface. Using the fixed alpha value, the entire pixel value range of the DVI interface can be used for the pixel color space coordinates and the OSD generation is achieved by simply displaying an image on the x86 host system graphics output.

This solution also has its drawbacks, most notably the inability to display a non-transparent OSD image on top of the live camera video feed. This was resolved by dedicating a single pixel color from the x86 host system as the transparent color value. When this color is encountered by the blender core, the value of the camera video pixel is assigned to the output, regardless of the alpha value setting. This allows for the generation of either a non-transparent or a semi-transparent OSD image on top of the live video feed.

Principle of operation

The core processes two input pixel streams and produces a blended pixel stream on the output. One input stream is a directly connected video feed from the x86 host system, which is used as a reference video signal for the output video feed.

This means that the output video feed has the same parameters (pixel clock, timing, resolution) as the video feed from the x86 host system. Into this video feed is mixed the live video signal from the camera input, using the preceding frame buffer and deinterlacer components. This allows the system to mix these two streams with no interruptions in the output video timing, since the camera feed is passed through the frame buffer component and can therefore be matched to the reference video signal.

The calculation of the output pixel value is divided into separate calculations for each color component of the pixel color. Each calculation of an output color component is then further divided into pipelined calculation stages to relax the timing requirements of the design compared to the case with no pipelining. For the calculation of the output values the blender core uses the equations 8.1 translated into the integer domain.

Implementation

The blender component is implemented as a Verilog HDL entity, instantiated in a higher level schematic design file in the Quartus design environment. The reference input video signal is fed to the core using the pixel_b_in[23..0] bus together with the reference video timing signals de_in, hsync_in and vsync_in. The core is clocked using the reference video signal clock connected to the core clock input clock_in.

Figure 8.5: Schematic symbol for the blender module[20]

The output video signal is formed by the output pixel_out[23..0] together with the timing control signals de_out, hsync_out and vsync_out. The output video feed uses the same clock as the input reference video feed, i.e. clock_in.

The following is a code walkthrough for a single color component (red). The core starts by registering the input information to reduce the length of the input path and therefore improve the maximum operating frequency of the core.

    always @(posedge clock_in) begin
        pixel_a <= pixel_a_in;
        pixel_b <= pixel_b_in;
    end

Now the core has the input pixel information available in its internal registers. The alpha value for the current video frame is registered during the vertical blanking interval of the reference video signal. This way, the alpha value is forced to be the same over each individual video frame. To further improve the maximum frequency of the core, the alpha values for both video inputs are calculated immediately (the layerA_alpha and (1 - layerA_alpha) expressions as described in equations 8.1).

    always @(posedge clock_in) begin
        if (vsync_in == 1) begin
            alpha_a[7:0] = alpha[7:0];
            alpha_b[7:0] = 8'd255 - alpha[7:0];
        end
    end

The blender core then continues by calculating the intermediate values for the expressions listed in (8.1). The core produces intermediate values for the pixel_a and pixel_b color components. Since there were some problems with the integer representation of the equations (uneven mapping of the multiplication results: with layerX_alpha = 255 the component output came out as 254), the core checks the alpha value and, if it is maximal, simply outputs the respective color component. If this is not the case, the core performs an integer multiplication of the pixel color component and the alpha value.

    always @(posedge clock_in) begin
        if (alpha_a == 255) begin
            red_a[15:0] = {pixel_a[7:0], 8'b00000000};
            red_b = 0;
        end else
            red_a = pixel_a[7:0] * alpha_a;
        if (alpha_b == 255) begin
            red_b[15:0] = {pixel_b[7:0], 8'b00000000};
            red_a = 0;
        end else
            red_b = pixel_b[7:0] * alpha_b;
        red_a_pipe = red_a;
        red_b_pipe = red_b;
    end

The intermediate multiplication values are pipelined using the additional registers red_a_pipe and red_b_pipe.
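For reference, a sketch of the integer translation implied by the code above (a restatement, not an additional mechanism): with 8-bit components and alpha values, each output component is effectively

    red_out ≈ (pixel_a[7:0] · alpha_a + pixel_b[7:0] · alpha_b) >> 8,   where alpha_b = 255 - alpha_a

Taking the upper 8 bits of the 16-bit sum divides by 256 rather than by 255, which is the source of the uneven mapping mentioned above: with alpha_a = 255 the product is 255 · 255 = 65025 = 0xFE01, whose upper byte is 254 instead of 255. The special case for the maximal alpha value restores the exact result by substituting the full-scale value {pixel[7:0], 8'b00000000} for the product.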

The core then continues by producing the final pixel output value. In this step the core also checks for the transparent color as described above and decides whether to output the pixel value based on the previous calculations or to output the camera video feed pixel value directly. The color selected as the transparent color for the video overlay is 0xFF00FF (magenta), considered very unlikely to appear in the x86 host system video output under normal conditions.

    always @(posedge clock_in) begin
        if (pc_1 == 24'hFF00FF) begin
            red[15:8]   = cam_1[7:0];
            green[15:8] = cam_1[15:8];
            blue[15:8]  = cam_1[23:16];
        end else begin
            red   = red_a_pipe   + red_b_pipe;
            green = green_a_pipe + green_b_pipe;
            blue  = blue_a_pipe  + blue_b_pipe;
        end
    end

The cam_1 register stores the camera input pixel value for the pixel currently being processed, while the pc_1 register stores the original reference video feed pixel value. Separate registers are necessary, since at this stage of the processing pipeline the original pixel_a and pixel_b registers already contain newer pixels due to the processing latency.

This is the last step of the processing pipeline. To compensate for the latency introduced by the individual processing stages, it is also necessary to properly align the output video timing signals to match the active picture data. This is done by a simple delay stage inside the blender core.

    always @(posedge clock_in) begin
        delay_3 = delay_2;
        delay_2 = delay_1;
        delay_1 = {de_in, hsync_in, vsync_in};
    end

The individual bits of the delay_3 register are then output as the final timing control signals.

At first, the design of the core did not use any pipelining and the core had very low performance in terms of the maximum allowable frequency of the incoming video signal. After introducing the pipelining stages, the core is capable of handling input video signal pixel clocks of 150+ MHz and therefore supports the required HD resolutions (the pixel clock of the 1080p video format is 148.5 MHz).
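For completeness, a minimal sketch of how such a blender could be instantiated at the top level is shown below. The blender port names follow the walkthrough above; the module name alpha_blender, the alpha input port and all surrounding dvi_*, deint_* and blend_* signal names are hypothetical placeholders for the actual top-level schematic connections, not the project sources.

    // Illustrative top-level hookup of the blender core described above.
    module top_sketch (
        input  wire        dvi_pixel_clock,            // reference (PC/DVI) pixel clock
        input  wire [23:0] dvi_pixel,                  // reference PC video feed
        input  wire        dvi_de, dvi_hsync, dvi_vsync,
        input  wire [23:0] deint_pixel,                // camera feed after frame buffer and deinterlacer
        input  wire [7:0]  osd_alpha,                  // frame-wide alpha value from the control logic
        output wire [23:0] blend_pixel,
        output wire        blend_de, blend_hsync, blend_vsync
    );
        alpha_blender blender_inst (
            .clock_in   (dvi_pixel_clock),
            .pixel_a_in (deint_pixel),
            .pixel_b_in (dvi_pixel),
            .de_in      (dvi_de),
            .hsync_in   (dvi_hsync),
            .vsync_in   (dvi_vsync),
            .alpha      (osd_alpha),
            .pixel_out  (blend_pixel),
            .de_out     (blend_de),
            .hsync_out  (blend_hsync),
            .vsync_out  (blend_vsync)
        );
    endmodule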

8.3 Frame buffer

Frame buffering is a method for synchronizing a data producer and a data consumer, each running at a different processing rate. It can be used in situations where the data stream is divided into compact units suitable for processing on a per-unit basis. This condition is fulfilled for video processing, since the video stream is composed of individual video frames and these are usually processed separately. Frame buffering works by allocating a number of memory buffer regions for storing the incoming data segments. The number of buffers used depends on the selected frame buffering method.

Figure 8.6: Principle of operation of a double buffering system

Double buffering uses two buffers, one for the data producer and one for the data consumer. Use of the double buffering method is limited to cases where the data production can be controlled so as to work synchronously with the data consumption. For example, this is the case with graphics cards in the PC industry. To remove video tearing during the display of rendered scenes, the GPU renders the scene into a different buffer than the one used to send the frame data to the monitor. These buffers are flipped in the vertical blanking period of the monitor output timing signal. This removes video tearing, but at the same time it introduces inefficiency into the rendering process, since the GPU has to wait for the start of the vertical blanking interval before it can start rendering the next frame (otherwise the GPU has no free buffer to render the scene into). For situations in which the data production cannot be synchronized with the data consumption, double buffering is not optimal, since this method has to drop a large number of data units. When processing video, this behavior is clearly visible in scenes with moving objects, where the motion appears choppy and unnatural.

Triple buffering removes the problems of double buffering by introducing a third buffer. This ensures that one buffer is always available for either the data producer or the data consumer. When triple buffering is used for the synchronization of two video streams, each having a different frame rate, this method can provide a simple frame rate conversion. For example, if the synchronized stream has a lower frame rate than the stream being synchronized to, the data consumer can reuse the currently processed frame by duplicating it, without any impact on the data producer and the frame storage. The same is valid for the opposite scenario, where the data producer stores frames at a higher rate than the data consumer is processing them. In this case, frames can be dropped from the output stream, again without any interruption of the data consumption process.

Figure 8.7: Principle of operation of a triple buffering system
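A minimal sketch of the buffer bookkeeping behind figure 8.7 is shown below. It is not the code used in the project's frame buffer component (which is built from Avalon templates, as described in the next section); it only illustrates, assuming three equally sized buffers and single-cycle handshake pulses, how the producer can always be handed a free buffer and the consumer the most recently completed frame, with frames dropped or duplicated as described above.

    // Illustrative triple-buffer index bookkeeping (not the project's actual code).
    module triple_buffer_ctrl (
        input  wire       clk,
        input  wire       reset,
        input  wire       frame_written,   // pulses when the producer completes a frame
        input  wire       frame_read,      // pulses when the consumer completes a frame
        output reg  [1:0] write_buf,       // buffer index the producer writes into
        output reg  [1:0] read_buf         // buffer index the consumer reads from
    );
        reg [1:0] spare_buf;     // third buffer: newest completed frame or idle
        reg       spare_is_new;  // spare holds a frame not yet handed to the consumer

        always @(posedge clk) begin
            if (reset) begin
                write_buf    <= 2'd0;
                read_buf     <= 2'd1;
                spare_buf    <= 2'd2;
                spare_is_new <= 1'b0;
            end else begin
                case ({frame_written, frame_read})
                    2'b10: begin
                        // Producer done: its frame becomes the spare (dropping any
                        // unconsumed spare frame); the producer continues in the old spare.
                        write_buf    <= spare_buf;
                        spare_buf    <= write_buf;
                        spare_is_new <= 1'b1;
                    end
                    2'b01: begin
                        // Consumer done: take the spare if it holds a newer frame,
                        // otherwise re-read the same buffer (frame duplication).
                        if (spare_is_new) begin
                            read_buf     <= spare_buf;
                            spare_buf    <= read_buf;
                            spare_is_new <= 1'b0;
                        end
                    end
                    2'b11: begin
                        // Both done in the same cycle: hand the freshly written frame
                        // straight to the consumer, recycle the consumer's old buffer.
                        read_buf     <= write_buf;
                        write_buf    <= spare_buf;
                        spare_buf    <= read_buf;
                        spare_is_new <= 1'b0;
                    end
                    default: ;
                endcase
            end
        end
    endmodule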

8.3.1 Principle of operation

The frame buffer was implemented as a component generated with the SOPC Builder. The module uses several subcomponents connected together by the Avalon Interface developed by Altera[25]. The main parts of the frame buffer module are the frame writer component, the frame reader component, the USB reader component and the necessary DDR2 memory controller core. By using the Avalon interface, the system could take advantage of the Altera-provided High Performance DDR2 Memory Controller core.

The frame writer was based on an example template of an Avalon Interface Memory-Mapped Master with burst writes[26]. This is a template for a master component located in an Avalon interface SOPC system. The component can be thought of as a DMA (Direct Memory Access) engine with the source data not residing in the target address space; it provides a bridging function between the Avalon subsystem and the user logic design. The original template includes a single clock FIFO to provide the input from the user design into the Avalon subsystem. For the purposes of this project, this single clock FIFO was replaced with a dual clock version, providing the clock crossing function required to move the video data from the camera video input clock domain to the internal Avalon subsystem clock domain. There were also some other modifications to the template, mainly regarding the start of the transfer. The bursting capability enables the DMA engine to transfer large portions of the video data at a time, which greatly improves the efficiency of the DDR2 memory access. The bursting capability is discussed further in the implementation section.

Figure 8.8: Structure of the read master template component[26]

The frame reader is also based on an Altera template, namely the Avalon Interface Memory-Mapped Master with burst reads. This component works in the same way as the frame writer, but instead reads data from the DDR2 memory address space and transfers it through a dual clock FIFO into the user design, which in this case is the deinterlacer module.

The USB reader component is based on the same template as the frame reader, but this time the output is connected to the external USB-to-FIFO bridge (FT2232H[21]) from FTDI. The component was modified to adapt to the timing required by the external USB bridge chip.

The last part of the frame buffer component is the Altera DDR2 High Performance Memory Controller core. This component provides the means to access the external DDR2 memory chip by mapping it into the Avalon subsystem memory space.

The entire frame buffer component is controlled by a second SOPC Builder subsystem located inside the FPGA. This subsystem provides the control functions for the entire FPGA design and the peripheral devices, including the communication link from the device to the PC using the second data channel of the FT2232H bridge, control of the timing of frame reading and writing in the frame buffer components, input video format detection and other support functions of the FPGA system. This subsystem is not described here for the reasons given in the Preface to this work.
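For readers unfamiliar with these templates, the following is a heavily simplified, illustrative sketch of what the bursting read-master front end of such a component looks like. It is neither the Altera template nor the project code; the port roles follow the Avalon-MM interface conventions[25], while the parameter values, the FIFO fill-level threshold and the frame restart behavior are assumptions made only for the sake of the example.

    // Simplified sketch of a bursting Avalon-MM read master: it issues fixed-length
    // burst reads over a frame-sized region and pushes the returned words into a
    // (dual clock) FIFO towards the user design. Widths and thresholds are illustrative.
    module burst_read_master_sketch #(
        parameter ADDR_W = 32,
        parameter DATA_W = 64,
        parameter BURST  = 64                       // burst length in local-side words
    )(
        input  wire                 clk,
        input  wire                 reset,
        // Avalon-MM master port
        output reg  [ADDR_W-1:0]    master_address,
        output reg                  master_read,
        output wire [7:0]           master_burstcount,
        input  wire                 master_waitrequest,
        input  wire [DATA_W-1:0]    master_readdata,
        input  wire                 master_readdatavalid,
        // frame parameters and output FIFO interface
        input  wire [ADDR_W-1:0]    frame_base,
        input  wire [ADDR_W-1:0]    frame_length,   // in bytes, multiple of one burst
        input  wire [9:0]           fifo_used,      // fill level of a 1024-word FIFO
        output wire                 fifo_write
    );
        localparam BYTES_PER_WORD = DATA_W / 8;

        reg [ADDR_W-1:0] bytes_left;
        reg [7:0]        beats_left;     // data beats still expected from the burst

        assign master_burstcount = BURST;
        // every returned word goes straight into the clock crossing FIFO
        assign fifo_write = master_readdatavalid;

        always @(posedge clk) begin
            if (reset) begin
                master_read    <= 1'b0;
                master_address <= frame_base;
                bytes_left     <= frame_length;
                beats_left     <= 8'd0;
            end else begin
                if (master_readdatavalid)
                    beats_left <= beats_left - 8'd1;

                if (master_read && !master_waitrequest) begin
                    // the burst command has been accepted by the interconnect
                    master_read    <= 1'b0;
                    master_address <= master_address + BURST * BYTES_PER_WORD;
                    bytes_left     <= bytes_left - BURST * BYTES_PER_WORD;
                    beats_left     <= BURST;
                end else if (!master_read && beats_left == 0 && bytes_left != 0
                             && fifo_used < (1024 - BURST)) begin
                    // previous burst fully received and the FIFO can take a whole new one
                    master_read <= 1'b1;
                end else if (!master_read && beats_left == 0 && bytes_left == 0) begin
                    // frame finished: wrap around and read the frame region again
                    master_address <= frame_base;
                    bytes_left     <= frame_length;
                end
            end
        end
    endmodule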

8.3.2 Implementation

As was said above, the component was implemented as an SOPC Builder subsystem. The system is clocked by an 81 MHz clock signal generated by a PLL component from an input 27 MHz clock signal. To reduce the maximum operating frequency requirements, the design uses a 64-bit wide interconnecting Avalon bus fabric.

Figure 8.9: The frame buffer module as seen in SOPC builder[20]

The local side of the DDR2 memory controller core is the connection to the Avalon bus fabric; the memory side is the connection to the external DDR2 memory chip. The DDR2 memory controller core is configured to run the external 16-bit DDR2 memory chip at 162 MHz, with the local side bus of the frame buffer subsystem running at half that frequency. Since both clocks are generated by a single PLL from the same reference clock source, there is no need for a clock crossing bridge when moving data to and from the DDR2 memory space. The memory controller is configured to use a burst length of 8 beats on the memory side. This translates to 128 bits of a local side data transaction being transferred in two clock periods of the local side bus, since (in bytes):

    2 (DDR2 width) × 2 (double data rate) × 4 (memory-side clock periods) = 16

This provides a highly efficient data path for transferring the video frame data. The DDR2 memory controller core is also configured to allow a local burst size of 64 words. This means that the core can be exclusively accessed by one of the three masters on the bus (frame reader, frame writer, USB link) for a maximum of 64 data word transfers, further improving the memory access efficiency.

When testing the frame buffer subsystem on a custom development board, the DDR2 memory channel had a width of 32 physical bits (there were two physical x16 DDR2 chips, each storing one half of the 32-bit word) and the frame buffer subsystem did not use any burst transfers. With this wide channel configuration, the system worked fine. However, after trying to optimize the design to use only a single x16 DDR2 memory device, it was found that a single x16 DDR2 device is not capable of providing the required bandwidth with this setup. This was clearly visible in the quality of the resulting output video feed, ranging from mild occasional tearing to a complete destruction of the output video structure a few seconds after system startup.
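To put the bandwidth problem into perspective, a rough, purely illustrative calculation (the exact format of the buffered camera stream is not restated here, so the pixel clock is an assumption): a 24-bit stream with a 74.25 MHz pixel clock amounts to about 74.25 MHz × 3 bytes ≈ 223 MB/s, and the frame buffer has to sustain at least one such write stream and one read stream concurrently, i.e. roughly 445 MB/s. The theoretical peak of a 16-bit DDR2 interface at 162 MHz is 162 MHz × 2 (DDR) × 2 bytes ≈ 648 MB/s, so the video streams alone would require close to 70% of the peak, not counting the USB readout and refresh overhead. A non-bursting, frequently re-arbitrated access pattern cannot reach such efficiency.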

Investigation with the SignalTap II logic analyzer showed that the DDR2 memory was being accessed in a highly inefficient manner. The memory address space is shared by three Avalon master components: the frame reader, the frame writer and the USB link. Since master access to the slave memory address space is scheduled by a round robin arbiter, this resulted in a high number of accesses for very small data items. The memory controller had to continuously switch either the currently accessed row and column, or switch between read and write operations. Both these events cause a high performance penalty by introducing wait cycles between the individual operations. For this reason, the Avalon example templates used at that time were changed to the bursting capable versions, and the memory controller core was changed to the burst capable version. This resulted in a significant performance improvement of the system, since the memory core was now accessed with very large transactions, diminishing the performance penalties of the small and frequent access pattern.

The read master template used as a base for the custom project components was modified by adding a dual clock FIFO in place of the originally used single clock one. Because the Verilog module source instantiates the FIFO as a standalone entity and the interfaces of the dual and single clock versions are relatively similar, the modification was fairly straightforward.

    the_master_to_video_fifo the_master_to_user_fifo (
        .aclr    (control_reset),
        .data    (master_readdata),
        .rdclk   (fifo_read_clock_input),
        .rdreq   (fifo_read_rdreq_input),
        .wrclk   (clk),
        .wrreq   (master_readdatavalid),
        .q       (fifo_read_data_output),
        .rdempty (fifo_read_rdempty_output),
        .wrusedw (fifo_used),
        .wrfull  ()
    );

The other significant modification to the reader template was the inclusion of the FT2232H FIFO interface for the USB link component. This was needed to properly present the parallel data to the external chip.

The frame writer component works in the same way as the frame reader component, with the data direction reversed. The input video data from the camera is fed to the input dual clock FIFO using the data enable signal of the video timing information. By using this signal as the write enable signal for the FIFO, only the active picture data are stored in the FIFO component.
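To illustrate that last point, below is a minimal sketch of the camera-side write gating, assuming a 24-bit camera bus. The module and signal names are hypothetical and only mirror the idea of using the data enable signal as the FIFO write request; the actual design uses the modified Altera template shown above.

    // Illustrative write side of the camera input FIFO: the data enable (de) signal
    // of the incoming video timing drives the FIFO write request, so only active
    // picture pixels enter the clock crossing FIFO.
    module camera_fifo_write_side (
        input  wire        cam_clk,        // camera pixel clock domain
        input  wire [23:0] cam_pixel,      // camera pixel data
        input  wire        cam_de,         // data enable: high during active picture
        output wire [23:0] fifo_wrdata,
        output wire        fifo_wrreq
    );
        // register the inputs once in the camera clock domain to ease timing
        reg [23:0] pixel_r;
        reg        de_r;

        always @(posedge cam_clk) begin
            pixel_r <= cam_pixel;
            de_r    <= cam_de;
        end

        assign fifo_wrdata = pixel_r;
        assign fifo_wrreq  = de_r;   // write only while active picture data is present
    endmodule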

Chapter 9

FPGA design flow

This chapter does not describe the FPGA design flow in the traditional sense - design entry, synthesis, fitting phase, timing analysis, programming - since there is plenty of information available on this topic elsewhere[24]. Instead, it tries to provide real-world experience and, hopefully, useful tips for designing an embedded system containing an FPGA. This chapter focuses mainly on the Altera Quartus II development software, but the principles can be applied to the design flow of any FPGA device vendor.

9.1 Separate projects for custom components

When developing an FPGA system with custom designed IP components, develop the custom cores as standalone projects (where possible). This allows the designer to focus on the component functionality rather than on the interactions with the rest of the system, reduces the compilation time of each design iteration and allows for component reusability.

9.2 Use standard interfaces

The system design can be greatly simplified by using standard interfaces wherever possible. A good example is the Avalon Interface used in SOPC designs targeting Altera devices. By designing the component interface according to a standard, the designer can then use the component together with IP cores developed by third parties. This saves development time and provides greater flexibility than developing a custom interface.

9.3 Optimize the memory access pattern

When the design uses some form of DDRx memory for data storage, think about and optimize the memory access pattern. The most common caveats are not using burst transfers for sequential data access and inefficiencies in bus width adapters. For example, the Nios II softcore processor (at least the current version) accesses memory in 32-bit words. When the memory controller core has a different local side width, this may result in inefficient transfers since, for example, half of the data word is discarded.

This may hold true for other components of the Avalon fabric as well. The memory access pattern itself is an important design decision. As described in the previous chapter, the differences in memory bandwidth between bursting and non-bursting transfers can be huge. There is plenty of information available on how to optimize the access patterns for a given task[27][28].

9.4 SignalTap II logic analyzer

A very useful feature of the Quartus design environment is the SignalTap II logic analyzer. When included in the project, the SignalTap component creates a logic analyzer from unused device resources and allows the designer to see the system behavior in real time. This is especially useful for debugging, where by setting the appropriate trigger conditions the designer can observe the real behavior of problematic parts of the design.

Figure 9.1: SignalTap II logic analyzer interface[20]

9.5 Horizontal and vertical device migration

When designing a particular FPGA system, it is useful to include the option to migrate to a different device, or even to a different device family, with a single PCB layout. This adds robustness to the project, because the hardware board is not limited to a single device and thus the system can be assembled even if the original device is not available. This mostly concerns the physical I/O pin mappings, where the designer has to take the intersection of the I/O pins of the devices considered for migration. There is also added flexibility should the design requirements change - a device with more or fewer resources can be used instead of the originally planned one. For example, with the recent earthquake and subsequent tsunami in Japan, the Altera supply chain was damaged and the target device for the project (Cyclone IV, 15k LEs, FBGA256) was not available from any major electronic components distributor. By designing the project for two device families (Cyclone III and IV), the system was assembled using a Cyclone III family device (16k LEs, FBGA256) and the project schedule was not affected.

9.6 Physical I/O pin mapping

When mapping the design I/O pins to a specific device, the designer should work closely with the layouter (the person designing the printed circuit board on which the FPGA chip is placed). By placing the design pins on optimal I/O positions of the device, the FPGA designer can reduce the complexity of the printed circuit board and therefore decrease the layout time and the cost of the PCB by reducing the number of required layers.

Chapter 10

Resulting hardware

After the evaluation of the design on a development board, a final hardware board was developed for the FPGA video processor. Based on the design requirements, a target device was selected to fit the design. The device selected from the Cyclone IV family was the EP4CE15F17C7N, with the option to migrate to the older Cyclone III generation device EP3C16F256C7N. These devices have very similar pinouts and can therefore be interchanged with no major problems.

Figure 10.1: Final hardware board containing the FPGA video processor

The entire system resource usage (after the optimizations done by the synthesis and fitter compilation phases) amounts to 88% of the device's logic elements, 61% of its memory bits and 151 pins (91%). An illustration of the final pin usage can be found in the appendix section of this work. Of this resource usage, the deinterlacer block uses 501 LEs, the alpha blender 228 LEs and the frame buffer component 8269 LEs. The high resource usage of the frame buffer is mostly caused by the DDR2 memory controller core (6304 LEs) and the Avalon bus arbitration (776 LEs). The reader and writer components use about 400 logic elements each. The DDR2 x16 memory is connected to banks 3 and 4; the other I/O pins are mostly used to connect the parallel video data buses of the two video input ICs and the one video output chip.

10.1 Verification of the hardware

The design was verified by several methods. The developed cores were simulated using the Simulator utility that was, at that time, integrated in the Quartus software. The designs were exercised with functional simulation to verify the correct functionality of each module. After successful simulation, the cores were tested in a real system to validate correct function. The behavior of the core under test was again checked with the SignalTap II logic analyzer. After these stages the entire FPGA system was developed, placed on a real hardware board, assembled into a final prototype, and this prototype was thoroughly tested. Visual image quality was checked, and communication with the PC and overall device stability were evaluated. The prototype also underwent thermal stress testing to validate the functionality over the entire operating temperature range required by the design specification.

Chapter 11

Conclusion

This work presented an overview of the development process of a hardware device using a Field Programmable Gate Array and provided example IP components for video processing to illustrate the possibilities and approaches used in such development. The capabilities of the selected low cost FPGA family from Altera were found to satisfy the requirements for processing power with no major problems. The flexibility of the FPGA architecture proved very useful, since there were many minor modifications to the system behavior throughout the development of the project. Although the per-chip price of an FPGA is still higher than that of an ASIC, for small production series this difference is balanced by the flexibility and the ability to customize the design to the project specific requirements. The overall development of the system took about one year; as of the time of writing of this thesis, the system is being prepared for production. The latency of the displayed live video feed was found to be well within the required range, providing the final users with a realistic response of the system to the scene being displayed in the video feed.

Apart from the cores developed by the author, the design uses many ready-to-use cores provided by Altera. The experiences with their use are mostly positive, with the exception of some of the Video and Image Processing Suite components, which had problems even compiling successfully. This illustrates the need to properly evaluate the IP being considered before the actual development begins.

Of course, the design could be done better. In retrospect, the design could have been done more effectively in terms of simplicity and overall readability. For example, the frame buffer component is controlled by an SOPC controller with the Nios II processor core. All the control signals and data busses are routed as connections in the top level schematic file from the controller module to the frame buffer module. Together with the frequent modifications to the design, this created a rather unreadable and messy mesh of interconnecting wires. Should the system be developed now, all of these control signals would be removed by setting the core parameters using registers accessible via the Avalon interface. This would result in a single, clean and readable entity.

The field of FPGA design is an attractive one; since many computing and embedded projects are slowly transitioning from single core to parallel architectures, programmable logic is a viable field of interest. With the announced production level quality partial reconfiguration[29], the devices may soon be able (in a usable way) to instantiate portions of a hardware design on the go, practically providing a hardware-on-demand capability.

This may well be the case for future architectures of digital systems, where some fixed silicon sequential core loads hardware blocks as required by an operating system. There may be interesting research topics in this area considering the possible implications of the discovery of the memristor[7], a circuit element capable of acting both as a storage element and as logic. The implications of merging high density storage with programmable logic resources would be immense. Should the memristor discovery indeed deliver the technology for the next generation of programmable logic, the devices may be capable of further reducing the performance gap between programmable and fixed silicon and start to appear in systems traditionally relying on an ASIC solution.

Bibliography

[1] IBSmm Engineering Czech Design Center official website [online]. IBSmm CDC, [cited ]. URL:
[2] Wikipedia: Moore's law [online]. Wikipedia, [cited ]. URL: http://en.wikipedia.org/wiki/moore%27s_law.
[3] Wikipedia: Programmable logic devices [online]. Wikipedia, [cited ]. URL: devices.
[4] Wikipedia: Field-programmable gate array [online]. Wikipedia, [cited ]. URL: Field-programmable_gate_array.
[5] Altera Cyclone IV Device Handbook [online]. Altera Corp., [cited ]. URL: cyiv-5v1.pdf.
[6] Altera Stratix V Device Family Overview [online]. Altera Corp., [cited ]. URL: stx5_51001.pdf.
[7] Finding the Missing Memristor, R. Stanley Williams [online]. Lecture recording, YouTube, [cited ]. URL: com/watch?v=bkghvkyjgly.
[8] Achronix Speedster22i product overview [online]. Achronix Semiconductor Corp., [cited ]. URL: speedster22i.html.
[9] Wikipedia: Memristor [online]. Wikipedia, [cited ]. URL: http://en.wikipedia.org/wiki/memristor.
[10] End of the CPU? HP demos configurable memristor [online]. R Colin Johnson, EE Times, [cited ]. URL: End-of-the-CPU-HP-demos-configurable-memristor.
[11] Intel Introduces Configurable Intel Atom-based Processor [online]. Press release, Intel Corporation, [cited ]. URL: intel.com/docs/doc
[12] Altera Video and Image Processing Suite User Guide [online]. Altera Corp., [cited ]. URL: ug/ug_vip.pdf.
[13] Wiki: File:YCbCr-CbCr Scaled Y50.png [online]. Wikipedia, [cited ]. URL: Scaled_Y50.png.
[14] Wiki: File:BNC connector (male).jpg [online]. Wikipedia, [cited ]. URL: %28male%29.jpg.
[15] Wiki: File:Dvi-cable.jpg [online]. Wikipedia, [cited ]. URL: http://en.wikipedia.org/wiki/file:dvi-cable.jpg.
[16] Video Electronics Standards Association official website [online]. VESA, [cited ]. URL:
[17] Society of Motion Picture and Television Engineers official website [online]. SMPTE, [cited ]. URL:
[18] Wiki: Serial Digital Interface [online]. Wikipedia, [cited ]. URL:
[19] Altera Corporation official website [online]. Altera Corp., [cited ]. URL:
[20] Altera Quartus II 11.0 Web Edition development environment application, build /27/2011. Altera Corporation.
[21] FT2232H Dual High Speed USB to multipurpose UART/FIFO device datasheet [online]. FTDI Chip, [cited ]. URL: ICs/DS_FT2232H.pdf.
[22] Future Technology Devices International Ltd. (FTDI): FT245R device datasheet [online]. Future Technology Devices International Ltd., [cited ]. URL: DataSheets/DS_FT245R.pdf.
[23] Deinterlacing (odstraneni radkoveho prokladu) [online]. Reboot magazine, [cited ]. URL: deinterlacing-odstraneni-radkoveho-prokladu/articles.html?id=206.
[24] FPGA designer curriculum [online]. Altera Corp., [cited ]. URL: fpga/trn-fpga.html.
[25] Avalon Interface Specification [online]. Altera Corp., [cited ]. URL: spec.pdf.
[26] Avalon Memory-Mapped Master Templates [online]. Altera Corp., [cited ]. URL: nios2/exm-avalon-mm.html.
[27] The Efficiency of the DDR & DDR2 SDRAM Controller Compiler [online]. Altera Corp., [cited ]. URL: literature/wp/wp_ddr_sdram_efficiency.pdf.
[28] Altera Forums [online]. Altera Corp., [cited ]. URL:
[29] Cyclone V Device Family Advance Information Brief [online]. Altera Corp., [cited ]. URL: cyclone-v/cyv_51001.pdf.

Products and/or technologies referenced in this work may be registered trademarks of their respective owners.

Appendix A

Pin placement

The pin placement illustration for the target device EP4CE15F17C7N. This placement is also compatible with the alternative EP3C16F256C7N.

Appendix B

Device floorplan

The resulting floorplan for the target device EP4CE15F17C7N, showing a relatively high resource usage of the chip by the design.

Appendix C

Slack histogram

This chart shows the TimeQuest timing analysis slack distribution histogram for the final design.


More information

Altera Error Message Register Unloader IP Core User Guide

Altera Error Message Register Unloader IP Core User Guide 2015.06.12 Altera Error Message Register Unloader IP Core User Guide UG-01162 Subscribe The Error Message Register (EMR) Unloader IP core (altera unloader) reads and stores data from the hardened error

More information

Serial port interface for microcontroller embedded into integrated power meter

Serial port interface for microcontroller embedded into integrated power meter Serial port interface for microcontroller embedded into integrated power meter Mr. Borisav Jovanović, Prof. dr. Predrag Petković, Prof. dr. Milunka Damnjanović, Faculty of Electronic Engineering Nis, Serbia

More information

Design of a High Speed Communications Link Using Field Programmable Gate Arrays

Design of a High Speed Communications Link Using Field Programmable Gate Arrays Customer-Authored Application Note AC103 Design of a High Speed Communications Link Using Field Programmable Gate Arrays Amy Lovelace, Technical Staff Engineer Alcatel Network Systems Introduction A communication

More information

The 50G Silicon Photonics Link

The 50G Silicon Photonics Link The 50G Silicon Photonics Link The world s first silicon-based optical data connection with integrated lasers White Paper Intel Labs July 2010 Executive Summary As information technology continues to advance,

More information

Selecting the Optimum PCI Express Clock Source

Selecting the Optimum PCI Express Clock Source Selecting the Optimum PCI Express Clock Source PCI Express () is a serial point-to-point interconnect standard developed by the Component Interconnect Special Interest Group (PCI-SIG). lthough originally

More information

PowerPC Microprocessor Clock Modes

PowerPC Microprocessor Clock Modes nc. Freescale Semiconductor AN1269 (Freescale Order Number) 1/96 Application Note PowerPC Microprocessor Clock Modes The PowerPC microprocessors offer customers numerous clocking options. An internal phase-lock

More information

TERMINAL Debug Console Instrument

TERMINAL Debug Console Instrument Summary This document describes how to place and use the TERMINAL virtual instrument in an FPGA design. Core Reference CR0180 (v2.0) March 06, 2008 The TERMINAL device is a debug console instrument for

More information

AN FPGA FRAMEWORK SUPPORTING SOFTWARE PROGRAMMABLE RECONFIGURATION AND RAPID DEVELOPMENT OF SDR APPLICATIONS

AN FPGA FRAMEWORK SUPPORTING SOFTWARE PROGRAMMABLE RECONFIGURATION AND RAPID DEVELOPMENT OF SDR APPLICATIONS AN FPGA FRAMEWORK SUPPORTING SOFTWARE PROGRAMMABLE RECONFIGURATION AND RAPID DEVELOPMENT OF SDR APPLICATIONS David Rupe (BittWare, Concord, NH, USA; drupe@bittware.com) ABSTRACT The role of FPGAs in Software

More information

Configuring Memory on the HP Business Desktop dx5150

Configuring Memory on the HP Business Desktop dx5150 Configuring Memory on the HP Business Desktop dx5150 Abstract... 2 Glossary of Terms... 2 Introduction... 2 Main Memory Configuration... 3 Single-channel vs. Dual-channel... 3 Memory Type and Speed...

More information

Dual DIMM DDR2 and DDR3 SDRAM Interface Design Guidelines

Dual DIMM DDR2 and DDR3 SDRAM Interface Design Guidelines Dual DIMM DDR2 and DDR3 SDRAM Interface Design Guidelines May 2009 AN-444-1.1 This application note describes guidelines for implementing dual unbuffered DIMM DDR2 and DDR3 SDRAM interfaces. This application

More information

Application Note 83 Fundamentals of RS 232 Serial Communications

Application Note 83 Fundamentals of RS 232 Serial Communications Application Note 83 Fundamentals of Serial Communications Due to it s relative simplicity and low hardware overhead (as compared to parallel interfacing), serial communications is used extensively within

More information

DE4 NetFPGA Packet Generator Design User Guide

DE4 NetFPGA Packet Generator Design User Guide DE4 NetFPGA Packet Generator Design User Guide Revision History Date Comment Author 01/30/2012 Initial draft Harikrishnan Contents 1. Introduction... 4 2. System Requirements... 4 3. Installing DE4 NetFPGA

More information

10/100/1000Mbps Ethernet MAC with Protocol Acceleration MAC-NET Core with Avalon Interface

10/100/1000Mbps Ethernet MAC with Protocol Acceleration MAC-NET Core with Avalon Interface 1 Introduction Ethernet is available in different speeds (10/100/1000 and 10000Mbps) and provides connectivity to meet a wide range of needs from desktop to switches. MorethanIP IP solutions provide a

More information