Implementation of Canny Edge Detector of Color Images on CELL/B.E. Architecture
Chirag Gupta, Sumod Mohan K
cgupta@clemson.edu, sumodm@clemson.edu
Abstract
In this project we propose a method to improve the efficiency and performance of the Canny edge detector by implementing it on the CELL/B.E. architecture. The Canny edge detection algorithm involves a large number of matrix operations. The convolution and other matrix operations can be carried out efficiently on the CELL/B.E. architecture because of its parallel job processing ability and its SIMD instructions. About 10% more edges can be found using color images, but to obtain these extra edges the amount of processing triples, because it has to be done for each of the red, green, and blue images.
Introduction
Applications such as traffic control, medical imaging, and surveillance require a great deal of real-time video and image processing, which in turn requires a great deal of computational power. A standard camera captures NTSC video at a rate of 30 frames per second, and each frame has 720x480 pixels; performing image processing on such large images at this rate is a very difficult task.
Edge detection is a fundamental tool used in most image processing applications to obtain information from the frames as a precursor step to feature extraction and object segmentation. This process detects the outlines of an object and the boundaries between objects and the background in the image. An edge-detection filter can also be used to improve the appearance of blurred or anti-aliased video streams.
Gradient estimation is one of the important steps in detecting edges and other important features. It requires a convolution between the image matrix and kernels. The gradient calculation can be executed as a parallel task: calculating the gradient magnitude for one pixel does not depend on the gradient magnitude of any other pixel in the same image.
In this project we use the Canny edge detection algorithm to find the edges. We apply the algorithm to color images in order to get the best edges. The reason is that when edge detection is performed on grayscale images, an edge may not be detected if the grayscale values of two nearby pixels do not differ much. Nearby pixels may differ greatly in color but not much in the average of the three colors. When the color difference is calculated for such pixels, the edge is detected. Although using color images improves the detected edges, the computation also increases, because we now have to find the gradient image for each of the red, green, and blue images. This computational overhead is compensated for by the use of the CELL/B.E. architecture.
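The argument for color images can be illustrated with two made-up pixel values. The sketch below is only an illustration: the pixel values and the function names are hypothetical, and grayscale is taken to be the plain average of the three channels, as the paragraph above describes.

```python
def gray(pixel):
    """Grayscale value taken as the average of the three color channels."""
    r, g, b = pixel
    return (r + g + b) / 3


def gray_difference(p, q):
    """Difference seen by a detector that works on the grayscale image."""
    return abs(gray(p) - gray(q))


def max_channel_difference(p, q):
    """Largest per-channel difference, seen when R, G, B are processed separately."""
    return max(abs(a - b) for a, b in zip(p, q))


# Two hypothetical neighboring pixels: a reddish one and a greenish one.
left = (200, 50, 50)
right = (50, 200, 50)

print(gray_difference(left, right))         # 0.0  (no edge visible in grayscale)
print(max_channel_difference(left, right))  # 150  (strong edge in the red and green channels)
```

Both pixels average to 100, so a grayscale detector sees no step between them, while the red and green channels each change by 150.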
Edge Detection
The basic edge-detection operator is a matrix area gradient operation that determines the level of variance between different pixels. The operator is calculated by forming a matrix centered on the pixel chosen as the center of the matrix area. If the value of this matrix area is above a given threshold, the middle pixel is classified as an edge. Examples of gradient-based edge detectors are the Roberts, Prewitt, and Sobel operators. These operators are of fixed size, so in order to have a variable-size operator we can use the Canny edge detection technique, in which Gaussian kernels are used for gradient estimation.
The Canny edge detection algorithm involves three main steps for estimating the edges of a given image: convolving the image with Gaussian and derivative-of-Gaussian kernels to get the gradient and its direction, non-maximal suppression to refine the edges, and thresholding to separate the edges from the rest of the image.
In gradient estimation we perform the convolution using the derivative of the Gaussian function. The Gaussian function is used for both smoothing and gradient estimation because it is completely described by its 1st and 2nd order statistics; the derivative of the Gaussian produces the differentiating kernel, while the Gaussian itself is used for smoothing. The gradient calculation involves two tasks: first smoothing along one axis, and then convolving the result with the derivative kernel along the other axis. The magnitude of the gradient and its direction are then calculated for each pixel.
Once the gradient has been found, the edges are not sharp. To sharpen the edges and reduce the number of unwanted edge pixels we use non-maximal suppression. In this method the value of a pixel is decided based on its gradient direction.
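The gradient-estimation step described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions and not the SPE code used in the project: the function names, the border handling (edge pixels are replicated), and the normalization of the derivative kernel are all choices made here for the example. The defaults (sigma of 1, half-width of 1, giving a 3-tap kernel) are chosen to match the 3x3 setting reported in the Results section.

```python
import math


def gaussian_kernels(sigma, half_width):
    """Return 1-D (smoothing, derivative) kernels sampled from a Gaussian."""
    xs = list(range(-half_width, half_width + 1))
    g = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in xs]
    total = sum(g)
    g = [v / total for v in g]                    # smoothing kernel sums to 1
    dg = [-x * v for x, v in zip(xs, g)]          # proportional to dG/dx
    scale = sum(-x * v for x, v in zip(xs, dg))   # response to a ramp of slope 1
    dg = [v / scale for v in dg]                  # normalize that response to 1
    return g, dg


def convolve_1d(signal, kernel):
    """True 1-D convolution; borders are handled by replicating edge samples."""
    hw = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i - (k - hw), 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def transpose(m):
    return [list(row) for row in zip(*m)]


def gradient(image, sigma=1.0, half_width=1):
    """Return (magnitude, direction in degrees) for a 2-D list of intensities."""
    g, dg = gaussian_kernels(sigma, half_width)
    # x-gradient: smooth along y (columns), then differentiate along x (rows).
    smooth_y = transpose([convolve_1d(col, g) for col in transpose(image)])
    gx = [convolve_1d(row, dg) for row in smooth_y]
    # y-gradient: smooth along x (rows), then differentiate along y (columns).
    smooth_x = [convolve_1d(row, g) for row in image]
    gy = transpose([convolve_1d(col, dg) for col in transpose(smooth_x)])
    mag = [[math.hypot(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]
    ang = [[math.degrees(math.atan2(b, a)) for a, b in zip(ra, rb)]
           for ra, rb in zip(gx, gy)]
    return mag, ang


# A vertical step edge: dark on the left, bright on the right.
image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]
mag, ang = gradient(image)
print([round(v, 3) for v in mag[2]])
```

For this step image the magnitude is non-zero only in the two columns next to the step, and the direction there is horizontal. Every output pixel depends only on a small neighborhood of the input, which is what makes the step easy to split across processing elements.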
If, say, the direction of a pixel is between 0 and 22.5 degrees, then the pixels to its right and left are compared with it; if either of them has a larger gradient value than this pixel, its value is set to zero, otherwise its value is kept as it is.
After obtaining the non-maximal suppressed image we have to find the edge pixels. To do this we first sort the pixels that have non-zero values. After sorting, a high threshold is set such that 10% of the pixels are above it. The low threshold is taken as a fraction of the high threshold, such as one fifth. Each pixel is then checked against the thresholds: if its value is above the high threshold it is set to black, and if it is below the low threshold it is set to white. For pixels with values between the high and low thresholds, we check whether any of the eight neighbors has a value above the high threshold; if there is such a neighbor the pixel is set to black, otherwise it is set to zero.
The performance of the Canny algorithm depends heavily on its adjustable parameters: σ, the standard deviation of the Gaussian filter, and the high and low threshold values. The bigger the value of σ, the larger the Gaussian filter becomes. This implies more blurring, which is necessary for noisy images, as well as the detection of larger edges. However, the larger the scale of the Gaussian, the less accurate the localization of the edge. Smaller values of σ imply a smaller Gaussian filter, which limits the amount of blurring and maintains finer edges in the image.
CELL/B.E. Architecture
Cell is an architecture for high-performance distributed computing. It comprises hardware and software Cells. Software Cells consist of data and programs (known as jobs or apulets); these are sent out to the hardware Cells, where they are computed, and the results are then returned.
A basic configuration of the Cell architecture comprises one Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs). The PPE and SPEs are connected by an internal high-speed bus. The PPE is a multithreaded core and is the controller for the SPEs. It is similar to other 64-bit PowerPC processors and so works with a conventional operating system. The clock speed of the PPE is 3.2 GHz. The PPE creates threads, and these threads are dispatched to the SPEs to perform the mathematical operations. The SPEs then send the results back to the PPE.
An SPE is a RISC processor. Each SPE has a local store of 256 KB. An SPE can operate on 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4 single-precision floating-point numbers in a single clock cycle. It can also do a memory operation in the same clock cycle. The SPU processor cannot directly access system memory; the 64-bit memory addresses formed by the SPU must be passed from the SPU processor to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space. The number of SPEs that can be used in a specific application differs; for example, an application on the PS3 can use 6 of the 8 SPEs.
Scaling is just one capability of the Cell architecture, but the individual systems are potent enough on their own. An individual Cell has a theoretical computing capability of 256 GFLOPS (billion floating-point operations per second) at 4 GHz. Cell may be unusual in that, given the right type of problem, it may actually be able to get close to its maximum computational figure. The Cell's hardware has been specifically designed to provide sufficient data to the computational elements to enable such performance. This is a rather different approach from the usual one, which is to hide the slower parts of the system. All systems are limited by their slowest components; Cell was designed not to have any slow components.
The main program executes on the PowerPC core and the calculation part is transferred to the SPEs. The SPEs perform vector multiplication directly on 128-bit vectors, which makes execution faster. The operating system running on the PowerPC core is Fedora Core 6 GNU/Linux.
Image courtesy: Nicholas Blachford, http://www.blachford.info/computer/article/cellprogramming1.html
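The architecture figures quoted above can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation rather than a measurement. The constant and function names are made up for the example, and the factor of two floating-point operations per lane per cycle (a fused multiply-add) is an assumption introduced here to reproduce the quoted 256 GFLOPS figure. The 16 KB per-transfer DMA limit is the one stated in the Implementation section.

```python
SIMD_REGISTER_BITS = 128        # width of an SPE vector register
LOCAL_STORE_BYTES = 256 * 1024  # local store per SPE
DMA_MAX_BYTES = 16 * 1024       # maximum size of a single DMA transfer


def simd_lanes(element_bits):
    """Number of elements of the given width that fit in one 128-bit vector."""
    return SIMD_REGISTER_BITS // element_bits


def peak_gflops(num_spes, clock_ghz, lanes=4, flops_per_lane=2):
    """Theoretical single-precision peak.

    flops_per_lane=2 assumes one fused multiply-add per lane per cycle;
    this is an assumption of the sketch, not a figure from the text.
    """
    return num_spes * lanes * flops_per_lane * clock_ghz


def dma_transfers_needed(num_bytes):
    """Number of 16 KB DMA transfers needed to move num_bytes (ceiling division)."""
    return -(-num_bytes // DMA_MAX_BYTES)


print([simd_lanes(bits) for bits in (8, 16, 32)])  # [16, 8, 4]
print(peak_gflops(num_spes=8, clock_ghz=4.0))      # 256.0
print(dma_transfers_needed(96 * 1024))             # 6
print(LOCAL_STORE_BYTES // DMA_MAX_BYTES)          # 16
```

Under these assumptions, eight SPEs at 4 GHz give the quoted 256 GFLOPS. Six full-size DMA transfers move at most 96 KB, which fits comfortably in a 256 KB local store alongside code and working buffers.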
Implementation
The test image is a BMP file; first its header is read and copied to the output file. Three arrays are initialized, corresponding to red, green, and blue, and for each pixel the value of each color is stored in the corresponding array. The PowerPC uses big-endian format but the bitmap file stores its values in little-endian format, so the necessary conversions are done.
To maximize parallel processing, computationally demanding processes are split among the six SPUs. The SPUs are initialized from the PPU by creating threads. Operations that require more computational resources and that are independent of other parts of the image, except for the immediate neighborhood, are processed on the SPUs. The SPU local store is limited to 256 KB, so the image to be processed is, in our case, divided into six parts that are sent to the SPUs for processing. The data transfer from PPU to SPU is done by direct memory access (DMA). Each DMA transfer instruction can transfer up to 16 KB. Multiple DMA transfers can be initiated, and they can be grouped by the use of tags. In this implementation, the image part for each SPU is transferred by six DMA transfers. The image is processed with SIMD (single instruction, multiple data) instructions to increase efficiency. The computed data is transferred back to the PPU, again by DMA.
The three gradient values for each pixel are then compared; the maximum of the three is stored in a separate array, and the direction associated with this gradient value is stored in another array. The gradient and its direction are passed to the function that performs non-maximal suppression. After non-maximal suppression the values are sent to the thresholding function. In the thresholding function all the values are first arranged in ascending order; 10% of the initial values is taken as the high threshold and 65% is taken as the low threshold.
Then, depending on the range in which a pixel's value lies, the decision is made whether or not it is an edge pixel. If a pixel value is between the high and low thresholds, its eight neighbors are checked; if any one of them has a value greater than the high threshold, this pixel is also treated as an edge pixel. The final edges are written to the output file.
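The PPU-side steps described above (taking the largest of the three channel gradients, non-maximal suppression, and thresholding) can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's code. All function names are hypothetical. Directions are assumed to be in degrees with the row index increasing downward. The low threshold is taken as one fifth of the high threshold, as in the Edge Detection section, and the result is returned as a boolean edge map rather than as black and white pixel values.

```python
def combine_channels(mags, angs):
    """Per pixel, keep the largest of the channel gradients and its direction."""
    h, w = len(mags[0]), len(mags[0][0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            c = max(range(len(mags)), key=lambda i: mags[i][y][x])
            mag[y][x], ang[y][x] = mags[c][y][x], angs[c][y][x]
    return mag, ang


# Neighbor offsets (dy, dx) for directions quantized to 0, 45, 90, 135 degrees.
OFFSETS = {0: (0, 1), 45: (1, 1), 90: (1, 0), 135: (1, -1)}


def non_max_suppress(mag, ang):
    """Zero every pixel that has a larger neighbor along its gradient direction."""
    h, w = len(mag), len(mag[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sector = (int(round((ang[y][x] % 180) / 45.0)) % 4) * 45
            dy, dx = OFFSETS[sector]
            keep = True
            for sy, sx in ((dy, dx), (-dy, -dx)):
                ny, nx = y + sy, x + sx
                if 0 <= ny < h and 0 <= nx < w and mag[ny][nx] > mag[y][x]:
                    keep = False
            if keep:
                out[y][x] = mag[y][x]
    return out


def threshold(nms, top_fraction=0.10, low_ratio=0.2):
    """Two-level thresholding with a single 8-neighbor check for weak pixels."""
    h, w = len(nms), len(nms[0])
    values = sorted(v for row in nms for v in row if v > 0)
    if not values:
        return [[False] * w for _ in range(h)]
    # High threshold: roughly top_fraction of the non-zero pixels lie at or above it.
    high = values[min(len(values) - 1, int(len(values) * (1.0 - top_fraction)))]
    low = high * low_ratio
    strong = [[v >= high for v in row] for row in nms]
    edges = [row[:] for row in strong]
    for y in range(h):
        for x in range(w):
            if strong[y][x] or nms[y][x] < low:
                continue
            # Weak pixel: an edge only if one of its 8 neighbors is strong.
            edges[y][x] = any(strong[ny][nx]
                              for ny in range(max(0, y - 1), min(h, y + 2))
                              for nx in range(max(0, x - 1), min(w, x + 2)))
    return edges


print(non_max_suppress([[1.0, 5.0, 2.0]], [[0.0, 0.0, 0.0]]))  # [[0.0, 5.0, 0.0]]
print(threshold([[10.0, 4.0, 0.0, 0.0, 4.0]]))  # [[True, True, False, False, False]]
```

In the second example the weak pixel next to the strong one is kept and the isolated weak pixel is discarded. The neighbor check is a single pass, as in the description above; many Canny implementations instead follow chains of weak pixels, which this sketch does not attempt.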
Results
We used lena.bmp for detecting the edges. The value of sigma was 1 and the kernel size was kept at 3x3.
[Figures: Original Image; Gradient Image for Red; Gradient Image for Green; Gradient Image for Blue; Final Image]
[Other images: Color Image; Color Canny Edge Image; Grey Scale Image; Grey Scale Canny Image]
Conclusion
The Canny edge detector was successfully implemented on the CELL/B.E. architecture and the results were found to be satisfactory. The implementation on the CELL/B.E. architecture reduces the time taken to produce edges compared to normal 64-bit processors. Further optimization could be done by improving the techniques used for DMA transfers and vector multiplications. The main possible optimizations are double buffering the DMA transfers, finding the optimum transfer size, and interleaving the transfer instructions with the computations. Another possible optimization is a method to maximize the use of vectors in the multiplications.
Acknowledgement
We are extremely grateful to Dr. Stanley Birchfield for his guidance during the course of the project. We are also thankful to Dr. Tarek Taha for his support and guidance in programming on the PlayStation 3.
References
1. Canny, J., "A Computational Approach to Edge Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
2. Neoh, H. S., and Hazanchuk, A., "Adaptive Edge Detection for Real-Time Video Processing using FPGAs," Altera Corporation, San Jose, CA.
3. http://www.blachford.info/computer/cell/cell0_v2.html
4. http://aser.ornl.gov/presentations/ps3_cell_overview-ryan_kerekes.pdf
5. http://cell.fixstars.com/opencv/index.php/opencv_on_the_cell