Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.


Chirag Gupta, Sumod Mohan K
cgupta@clemson.edu, sumodm@clemson.edu

Abstract

In this project we propose a method to improve the efficiency and performance of the Canny edge detector by implementing it on the CELL/B.E. architecture. The Canny edge detection algorithm involves a large number of matrix operations. The convolution and other matrix operations can be carried out efficiently on the CELL/B.E. architecture because of its parallel job processing ability and its SIMD instructions. About 10% more edges can be found using color images, but for these extra edges the amount of processing triples, as it has to be done for each of the red, green and blue images.

Introduction

Applications such as traffic control, medical imaging and surveillance require a great deal of real-time video and image processing, and therefore a great deal of computational power. A standard camera captures NTSC video at a rate of 30 frames per second, and each frame has 720x480 pixels; performing image processing on such large images at such a high rate is a very difficult task.

Edge detection is a fundamental tool used in most image processing applications to obtain information from the frames as a precursor to feature extraction and object segmentation. It detects the outlines of objects and the boundaries between objects and the background of the image. An edge-detection filter can also be used to improve the appearance of blurred or anti-aliased video streams.

Gradient estimation is one of the important steps in detecting edges and other important features. It requires a convolution between the image matrix and a set of kernels. The gradient calculation can be executed as a parallel task: the gradient magnitude of a pixel does not depend on the gradient magnitude of any other pixel in the same image.

In this project we use the Canny edge detection algorithm to find the edges, and we apply it to color images in order to get the best edges. The reason is that when edge detection is performed on a gray-scale image, an edge may not be detected if the gray-scale values of two nearby pixels do not differ much. Two nearby pixels may differ greatly in color but not much in the average of the three color values. When the color difference is calculated for such pixels, the edge is detected. Although using color images improves the detected edges, it also increases the computation, because the gradient image now has to be found for each of the red, green and blue images. This computational overhead is compensated for by the use of the CELL/B.E. architecture.
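The argument for color images can be made concrete with a small worked example. The two pixel values below are made up for illustration, and gray level is taken as the plain average of the three channels, as described above.

```python
# Two neighbouring pixels with different colours but the same grey level.
# The (R, G, B) values are illustrative only.
left = (200, 50, 50)    # reddish
right = (50, 200, 50)   # greenish


def gray(pixel):
    """Grey level as the plain average of the three channels."""
    return sum(pixel) / 3


gray_diff = abs(gray(left) - gray(right))
channel_diffs = [abs(a - b) for a, b in zip(left, right)]

print(gray_diff)       # 0.0 -> no edge visible in the grey-scale image
print(channel_diffs)   # [150, 150, 0] -> strong edge in red and green
```

A gray-scale detector sees no intensity change between these two pixels, while the red and green channels each change by 150 levels.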

Edge Detection

The basic edge-detection operator is a matrix area gradient operation that determines the level of variance between different pixels. The operator is calculated by forming a matrix centered on the pixel under test. If the value of this matrix area is above a given threshold, the middle pixel is classified as an edge. Examples of gradient-based edge detectors are the Roberts, Prewitt and Sobel operators. These operators are of fixed size, however, so to obtain a variable-size operator we use the Canny edge detection technique, in which Gaussian kernels are used for gradient estimation.

The Canny edge detection algorithm involves three main steps: convolving the image with Gaussian and derivative-of-Gaussian kernels to obtain the gradient magnitude and direction, non-maximal suppression to refine the edges, and thresholding to separate the edges from the rest of the image.

In gradient estimation we perform the convolution using the derivative of the Gaussian function. The Gaussian function is used for smoothing and for finding the gradient because it is completely described by its 1st and 2nd order statistics; the derivative of the Gaussian gives a differentiating kernel, while the Gaussian itself is used for smoothing. The gradient calculation performs two tasks: smoothing along one axis, and then convolving the result with the derivative kernel along the other axis. The magnitude and direction of the gradient are then calculated for each pixel.

Once the gradient has been found, the edges are not sharp. To sharpen the edges and to remove unwanted edge pixels we use non-maximal suppression. In this method the value of a pixel is decided based on its gradient direction.
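The gradient estimation step described above can be sketched in plain Python as follows. This is a minimal illustration, not the SPE code of this project: the function names, the border handling (edge replication) and the kernel normalization are assumptions made for the example.

```python
import math


def gaussian_kernels(sigma, half_width):
    """Return (smoothing, derivative) 1-D kernels sampled at integer offsets."""
    xs = range(-half_width, half_width + 1)
    g = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in xs]
    total = sum(g)
    g = [v / total for v in g]                              # smoothing kernel sums to 1
    dg = [-x * v / (sigma * sigma) for x, v in zip(xs, g)]  # derivative of Gaussian
    return g, dg


def convolve_rows(img, k):
    """Convolve every row of img with the 1-D kernel k (borders replicated)."""
    h = len(k) // 2
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x in range(width):
            acc = 0.0
            for j in range(-h, h + 1):
                xi = min(max(x - j, 0), width - 1)   # clamp index at the border
                acc += k[j + h] * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def convolve_cols(img, k):
    """Convolve every column of img with the 1-D kernel k."""
    return transpose(convolve_rows(transpose(img), k))


def gradient(img, sigma=1.0, half_width=1):
    """Return per-pixel gradient magnitude and direction (degrees)."""
    g, dg = gaussian_kernels(sigma, half_width)
    gx = convolve_rows(convolve_cols(img, g), dg)   # smooth along y, differentiate along x
    gy = convolve_cols(convolve_rows(img, g), dg)   # smooth along x, differentiate along y
    mag = [[math.hypot(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]
    ang = [[math.degrees(math.atan2(b, a)) for a, b in zip(ra, rb)]
           for ra, rb in zip(gx, gy)]
    return mag, ang


# A vertical step edge: the gradient is horizontal and peaks at the step.
step = [[0, 0, 0, 10, 10, 10] for _ in range(5)]
mag, ang = gradient(step)
```

With `half_width=1` each kernel has three taps. Because every output pixel depends only on a small neighborhood of the input, the loops over pixels are independent, which is the property that allows the work to be split across processing elements.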
For example, if the gradient direction of a pixel is between 0 and 22.5 degrees, the pixel is compared with the pixels on its right and left; if either of them has a larger gradient value, the pixel is set to zero, otherwise its value is kept as it is.

After obtaining the non-maximal suppressed image we have to find the edge pixels. We first sort the pixels that have non-zero values. A high threshold is then set such that 10% of the pixels are above it. The low threshold is taken as a fraction of the high threshold, for example one fifth. Each pixel is then checked against the thresholds: if its value is above the high threshold it is set to black (edge), and if it is below the low threshold it is set to white (non-edge). For pixels with values between the two thresholds, we check whether any of the eight neighbors has a value above the high threshold; if so the pixel is set to black, otherwise it is set to white.

The performance of the Canny algorithm depends heavily on its adjustable parameters: σ, the standard deviation of the Gaussian filter, and the two threshold values. The bigger the value of σ, the larger the Gaussian filter becomes. This implies more blurring, which is necessary for noisy images, and the detection of larger edges. However, the larger the scale of the Gaussian, the less accurate the localization of the edge. Smaller values of σ imply a smaller Gaussian filter, which limits the amount of blurring and preserves finer edges in the image.

CELL/B.E. Architecture

Cell is an architecture for high-performance distributed computing. It comprises hardware and software Cells. Software Cells consist of data and programs (known as jobs or apulets); these are sent out to the hardware Cells, where they are computed, and the results are then returned.
A basic configuration of the Cell architecture comprises one Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs). The PPE and SPEs are connected by an internal high-speed bus. The PPE is a multithreaded core and is the controller for the SPEs. It is similar to other 64-bit PowerPC processors and so works with conventional operating systems. The clock speed of the PPE is 3.2 GHz. The PPE creates threads, and these threads are dispatched to the SPEs to perform the mathematical operations. The SPEs then send the results of the operations back to the PPE.

An SPE is a RISC processor. Each SPE has a local store of 256 KB. An SPE can operate on 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4 single-precision floating-point numbers in a single clock cycle. It can also do a memory operation in the same clock cycle. The SPU cannot directly access system memory; the 64-bit memory addresses formed by the SPU must be passed to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space. The number of SPEs that can be used differs between applications; for example, the PS3 can use 6 of the 8 SPEs.

Scaling is just one capability of the Cell architecture; the individual systems are potent enough on their own. An individual Cell has a theoretical computing capability of 256 GFLOPS (billion floating-point operations per second) at 4 GHz. Cell may be unusual in that, given the right type of problem, it may actually get close to its maximum computational figure. The Cell's hardware has been specifically designed to provide sufficient data to the computational elements to enable such performance. This is a rather different approach from the usual one, which is to hide the slower parts of the system. All systems are limited by their slowest components; Cell was designed not to have any slow components.

The main program executes on the PowerPC and the calculation part is transferred to the SPEs. The SPEs perform the vector multiplication directly on 128 bits, which makes the execution faster. The operating system running on the PowerPC is Fedora Core 6 GNU/Linux.

Image courtesy: Nicholas Blachford, http://www.blachford.info/computer/article/cellprogramming1.html
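The element counts and the peak figure quoted above follow from simple arithmetic on the 128-bit vector width. The sketch below shows one common way to arrive at them; it assumes that each SPE issues one 4-wide single-precision fused multiply-add per cycle and that a fused multiply-add is counted as two floating-point operations. The function names are hypothetical.

```python
REGISTER_BITS = 128  # width of one SPE vector register


def lanes(element_bits):
    """Number of elements of the given width that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def peak_gflops(num_spes, clock_ghz, lanes_per_op=4, flops_per_lane=2):
    """Peak single-precision rate, counting a fused multiply-add as 2 operations."""
    return num_spes * lanes_per_op * flops_per_lane * clock_ghz


print([lanes(bits) for bits in (8, 16, 32)])   # [16, 8, 4]
print(peak_gflops(8, 4.0))                     # 256.0
print(peak_gflops(6, 3.2))                     # six usable SPEs at 3.2 GHz
```

The last line applies the same arithmetic to the six SPEs available on the PS3 at the 3.2 GHz clock mentioned above, which gives a lower ceiling than the eight-SPE, 4 GHz figure.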

Implementation

The test image is a BMP file. Its header is read first and copied to the output file. Three arrays are initialized, corresponding to red, green and blue, and for each pixel the value of each color is stored in the corresponding array. The PowerPC uses the big-endian format but the bitmap file stores its values in little-endian format, so the necessary conversions are done.

To maximize parallel processing, the computationally demanding steps are split amongst the six SPUs. The SPUs are initialized from the PPU by the creation of threads. Operations that require more computational resources and that depend only on the immediate neighborhood of a pixel, not on other parts of the image, are processed on the SPUs. The SPU local store is limited to 256 KB, so in our case the image to be processed is divided into six parts, which are sent to the SPUs for processing.

The data transfer from PPU to SPU is done by direct memory access (DMA). Each DMA transfer instruction can transfer up to 16 KB. Multiple DMA transfers can be initiated, and they can be grouped by the use of tags. In this implementation, the image part for each SPU is transferred by six DMA transfers. The image is processed with SIMD (single instruction, multiple data) instructions to increase efficiency. The computed data is transferred back to the PPU, again by DMA transfers.

The three gradient values for each pixel (one per color) are then compared; the maximum of the three is stored in a separate array, and the direction associated with this gradient value is stored in another array. The gradient and its direction are passed to the function that performs non-maximal suppression, and the result is then sent to the thresholding function. In the thresholding function all the values are first arranged in ascending order; 10% of the initial values is taken as the higher threshold and 65% is taken as the lower threshold value. Then, depending on the range in which the value of a pixel lies, the decision is made whether the pixel is an edge pixel or not. If a pixel value is between the high and low thresholds, its eight neighbors are checked; if any one of them has a value greater than the high threshold, this pixel is also treated as an edge pixel. The final edges are written to the output file.
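The steps that follow the per-channel gradient computation can be sketched as below. This is an illustrative plain-Python version, not the code that ran on the PPU: the function names, the treatment of border pixels, and the default `low_ratio` of one fifth (the example ratio from the Edge Detection section) are assumptions, and the low threshold is left as a parameter because the text gives more than one value for it.

```python
def combine_channels(channels):
    """channels: list of (mag, ang) pairs, one per colour; keep the strongest per pixel."""
    rows, cols = len(channels[0][0]), len(channels[0][0][0])
    mag = [[0.0] * cols for _ in range(rows)]
    ang = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            mag[y][x], ang[y][x] = max((c[0][y][x], c[1][y][x]) for c in channels)
    return mag, ang


def sector_offset(angle_deg):
    """Map a gradient direction to the (dy, dx) step to the neighbour along it."""
    a = angle_deg % 180.0
    if a < 22.5 or a >= 157.5:
        return 0, 1      # horizontal gradient: compare left and right
    if a < 67.5:
        return 1, 1      # diagonal
    if a < 112.5:
        return 1, 0      # vertical gradient: compare up and down
    return 1, -1         # other diagonal


def non_max_suppress(mag, ang):
    """Zero every interior pixel that has a larger neighbour along its gradient."""
    rows, cols = len(mag), len(mag[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):          # border pixels are left at zero
        for x in range(1, cols - 1):
            dy, dx = sector_offset(ang[y][x])
            m = mag[y][x]
            if m >= mag[y + dy][x + dx] and m >= mag[y - dy][x - dx]:
                out[y][x] = m
    return out


def hysteresis(nms, top_fraction=0.10, low_ratio=0.2):
    """Double threshold: strong pixels are edges, weak ones need a strong neighbour."""
    rows, cols = len(nms), len(nms[0])
    edges = [[False] * cols for _ in range(rows)]
    values = sorted(v for row in nms for v in row if v > 0)
    if not values:
        return edges
    high = values[int(len(values) * (1.0 - top_fraction))]  # about 10% lie at or above
    low = low_ratio * high
    for y in range(rows):
        for x in range(cols):
            v = nms[y][x]
            if v >= high:
                edges[y][x] = True
            elif v > low:
                # weak pixel: edge only if one of the eight neighbours is strong
                edges[y][x] = any(
                    nms[j][i] >= high
                    for j in range(max(y - 1, 0), min(y + 2, rows))
                    for i in range(max(x - 1, 0), min(x + 2, cols))
                )
    return edges
```

A typical call chain would be `mag, ang = combine_channels([red, green, blue])`, followed by `hysteresis(non_max_suppress(mag, ang))`, where each of `red`, `green` and `blue` is a (magnitude, direction) pair for one color plane.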

Results

We used lena.bmp for detecting the edges. The value of sigma was 1 and the kernel size was kept at 3x3.

[Figures: Original Image; Gradient Image for Red; Gradient Image for Green; Gradient Image for Blue; Final Image. Other images: Color Image; Color Canny Edge Image; Grey Scale Image; Grey Scale Canny Image]

Conclusion

The Canny edge detector was successfully implemented on the CELL/B.E. architecture and the results were found to be satisfactory. The implementation on the CELL/B.E. architecture reduces the time taken to produce the edges compared to normal 64-bit processors. Further optimization could be done by improving the techniques used for DMA transfers and vector multiplications. The main optimizations possible are double buffering the DMA transfers, finding the optimum transfer size, and interleaving the transfer instructions with the computations. Another possible optimization is to maximize the use of the vectors used for multiplications.

Acknowledgement

We are extremely grateful to Dr Stanley Birchfield for his guidance during the course of the project. We are also thankful to Dr Tarek Taha for his support and guidance in programming on the Playstation 3.

References

1. Canny, J., A Computational Approach To Edge Detection, IEEE Trans. Pattern Analysis and Machine Intelligence, 8:679-698, 1986.
2. Neoh, H. S., Hazanchuk, A., Adaptive Edge Detection for Real Time Video Processing using FPGAs, Altera Corporation, San Jose, CA.
3. http://www.blachford.info/computer/cell/cell0_v2.html
4. http://aser.ornl.gov/presentations/ps3_cell_overview-ryan_kerekes.pdf
5. http://cell.fixstars.com/opencv/index.php/opencv_on_the_cell