ConsumerBench 1.1. software benchmark data book. www.eembc.org


Table of Contents

High Pass Grey-Scale Filter
JPEG
RGB to CMYK Conversion
RGB to YIQ Conversion

ConsumerBench Version 1.1

Benchmark Name: High Pass Grey-Scale Filter

Highlights

Benchmarks performance for digital image processing used in digital still cameras and other digital image products.
Explores 2-D data array access and multiply/accumulate capability.
This benchmark has potential for Full-Fury benchmark optimization, especially by SIMD and VLIW architectures.

Application

A high pass grey-scale filter is used in the front-end processing of DSCs (Digital Still Cameras). RGB data from either CCD or CMOS sensors is preprocessed by this filter to deliver image enhancement, and then passed to the JPEG image compression processing. This filter takes a blurry image and sharpens it with a 2-dimensional spatial filter. DSCs implement this filter either in software or hardware, with software giving the flexibility to add customization for picture quality. The number of filter taps can vary from 3(H) x 3(V) to more than 5(H) x 5(V). This benchmark is one of the most frequently used algorithms in image processing and represents a good measure of the CPU performance in digital imaging products.

Benchmark Description

This benchmark explores the target CPU's capability to perform two-dimensional data array access and multiply/accumulate calculation. For each pixel in the image, the filter calculates the output result from the 9 pixels (including the center pixel) multiplied by filter coefficients, accumulated and then shifted right by 8 bits. The 2-dimensional coefficients used here are:

    F11 F21 F31
    F12 F22 F32  = 255
    F13 F23 F33

Each pixel is computed according to the following equation:

    PelValue = (Short)( F11*P(c-w-1) + F21*P(c-w) + F31*P(c-w+1)
                      + F12*P(c-1)   + F22*P(c)   + F32*P(c+1)
                      + F13*P(c+w-1) + F23*P(c+w) + F33*P(c+w+1) )
    Out = (Byte)(PelValue >> 8);

Here, P(i) is the pixel intensity, c is the center location of the filter window, and w is the width of the input image. The data type of P(i) is Byte, and the two-dimensional data is arranged linearly. Therefore addition or subtraction of the horizontal image width w and an offset of -1 or +1 are required to retrieve the 2-dimensional window data. The accumulation is performed on 16-bit data and the final output is converted to Byte data after a shift right by 8 bits. The top/bottom and right/left borders are blacked out by assigning the BLACK value of 0.

The input data size is 320 pixels in the horizontal direction and 240 pixels in the vertical direction. This is a monochrome or gray-scale calculation. It is not an RGB calculation where the same process is performed three times. Usually the enhancement is performed just on the luminance signal Y, which is the gray-scale signal.

If the benchmark score is extrapolated for a larger image, the processing time will be almost linearly proportional to the pixel count (e.g. for a 640 x 480 image, it will be 4 times longer). The iteration/sec score will be the inverse (e.g. for a 640 x 480 image, iteration/sec will be 1/4 of the score).

Analysis of Computing Resources

Out of the Box Benchmark: A for loop calculates the filter output one pixel at a time. For one pixel calculation, the center pixel itself and the eight neighbor pixel data should be loaded. This is a time-consuming process, considering the offset/width index calculation and the time spent on memory or cache access. Higher performance would be expected from a microprocessor with a single-cycle MAC unit.

Full-Fury Benchmark: Because of the simple structure of the multiplication and accumulation, a VLIW or SIMD architecture with multiple MAC units is able to offer a simple acceleration. Another possible optimization is loading multiple Bytes at a time, although a SIMD architecture may show some overhead for rearranging the data to feed the SIMD engine. Regarding the memory architecture, the image data is repeatedly used for the consecutive window and can benefit from a Data Cache. The code size is trivial and will easily fit into a small L1 Instruction Cache.
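The per-pixel equation for PelValue and Out can be sketched in C as below. This is a minimal illustration and not the benchmark source: the function name `highpass_3x3`, its parameters, and the decision to pass the nine coefficients in as an array are assumptions made here, since the coefficient values are not reproduced in this text. Accumulation is done in 32 bits and then narrowed to 16 bits to mirror the (Short) cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: apply a 3x3 filter to an 8-bit grey-scale image
 * stored linearly, row after row.
 * coef[] holds F11,F21,F31, F12,F22,F32, F13,F23,F33 in that order. */
void highpass_3x3(const uint8_t *src, uint8_t *dst,
                  int width, int height, const int16_t coef[9])
{
    assert(width >= 3 && height >= 3);

    /* Black out the whole output first; only interior pixels are
     * overwritten below, so the one-pixel border stays at 0. */
    for (int i = 0; i < width * height; i++)
        dst[i] = 0;

    for (int row = 1; row < height - 1; row++) {
        for (int col = 1; col < width - 1; col++) {
            int c = row * width + col;   /* center of the window */
            int w = width;               /* one row up or down   */
            int32_t acc =
                coef[0] * src[c - w - 1] + coef[1] * src[c - w] + coef[2] * src[c - w + 1] +
                coef[3] * src[c - 1]     + coef[4] * src[c]     + coef[5] * src[c + 1] +
                coef[6] * src[c + w - 1] + coef[7] * src[c + w] + coef[8] * src[c + w + 1];

            /* Narrowing and the shift of a negative value are
             * implementation-defined in C; common compilers wrap and
             * shift arithmetically, which is what this sketch assumes. */
            int16_t pel = (int16_t)acc;
            dst[c] = (uint8_t)(pel >> 8);
        }
    }
}
```

With a center coefficient of 256 and all other coefficients 0, the filter reproduces the interior of the input unchanged, which is a convenient sanity check before trying real sharpening coefficients.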

Benchmark Name: JPEG

Highlights

Benchmarks potential performance for still picture image coding (e.g. still camera).
Compression and decompression benchmarks.
Integer math, with diverse operand types, accessing large image memories.

Application

The JPEG compression benchmark takes an image and encodes it to produce a compressed representation. The JPEG image compression standard provides for a wide range of options in the way that images are compressed. The benchmark uses the baseline subset of image compression tools with parameters that would generally be regarded as typical.

The JPEG decompression benchmark essentially reverses the process of the compression benchmark. Since the compressed image that is used by the benchmark is that produced by the compression benchmark (above), it uses the same set of image coding tools and parameters.

This is the particular image processed during these benchmarks:

[Figure: benchmark test image]

The benchmark provides an indication of the potential performance of a microprocessor in an application requiring still-image compression and decompression (for example a still picture camera).

Benchmark Description

The JPEG compression benchmark takes an image and encodes it. The image used in the benchmark is of relatively low resolution (320 pixels by 240 lines) represented in the RGB (Red-Green-Blue) color space, with each component being represented by 8-bit data.

The benchmark first performs a number of preprocessing steps on the image data:

The image is color-space converted to a YCrCb color space that uses a luminance, Y, component (a black-and-white image) together with two color-difference components, Cr and Cb.
The two color-difference components are scaled so as to have one half the number of pixels and one half the number of lines as the luminance component.

The benchmark code then produces the JPEG header information, which includes data about the size and nature of the image as well as the detailed quantization matrices and Huffman code tables that are being used. (These are required in order for the JPEG decompression to decode the resulting bitstream.)

The JPEG algorithm then segments the image to be coded into a series of MCUs (Minimum Coded Units), each consisting of four 8x8 pixel blocks of the luminance component and the corresponding 8x8 pixel blocks for each of the two color-difference components. Each of these 8x8 pixel blocks is then processed as follows:

2-D DCT: A two-dimensional transform is performed on the data, resulting in an 8x8 array of frequency-domain coefficients for the block. A fast algorithm analogous to the FFT (Fast Fourier Transform) is used. The particular decomposition is such that 16 one-dimensional 8-point transforms are performed, each requiring 12 multiplies and 32 adds.

Quantization: Each of the frequency-domain coefficients is divided by a scale factor unique to that particular spatial frequency to yield an integer code. This will subsequently be used in a decoder (by multiplying by the scale factor) to derive an approximation to the original frequency-domain coefficient. JPEG is lossy in the sense that the decoded images are an approximation to the original images, and it is at this stage that information is lost.

Zig-Zag scan: The quantized coefficients are scanned in a zig-zag fashion to produce a 1-D sequence of 64 coefficients. A large number of these coefficients are zero. Each non-zero coefficient is represented as a SIZE value. The number of zero coefficients preceding the non-zero coefficient is referred to as the RUN.
Huffman encode: Each possible combination of RUN and SIZE is allocated a unique Huffman code word such that statistically likely RUN-SIZE combinations have short code words while RUN-SIZE combinations that occur infrequently have long code words. The appropriate Huffman code word is looked up (a table lookup operation), inserted into the bitstream and followed by SIZE binary digits to specify the value of the quantized coefficient.

(This description glosses over many details but gives a general feel for the operations performed.)

The JPEG decompression benchmark essentially performs the same series of steps, but in reverse. Huffman decoding is a somewhat more complex operation than Huffman encoding (which is just a simple table lookup). The Inverse Quantization stage involves multiplication rather than division and may therefore be less demanding on many processors. However, in broad terms the computational burden of decoding JPEG is similar to that of encoding.

Analysis of Computing Resources

The JPEG benchmarks use a wide range of types of operations:

Operations on 8-bit data for the scaling and color space pre- and post-processing stages.
Extensive arithmetic on 16-bit data in the transform (DCT) and quantization stages, with various intermediate values requiring more than 16 bits.
Table lookup and low-level bit manipulation operations for Huffman coding and decoding and for assembling and unpacking the coded bitstream.

The image used in the benchmark is relatively small (320 pixels by 240 lines) with three bytes per pixel (a total of 225 Kbytes). JPEG is reasonably scalable and engineers might broadly expect the time required to process an image to scale proportionally with the number of pixels. However, the computational demands of JPEG are dependent on image content, particularly in the entropy (Huffman) coding section, and since the statistical content of typical images does vary with image resolution, caution should be exercised in scaling performance over a wide range.

Users of these benchmark results should also be aware that the benchmark is NOT designed to indicate worst-case performance characteristics.
The computational demands of JPEG are dependent on the specific image being coded (or decoded) and on the choice of coding tools and parameters. Though the image and encoded data buffer sizes in memory are large, the algorithm proceeds block-by-block and the data for a block will generally be in cache. The high spatial and temporal locality of reference inherent in the JPEG algorithm allows processors to make good use of caches. Even small data caches work well. Although miss rates decrease as cache size increases, there is no knee in the curve where performance increases markedly because the data fits within the cache. The small algorithm kernel will fit in even very small instruction caches (for example 4 Kbytes); however, the total code size is significantly larger, so larger instruction caches do provide additional performance benefit.
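Returning to the per-block steps in the Benchmark Description, the quantization and zig-zag stages can be sketched together in C. This is a simplified illustration under stated assumptions, not the benchmark code: the names `quantize_and_scan` and `RunSize` are hypothetical, division is rounded to the nearest integer (the text only says the coefficient is divided by a scale factor), and details of real JPEG such as differential DC coding and the limit on RUN length are ignored. The `zigzag` table is the standard JPEG scan order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Standard JPEG zig-zag order: zigzag[k] is the row-major index of the
 * k-th coefficient visited by the scan. */
static const uint8_t zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Hypothetical record for one non-zero quantized coefficient. */
typedef struct {
    int run;    /* zero coefficients preceding this one       */
    int size;   /* number of bits needed for the magnitude    */
    int value;  /* the quantized coefficient itself           */
} RunSize;

/* Quantize an 8x8 block of DCT coefficients, scan it in zig-zag order
 * and emit one RUN/SIZE record per non-zero result.
 * Returns the number of records written to out (at most 64). */
int quantize_and_scan(const int16_t coef[64], const uint16_t qtable[64],
                      RunSize out[64])
{
    int n = 0;
    int run = 0;

    for (int k = 0; k < 64; k++) {
        int idx = zigzag[k];
        int q = qtable[idx];
        int c = coef[idx];
        assert(q > 0);

        /* Assumption: divide by the scale factor, rounding to nearest. */
        int v = (c >= 0) ? (c + q / 2) / q : -((-c + q / 2) / q);

        if (v == 0) {
            run++;              /* extend the current run of zeros */
            continue;
        }

        int size = 0;
        for (int mag = abs(v); mag != 0; mag >>= 1)
            size++;             /* count bits in the magnitude */

        out[n].run = run;
        out[n].size = size;
        out[n].value = v;
        n++;
        run = 0;
    }
    return n;
}
```

A Huffman encoder would then look up a code word for each (run, size) pair and append `size` bits describing `value`. A trailing run of zeros produces no record here; a real encoder signals it with an end-of-block code.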

Benchmark Name: RGB to CMYK Conversion

Highlights

Benchmarks digital image processing performance in printers and other digital imaging products.
Explores basic arithmetic and minimum value detection capability.
This benchmark provides opportunities for Full-Fury benchmark optimization. Conditional moves and multi-byte processing on SIMD or VLIW architectures are effective, for example.

Application

RGB to CMYK conversion is widely used in color printers. RGB input from PC data is converted to CMYK color signals for printing.

Benchmark Description

This benchmark explores the target CPU capability for basic arithmetic and minimum value detection. R, G, B 8-bit pixel color image input is fed to the following equations:

    /* calculate complementary colors */
    c = 255 - R;
    m = 255 - G;
    y = 255 - B;
    /* find the black level K */
    K = minimum(c, m, y);
    /* correct complementary color level based on K */
    C = c - K;
    M = m - K;
    Y = y - K;

RGB values are in the range of [0:255]. CMYK values are in the range of [0:255].

The input and output data size is 320 pixels in the horizontal direction and 240 pixels in the vertical direction. The 320x240 data for RGB and CMYK is stored sequentially as:

    R[0], G[0], B[0], R[1], G[1], B[1], ... R[76799], G[76799], B[76799]
    C[0], M[0], Y[0], K[0], C[1], M[1], Y[1], K[1], ... C[76799], M[76799], Y[76799], K[76799]

The pointers are just incremented by one to access R, G, B or C, M, Y, K data in this order.
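The equations and the interleaved data layout above can be put together in a short C loop. This is a minimal sketch rather than the benchmark source; the function name `rgb_to_cmyk` and its parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper.
 * rgb:  n_pixels * 3 bytes, stored R, G, B, R, G, B, ...
 * cmyk: n_pixels * 4 bytes, stored C, M, Y, K, C, M, Y, K, ... */
void rgb_to_cmyk(const uint8_t *rgb, uint8_t *cmyk, size_t n_pixels)
{
    assert(rgb != NULL && cmyk != NULL);

    for (size_t i = 0; i < n_pixels; i++) {
        /* calculate complementary colors */
        uint8_t c = (uint8_t)(255 - *rgb++);
        uint8_t m = (uint8_t)(255 - *rgb++);
        uint8_t y = (uint8_t)(255 - *rgb++);

        /* find the black level K = minimum(c, m, y) */
        uint8_t k;
        if (c < m)
            k = (c < y) ? c : y;
        else
            k = (m < y) ? m : y;

        /* correct the complementary colors by K and store C, M, Y, K */
        *cmyk++ = (uint8_t)(c - k);
        *cmyk++ = (uint8_t)(m - k);
        *cmyk++ = (uint8_t)(y - k);
        *cmyk++ = k;
    }
}
```

For the 320 x 240 benchmark image, `n_pixels` would be 76800. Because K is the minimum of c, m and y, none of the subtractions can go negative, so every output stays in [0:255].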

If the benchmark score is extrapolated for a larger image, the processing time will be almost linearly proportional to the pixel count (e.g. for a 640 x 480 image, it will be 4 times longer). The iteration/sec score will be the inverse (e.g. for a 640 x 480 image, iteration/sec will be 1/4 of the score). There is data dependency in the cycle counts for the minimum value K search, due to branches being taken or not taken. If this operation is handled by conditional moves, the cycle count will be constant.

Analysis of Computing Resources

Out of the Box Benchmark: A for loop converts one set of RGB inputs to CMYK outputs at a time. A set of R, G, B input data is read from memory by incrementing a read pointer. A set of C, M, Y, K output data is written back to memory by incrementing a write pointer. There is no complex 2-dimensional access like that in the high pass grey-scale filter benchmark. The complementary color calculation and correction are simple subtract calculations without any MAC operation. The minimum value search has two branches for processing each pixel.

    if (c < m) {
        K = (Byte)(c < y ? c : y);
    } else {
        K = (Byte)(m < y ? m : y);
    }

This can be a very expensive routine because of the branch penalty.

Full-Fury Benchmark: By using compare and conditional moves, the branch penalty can be avoided. VLIW and SIMD can process multiple Bytes of data at a time. A SIMD architecture which can handle multiple Bytes of data at a time is especially suited to this benchmark, e.g. a 4-way SIMD microprocessor can handle 4 x 8-bit data every cycle.

Special Notes

Regarding the memory architecture, the image data is used just once and there is no benefit from a big Data Cache, unless the microprocessor has a cache prefetch feature. A small Data Cache will work to fetch consecutive data and avoid external memory access overhead. The code size is trivial and easily fits into a small L1 Instruction Cache.
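The branch penalty in the minimum value search can also be avoided in portable C, without relying on the compiler to emit conditional moves. The sketch below uses a well-known mask trick as one possible approach; it is not taken from the benchmark, and the names `min_branchless` and `black_level` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free minimum of two ints.
 * (a < b) evaluates to 1 or 0, so -(a < b) is either all ones or zero.
 * The mask selects (a ^ b) when a is smaller, turning b into a. */
static int min_branchless(int a, int b)
{
    return b ^ ((a ^ b) & -(a < b));
}

/* Black level K = minimum(c, m, y) with no data-dependent branches. */
uint8_t black_level(uint8_t c, uint8_t m, uint8_t y)
{
    int k = min_branchless(min_branchless(c, m), y);
    assert(k >= 0 && k <= 255);
    return (uint8_t)k;
}
```

Whether this beats the ternary form depends on the compiler and target: many compilers already turn the simple comparison into a conditional move or a dedicated minimum instruction, so measuring both is advisable.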

Benchmark Name: RGB to YIQ Conversion

Highlights

Benchmarks performance for digital video processing.
Explores multiply/accumulate capability.
This benchmark has opportunities for Full-Fury benchmark optimizations, especially by SIMD and VLIW architectures.

Application

RGB to YIQ conversion is used in the NTSC encoder, where the RGB inputs from the camera are converted to a luminance signal (Y) and two chrominance signals (I, Q). In the NTSC encoder, these I, Q signals are modulated by a subcarrier and added to the Y signal. Historically, when color TVs appeared in the market, they had to coexist with the existing monochrome TVs, and this was made possible by the NTSC signal structure. The chrominance signals are averaged out as a fine mesh of invisible signals in monochrome TV sets.

YUV used in the European PAL standard and YCbCr used in the JPEG standard have different codings. All three standards share the same luminance signal Y but the chrominance calculations are different. The matrix calculation scheme used in the RGB to YIQ conversion can be applied to these standards too.

In actual products, this trivial calculation is usually performed in dedicated hardware, especially in digital video products. For cost saving and flexibility, this algorithm can be implemented in software if the CPU is powerful enough and where the digital image is a still picture.

Benchmark Description

This benchmark explores the capability of the CPU to perform a straightforward matrix multiply/accumulate calculation. The R, G, B 8-bit pixel color image input is processed as follows:

    Y = 0.299*R + 0.587*G + 0.114*B
    I = 0.596*R - 0.275*G - 0.321*B
    Q = 0.212*R - 0.523*G + 0.311*B

RGB values are in the range of [0:255]. The conversion coefficients are 16 bits. The multiply/accumulate results are shifted right by 16 bits. Before the shift, 1 is added at the bit position immediately to the right of the LSB of the shifted result, which rounds to the nearest integer. The output is 8-bit data. Y is in the range of [0, 255] and I, Q are in the range of [-127, 127].
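The fixed-point scheme described above (coefficients scaled by 2^16, a rounding bit added, then a 16-bit right shift) can be sketched in C as follows. This is an illustration under assumptions and not the benchmark code: the integer constants are this sketch's own rounding of the coefficients in the equations, the names `rgb_to_yiq_pixel`, `round_shift16` and `clamp_range` are hypothetical, and saturating I and Q to [-127, 127] is an assumed way of meeting the stated output range, since the text does not say how out-of-range values are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Coefficients from the equations, multiplied by 65536 and rounded. */
enum {
    YR = 19595, YG =  38470, YB =   7471,
    IR = 39059, IG = -18022, IB = -21037,
    QR = 13894, QG = -34275, QB =  20382
};

/* Add half an output LSB (bit 15), then shift right by 16 bits.
 * Right-shifting a negative value is implementation-defined in C;
 * this sketch assumes the usual arithmetic shift. */
static int32_t round_shift16(int32_t acc)
{
    return (acc + 32768) >> 16;
}

/* Assumption: results outside the output range are saturated. */
static int32_t clamp_range(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Convert one pixel. Y is unsigned 8-bit, I and Q are signed 8-bit. */
void rgb_to_yiq_pixel(uint8_t r, uint8_t g, uint8_t b,
                      uint8_t *y, int8_t *i, int8_t *q)
{
    assert(y != NULL && i != NULL && q != NULL);

    int32_t yy = round_shift16(YR * r + YG * g + YB * b);
    int32_t ii = round_shift16(IR * r + IG * g + IB * b);
    int32_t qq = round_shift16(QR * r + QG * g + QB * b);

    *y = (uint8_t)clamp_range(yy, 0, 255);
    *i = (int8_t)clamp_range(ii, -127, 127);
    *q = (int8_t)clamp_range(qq, -127, 127);
}
```

The three Y coefficients sum to exactly 65536, so a white input (255, 255, 255) maps to Y = 255 with I and Q at 0. A saturated red input shows why some range handling is needed: 0.596 * 255 is about 152, which exceeds 127 before clamping.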
The input and output data size is 320 pixels in the horizontal direction and 240 pixels in the vertical direction. The 320x240 data for RGB and YIQ is stored sequentially as:

    R[0], G[0], B[0], R[1], G[1], B[1], ... R[76799], G[76799], B[76799]
    Y[0], I[0], Q[0], Y[1], I[1], Q[1], ... Y[76799], I[76799], Q[76799]

The pointers are just incremented by one to access R, G, B or Y, I, Q data in this order.

Analysis of Computing Resources

Out of the Box Benchmark: A for loop converts one set of RGB inputs to YIQ outputs at a time. A set of R, G, B input data is read from memory by incrementing a read pointer. A set of Y, I, Q output data is written back to memory by incrementing a write pointer. There is no complex 2-dimensional access such as that in the high pass grey-scale filter benchmark. The calculation is a straightforward multiplication and accumulation that a microprocessor with a single-cycle MAC unit will benefit from. The code size is trivial and easily fits into a small L1 Instruction Cache.

Full-Fury Benchmark: Because of the simple structure of the multiplication and accumulation, a VLIW or SIMD architecture with multiple MAC units can be used to accelerate performance. A further optimization is the loading of multiple Bytes at a time. Software pipelining could be used to pass the loaded data efficiently to the MAC unit for calculation.