Cache and Bandwidth Aware Matrix Multiplication on the GPU

Jesse D. Hall, Nathan A. Carr, John C. Hart
University of Illinois

Abstract

Recent advances in the speed and programmability of consumer-level graphics hardware have sparked a flurry of research that goes beyond the realm of image synthesis and computer graphics. We examine the use of the GPU (graphics processing unit) as a tool for scientific computing by analyzing techniques for performing large matrix multiplies in GPU hardware. An earlier method for multiplying matrices on the GPU suffered from problems of memory bandwidth. This paper examines more efficient algorithms that make the implementation of large matrix multiplication on upcoming GPU architectures more competitive, using only 25% of the memory bandwidth and instructions of previous GPU algorithms.

1 Introduction

The multiplication of matrices is one of the most central operations applied in scientific computing. Recent history has shown continued research into better-tuned algorithms that improve the efficiency of matrix multiplication. The ATLAS system for automatic tuning of matrix multiplication for target CPUs has shown much success [Whaley et al. 2001]. New advances in PC chip design (such as streaming SIMD extensions) have led to research into how best to leverage modern micro-architectures for this task [Aberdeen and Baxter 2000].

Recently, consumer-based graphics processors (GPUs) have become increasingly powerful and are starting to support programmable features. The parallelism in graphics hardware pipelines makes the GPU a strong candidate for performing many computational tasks, including matrix multiplication [Larson and McAllister 2001].

We detail a new approach for multiplying matrices on GPU hardware. This approach takes advantage of multiple levels of parallelism found in modern GPU hardware and reduces the bandwidth requirements necessary to make this technique effective. The architecture of modern GPUs relevant to this paper is described in Section 2.
In summary, GPUs receive 3D graphics primitives (typically triangles or quadrilaterals) specified as a set of vertices from the application. These vertices are transformed into screen coordinates, and a fragment is generated for each pixel covered by the primitive (generating these fragments is called rasterization). Fragments contain information such as colors and texture coordinates, which are interpolated across the primitive from values associated with the vertices. These auxiliary attributes are used to shade each fragment, which results in a final color that is written to a pixel in the frame buffer. One of the most common ways of shading fragments is to use auxiliary attributes known as texture coordinates to index into a previously supplied image (texture). Multiple sets of texture coordinates can be used to retrieve colors from multiple textures; the results are combined to form the final color for the fragment. This is multitexturing. Finally, a shaded fragment can either replace the current value of the pixel in the frame buffer or be added to the current value. This version of the graphics pipeline is described in [Woo et al. 1997].

The performance of GPUs comes from the fact that large amounts of parallelism are available in this pipeline. In particular, each fragment is independent of all other fragments, so they can be processed in parallel. Processing of fragments can also be overlapped to hide pipeline stalls and memory latencies, resulting in very efficient use of the hardware.

Multitexturing can be used to multiply matrices [Larson and McAllister 2001]. An $m \times n$ matrix can be represented by a greyscale texture, with each pixel containing an element of the matrix (see footnote 1). These matrices can be displayed on the screen by drawing an $m \times n$-pixel rectangle with the texture coordinates $(0, 0)$, $(0, n-1)$, $(m-1, n-1)$, $(m-1, 0)$ assigned to the vertices (clockwise starting with the upper-left vertex). One benefit of this is that we can access the transpose of a matrix by drawing an $n \times m$-pixel rectangle with texture coordinates $(0, 0)$, $(m-1, 0)$, $(m-1, n-1)$, $(0, n-1)$.
The exact same texture is used for drawing the matrix transposed and untransposed. We have only changed the mapping of the texture image onto the rectangle.

Matrix multiplication performs $C = AB$, where $A$ is an $m \times l$-element matrix and $B$ is an $l \times n$-element matrix. By storing matrices $A$ and $B$ as textures, we can compute $C$ in $l$ multitexturing passes as shown in Figure 1.

    Clear the screen.
    Set the drawing mode to overlay.
    Load texture texa with matrix A.
    Load texture texb with matrix B.
    Set the multitexturing mode to modulate.
    Set frame buffer write mode to accumulate.
    for i = 0 ... l - 1
        draw an m x n-pixel rectangle with
            texa coords (0, i), (0, i), (m - 1, i), (m - 1, i), and
            texb coords (i, 0), (i, n - 1), (i, n - 1), (i, 0).
    Screen contains result of A x B.

Figure 1: The Larson-McAllister multipass algorithm for multiplying two matrices.

The texture coordinates $(0, i)$, $(0, i)$, $(m-1, i)$, $(m-1, i)$ replicate the $i$th column across the entire rectangle, whereas the texture coordinates $(i, 0)$, $(i, n-1)$, $(i, n-1)$, $(i, 0)$ replicate the $i$th row. These textured rectangles are then combined as demonstrated in Figure 2.

Footnote 1: Textures are typically indexed using $(s, t)$, with $s$ indexing the horizontal axis and $t$ indexing the vertical axis. This is the opposite of standard matrix notation, where the first index represents the row and the second index represents the column. In the rest of this paper, we use the matrix style rather than the texture style. Also, texture coordinates have traditionally been in the range [0...1], with various filters used to generate a color for indices that fall between pixels. We use an extension [Kilgard 2001] to allow integer indexing, and disable filtering.
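The passes of Figure 1 can be emulated on the CPU to see why accumulating $l$ modulated rectangles yields the product. The following is only an illustrative Python sketch, not the original implementation: the function name `larson_mcallister_multiply` and the list-of-lists "frame buffer" are assumptions made for this example, and frame buffer saturation is not modeled.

```python
def larson_mcallister_multiply(A, B):
    """Emulate the multipass scheme of Figure 1 (hypothetical helper)."""
    m, l, n = len(A), len(B), len(B[0])
    # "Clear the screen": the frame buffer starts as an m x n grid of zeros.
    frame = [[0.0] * n for _ in range(m)]
    for i in range(l):                  # one rendering pass per value of i
        for r in range(m):              # every fragment of the rectangle
            for c in range(n):
                tex_a = A[r][i]         # column i of A, replicated across the rectangle
                tex_b = B[i][c]         # row i of B, replicated across the rectangle
                frame[r][c] += tex_a * tex_b   # modulate, then accumulate
    return frame


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]            # 3 x 2
B = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]    # 2 x 4
C = larson_mcallister_multiply(A, B)                # 3 x 4 product
```

Each pass contributes the outer product of one column of A with one row of B, so after $l$ passes the accumulated values equal the entries of $AB$.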

[Figure 2 panels: Pass 1 to Pass 4, each combining a column of A (A col. 1 to A col. 4) with a row of B (B row 1 to B row 4); Result A x B.]

Figure 2: Demonstration of Larson-McAllister matrix multiplication on a pair of $4 \times 4$ matrices. (The output in this example is saturated, such that results greater than one appear uniformly white.)

2 Modern GPU Organization

The graphics pipeline implemented on graphics acceleration hardware was classically organized as a series of transformation, clipping and rasterization steps. Modern graphics hardware has generalized this pipeline into programmable elements. The modern graphics pipeline consists of vertex processing, rasterization and fragment processing.

The vertex processor performs operations on the individual vertices of triangles sent to the graphics accelerator. Once transformed, these triangles are rasterized into a collection of pixels. Each pixel output by the rasterizer is called a fragment. The rasterization process linearly interpolates attributes, such as texture coordinates, stored at the vertices and stores the interpolated values at each fragment. A fragment processor uses the interpolated texture coordinates to look up texture values from texture memory, and can perform special-purpose arithmetic operations on both the texture addresses and the fetched texture values.

The vertex processor is structured similarly to vector processors (pipelined on a stream of vertices), whereas the fragment processor is structured similarly to a SIMD array processor (one processor per pixel). Our experiments have shown that the vertex processor does not provide much advantage over existing CPU capabilities, whereas the fragment processor already outperforms the CPU on some operations like ray-triangle intersections [Carr et al. 2002].

Because these processors were originally developed for texturing, the programs the GPU executes are called shaders. A fragment shader is a program executed by the fragment processor that describes the color of each fragment before it is possibly assigned to its corresponding pixel. The inputs to a fragment shader are a set of program constants, interpolated attribute data from triangle vertices, and texture maps (blocks of texture memory addressed by the texture coordinates). Before rendering, the fragment shader is compiled and loaded into the graphics hardware. Primitives (in our case quadrilaterals) are then sent down the graphics pipeline, invoking the enabled fragment shader as they are rasterized. The output of a fragment shader is an output color plotted to the screen.

A fragment shader is allotted a fixed set of temporary registers R0...Rn. Each register holds a single 4-vector corresponding to the four color channels red, green, blue, alpha. The color channels of each register may be accessed individually as Ri.c, where $c \in \{r, g, b, a\}$ and $i \in \{0, \ldots, n\}$. Standard arithmetic operations, such as addition and multiplication, are defined over the set of registers. For example,

    R2.b = R1.a + R0.g

assigns to the blue channel of register R2 the sum of the alpha channel of register R1 and the green channel of R0. Modern fragment shaders allow up to four such operations to be executed simultaneously, much like the SIMD instructions found in modern PC architectures. For example,

    R2 = R1.abgr * R0.ggab          (1)

can be issued as a single GPU instruction, which refers to four simultaneous multiplications numerically equivalent to:

    R2.r = R1.a * R0.g
    R2.g = R1.b * R0.g
    R2.b = R1.g * R0.a
    R2.a = R1.r * R0.b              (2)

The SIMD nature of the operation defined in (1) allows the four multiplications to occur in parallel, taking one fourth the computation time of (2). In equation (1), R1's color channels are referenced in arbitrary order. This is referred to as swizzling. The green channel of R0 is referenced multiple times. This is known as smearing. Arbitrary swizzling and smearing (and also negation) of input operands can be done with no performance penalty. This is in contrast to Intel's SSE instructions, where moving data between channels requires additional instructions.
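The equivalence between the single swizzled instruction (1) and the four scalar multiplications (2) can be checked with a small emulation. This is a sketch under stated assumptions: the helpers `swizzle` and `mul4` are hypothetical names introduced for illustration, and registers are modeled as plain (r, g, b, a) tuples rather than hardware registers.

```python
CHANNELS = "rgba"


def swizzle(reg, pattern):
    """Read the channels of a 4-vector register in the order given by pattern."""
    return tuple(reg[CHANNELS.index(ch)] for ch in pattern)


def mul4(x, y):
    """Component-wise 4-wide multiply, standing in for one SIMD instruction."""
    return tuple(a * b for a, b in zip(x, y))


R0 = (1.0, 2.0, 3.0, 4.0)   # (r, g, b, a)
R1 = (5.0, 6.0, 7.0, 8.0)

# Equation (1): one 4-wide instruction with swizzled and smeared operands.
R2 = mul4(swizzle(R1, "abgr"), swizzle(R0, "ggab"))

# Equation (2): the same result written as four scalar multiplications.
R2_scalar = (R1[3] * R0[1], R1[2] * R0[1], R1[1] * R0[3], R1[0] * R0[2])
```

Here the pattern "ggab" reads the green channel twice, which is the smearing described above, while "abgr" reorders all four channels of R1.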
The output color of a fragment program is placed in a designated register (usually R0) upon termination of the program. This value is written to the fragment's screen location in the frame buffer.

Fragment shaders have access to three kinds of data: constants, interpolated vertex attribute data (e.g. texture coordinates), and texture data (indexed by texture coordinates). Interpolated vertex attribute data are accessed as the registers T0...Tm, where $m + 1$ is the number of attributes stored with each vertex. Data is fetched from texture memory by the lookup() operation. For example, R0 = lookup(T0.r, T0.g, M) uses the first two coordinates of T0 to access the texture M. We can also perform arithmetic on the texture coordinates before the fetch, or we can use the result of one texture fetch as the coordinates for a second texture fetch (dependent texturing).

Although fragment shaders provide a very powerful SIMD model for programming, they are currently limited in a number of ways. Modern implementations restrict the number of available registers, the total instruction length, and the number of lookup() operations that may occur in a given fragment shader. Control flow is also restricted in fragment shading. For example, branching is not supported and conditional execution is limited to predicating instructions on previously set condition codes.

Our model for fragment shaders is based on the one described by the upcoming DirectX 9.0 specification [Marshall 2001]. This model provides capabilities currently found in vertex shaders at the fragment shader level. This model has also been used to describe the implementation of a ray tracer as a fragment shader [Purcell et al. 2002]. This paper assumes similar fragment processor capabilities, specifically fragment shaders of up to 256 instructions, an unrestricted number of texture access operations, a set of at least six registers, and standard single-precision floating point data formats.

3 Cache Aware Matrix Multiply

Suppose we are multiplying two large matrices $X$ and $Y$ whose dimensions are, without loss of generality, a perfect power of two, with $n = 2^i$ rows and $n = 2^i$ columns. A general algorithm for computing $Z = XY$ can be expressed as follows:

    for i = 1 ... n
        for j = 1 ... n
            Z_ij = 0
            for m = 1 ... n
                Z_ij = Z_ij + X_im * Y_mj                 (3)

The outer two loops are implemented on the GPU by rendering a single screen-filling quadrilateral. This implies that everything within the outer two loops must be handled by the fragment shader. Below we have inserted pixel shader pseudo-code in the appropriate places. Matrices $X$ and $Y$ are now assumed to be stored in single-channel texture maps and accessed through the fragment shader lookup() operation.

    for i = 1 ... n
        for j = 1 ... n
            fragment shader:
                R0.r = 0
                for m = 1 ... n
                    R1.r = lookup(i, m, X)
                    R2.r = lookup(m, j, Y)
                    R0.r = R0.r + R1.r * R2.r             (4)

The pseudo-code in (4) requires either that loops are available in fragment shaders or that the fragment shader instruction limit is large enough to allow the innermost loop to be unrolled. We assume neither is realistic. To address this issue, we turn to a standard blocking strategy. Blocking has been shown to improve cache performance, but for this application blocking also serves the purpose of allowing us to work within the constraints of our fragment programming model. The pseudo-code for our blocking strategy is shown in (5). A new matrix $F$ is introduced that is initialized to all zeroes and used as a temporary store by the routine. The value $b$ is a scalar representing the block size.

    for k = 1 ... n step b
        for i = 1 ... n
            for j = 1 ... n
                fragment shader:
                    R3.r = 0
                    for m = k ... k + b - 1
                        R1.r = lookup(i, m, X)
                        R2.r = lookup(m, j, Y)
                        R3.r = R3.r + R1.r * R2.r
                    R4.r = lookup(i, j, F)
                    R0.r = R3.r + R4.r
        copy frame buffer into texture F                  (5)

This new algorithm is a multipass method requiring multiple renderings to the frame buffer. The outer three loops are handled by rendering a screen-filling quadrilateral to the frame buffer $n/b$ times. Between each of the $n/b$ passes, the frame buffer is copied into texture map $F$ to be accumulated with the result of the next pass. This copy operation is required since modern GPU hardware does not support direct lookup operations on the frame buffer. Some graphics hardware does, however, support blending modes that allow fragment values to accumulate directly with the contents of the frame buffer, eliminating the need for the temporary texture $F$ and consequently allowing more efficient rendering.

The fragment program (which covers the portion inside the $j$ loop) can now be unrolled by choosing an appropriate value of $b$. For our tests we have chosen $b = 32$. We were able to reduce our total fragment program instruction count to four instructions per iteration of the $m$ loop, for a total of […] instructions.

4 Multi-Channel GPU Matrix Multiplies

Texture map sizes on modern-day GPUs are often restricted. Let $n$ be the maximum size of any dimension. NVidia's GeForce4 Ti4600 has a maximum allowable renderable size of $n = 2048$, limiting multipass programs to two-dimensional textures containing at most $n \times n$ elements. Texture maps may consist of between one and four channels (luminance, luminance-alpha, RGB or RGBA). This implies that the GeForce4 can handle textures in size up to $2048 \times 2048 \times 4$. Methods have already been presented for handling matrices whose dimensions are at most $2048 \times 2048$ using single-channel texture maps.
It is naturally desirable to be able to handle matrices of larger sizes. The GeForce4, for example, should be able to multiply matrices $4096 \times 4096$ in size by utilizing all four of the color channels. This section describes a matrix multiplication algorithm that takes advantage of this four-component storage capability.

These four-channel textures store $4096 \times 4096$ matrices of 32-bit floating point values, and occupy 64 MB. Our GPU matrix multiplication implementation requires four times this space for storing the two operands, a temporary store, and the result. Current consumer-level GPUs such as the GeForce4 ship with 128 MB of on-board memory, suggesting a maximum capable matrix side of […]. Exceeding this memory threshold under the non-unified memory architecture of present-day PC GPUs results in paging to main system memory and increased traffic over the graphics card bus.
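The storage figures above follow from simple arithmetic, sketched below in Python. The sketch assumes a $4096 \times 4096$ matrix (the size consistent with the stated 64 MB), four channels per texel, 4-byte floats, and binary megabytes; the function name `rgba_texture_bytes` is hypothetical.

```python
BYTES_PER_FLOAT = 4     # single-precision value
CHANNELS = 4            # RGBA
MIB = 1024 * 1024       # assumption: "MB" read as 2**20 bytes


def rgba_texture_bytes(matrix_side):
    """Bytes needed to store a square matrix with four sub-matrices per texel."""
    texture_side = matrix_side // 2     # each texel holds one entry of each quadrant
    return texture_side * texture_side * CHANNELS * BYTES_PER_FLOAT


one_texture = rgba_texture_bytes(4096)
working_set = 4 * one_texture           # two operands, temporary store, result
fits_on_board = working_set <= 128 * MIB
```

Under these assumptions a single matrix occupies 64 MB and the full working set occupies 256 MB, which is why the 128 MB of on-board memory bounds the usable matrix size.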

4.1 Basic Formulation

Suppose we are multiplying two large matrices $X$ and $Y$ whose dimensions are, without loss of generality, a perfect power of two, with $2^i$ rows and $2^i$ columns.

\[ XY = Z \tag{6} \]

The matrix multiply in (6) can be expressed as the following series of matrix multiplies of smaller matrices:

\[ \overbrace{\begin{bmatrix} A & B \\ C & D \end{bmatrix}}^{X} \; \overbrace{\begin{bmatrix} E & F \\ G & H \end{bmatrix}}^{Y} = \overbrace{\begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{bmatrix}}^{Z} \tag{7} \]

Elements $A$, $B$, $C$, $D$, $E$, $F$, $G$, and $H$ are sub-matrices decomposing $X$ and $Y$. Let the dimensions of $A \ldots H$ be $2^{i-1}$ rows by $2^{i-1}$ columns.

4.2 Blocked Matrix Texture Maps

We can store matrices $X$ and $Y$ as $2^{i-1}$ by $2^{i-1}$ sized texture maps on the GPU by placing the four sub-matrices in the different color channels RGBA as

\[ X = \begin{pmatrix} A_r & B_g \\ C_b & D_a \end{pmatrix}, \qquad Y = \begin{pmatrix} E_r & F_g \\ G_b & H_a \end{pmatrix}. \tag{8} \]

We now introduce subscript notation on matrices $M$, such that $M_i$ for $i \in \{r, g, b, a\}$ refers to the sub-matrices composing $M$. For example, $X_r Y_g = AF$. We presented our technique for multiplying matrices contained in a single color channel, $X_r Y_r = Z_r$, in Section 3. We can extend this same approach to a SIMD notation by using multiple subscripts $r, g, b, a$. For example:

\[ X_{rrbb} Y_{rgrg} = \begin{pmatrix} AE_r & AF_g \\ CE_b & CF_a \end{pmatrix}. \tag{9} \]

The above notation assumes an architecture where four sub-matrices may be operated on in parallel by a single instruction. This notation is useful since graphics hardware is designed in a SIMD manner to work simultaneously on four color channels at a time. Using this notation, we can now concisely express matrix multiplication of $X$ and $Y$ as follows:

\[ \begin{aligned} X_{rgba} Y_{rgba} &= X_{rrbb} Y_{rgrg} + X_{ggaa} Y_{baba} \\ &= \begin{pmatrix} AE_r & AF_g \\ CE_b & CF_a \end{pmatrix} + \begin{pmatrix} BG_r & BH_g \\ DG_b & DH_a \end{pmatrix} = Z_{rgba} \end{aligned} \tag{10} \]

4.3 The Multi-Channel Algorithm

To turn the formulation derived in (10) into an algorithm usable by graphics hardware, we must first re-examine fragment shader programming. As discussed in Section 2, a single lookup() operation can retrieve a 4-vector corresponding to the four color channels. If we store $X$ as a texture map in blocked form (8), then a single lookup R0 = lookup(i, j, X) can be used to retrieve four values coming from the sub-matrices: $R0 = (A_{ij}, B_{ij}, C_{ij}, D_{ij})$.
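The blocked storage of (8) and the swizzled product of (10) can be checked numerically with a CPU emulation. The Python sketch below is illustrative only: all helper names (`pack_rgba`, `unpack_rgba`, `swizzle`, `madd`, `multichannel_multiply`, `naive_multiply`) are hypothetical, texels are modeled as (r, g, b, a) tuples, and the block-by-block passes with a temporary texture F follow the multipass idea of Section 3 rather than any real GPU API.

```python
CHANNELS = "rgba"


def pack_rgba(M):
    """Pack a 2h x 2h matrix into an h x h texture of (r, g, b, a) texels, as in (8)."""
    h = len(M) // 2
    return [[(M[i][j], M[i][j + h], M[i + h][j], M[i + h][j + h])
             for j in range(h)] for i in range(h)]


def unpack_rgba(T):
    """Inverse of pack_rgba: rebuild the 2h x 2h matrix from its texels."""
    h = len(T)
    M = [[0.0] * (2 * h) for _ in range(2 * h)]
    for i in range(h):
        for j in range(h):
            M[i][j], M[i][j + h], M[i + h][j], M[i + h][j + h] = T[i][j]
    return M


def swizzle(texel, pattern):
    """Read the channels of a texel in the order given by pattern."""
    return tuple(texel[CHANNELS.index(c)] for c in pattern)


def madd(x, y, acc):
    """4-wide multiply-add: acc + x * y, channel by channel."""
    return tuple(a + p * q for p, q, a in zip(x, y, acc))


def multichannel_multiply(X, Y, block):
    """Multiply two 2h x 2h matrices using four-channel texels and blocked passes."""
    h = len(X) // 2
    tex_x, tex_y = pack_rgba(X), pack_rgba(Y)
    F = [[(0.0,) * 4 for _ in range(h)] for _ in range(h)]    # temporary texture
    for k in range(0, h, block):                               # one pass per block
        frame = [[None] * h for _ in range(h)]
        for i in range(h):
            for j in range(h):
                r3 = (0.0,) * 4
                for m in range(k, min(k + block, h)):
                    r1, r2 = tex_x[i][m], tex_y[m][j]
                    r3 = madd(swizzle(r1, "rrbb"), swizzle(r2, "rgrg"), r3)
                    r3 = madd(swizzle(r1, "ggaa"), swizzle(r2, "baba"), r3)
                frame[i][j] = tuple(a + f for a, f in zip(r3, F[i][j]))
        F = frame                                              # copy frame buffer into F
    return unpack_rgba(F)


def naive_multiply(X, Y):
    """Reference triple-loop product used only for comparison."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


X = [[float(4 * i + j) for j in range(4)] for i in range(4)]
Y = [[float((i + 2 * j) % 5) for j in range(4)] for i in range(4)]
Z = multichannel_multiply(X, Y, block=1)
```

Each texel accumulates $AE + BG$, $AF + BH$, $CE + DG$ and $CF + DH$ in its r, g, b and a channels, so unpacking the final texture reproduces the ordinary product for any block size.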
The four matrix multiplies from equation (10) may now be parallelized within a single fragment shader. The matrix swizzling and smearing suggested by (10) is handled at the per-element level utilizing the capabilities of graphics hardware, as shown in (11).

    for k = 1 ... n/2 step b
        for i = 1 ... n/2
            for j = 1 ... n/2
                fragment shader:
                    R3 = 0
                    for m = k ... k + b - 1
                        R1 = lookup(i, m, X)
                        R2 = lookup(m, j, Y)
                        R3 = R1.rrbb * R2.rgrg + R3
                        R3 = R1.ggaa * R2.baba + R3
                    R4 = lookup(i, j, F)
                    R0 = R3 + R4
        copy frame buffer into texture F                  (11)

The above algorithm represents an efficient use of the SIMD computation power of the GPU by working on all four color channels in parallel. This implementation only increases the fragment shader instruction count by one per iteration of $m$, thus resulting in a total fragment program length of […].

5 Analysis

We have analyzed the new blocked and multi-channel GPU matrix multiplication algorithms with respect to memory bandwidth, instruction count and predicted performance.

5.1 Bandwidth Considerations

To analyze the potential bandwidth limitations of our approach, we first distinguish between the two bandwidth-limited areas of modern GPUs. The external bandwidth is the rate at which data may be transferred between the GPU and the main system memory. On modern PCs this is limited by the speed of the AGP graphics bus, which can transfer data at a rate of 1 GB/sec. The internal bandwidth is the rate at which the GPU may read and write from its own internal memory. The GeForce4 Ti 4600 is currently capable of transferring 10.4 GB/sec.

The external bandwidth of the GPU affects our application in two areas. First, the matrices must be copied into the GPU's memory as texture maps, and the result of the computation must be read back from the card into main memory. These transfers use the AGP bus, which currently has a theoretical bandwidth of about 1 GB/s (for AGP 4x). However, in practice sending data to the GPU is much faster than reading data back from the GPU (since the hardware and drivers are optimized for this case), and the best speed we have measured for reading data back into host memory is 175 MB/s.
Even assuming an average of 200 MB/s transfer for both reads and writes, transferring two $1024 \times 1024$ single-precision matrices to the GPU and reading the result back requires 60 ms. At 4 GFLOPS the actual computation takes about 510 ms. For this problem, transfer time is about 11% of the total time. Thus, external bandwidth is a significant but not overwhelming fraction of the total time.

The second part of the algorithm affected by external bandwidth is the time to send the geometry, the screen-filling quads onto which the matrices are texture mapped. In fact this is a very small amount of data (about 48 bytes per pass), and graphics hardware is very good at transferring geometry information in parallel with other tasks, including running fragment shaders. Thus, this cost is negligible.

One of the primary bottlenecks in performing matrix multiplies on the GPU is the internal bandwidth [Larson and McAllister 2001]. This is also true for CPU implementations. For our analysis we consider multiplying two $n \times n$ matrices. For both the multi-channel and single-channel block-matrix approaches, the processing of each fragment requires two texture lookup() operations per iteration of the inner $m$ loop, plus an additional lookup() to combine it with the results from the previous pass. A single write to output occurs per fragment per pass as its result is written to the frame buffer. Thus, there are $2b + 2$ memory operations per fragment per pass. In our single-channel method, every memory operation transfers 4 bytes of data. Our multi-channel method transfers 16 bytes of data (4 channels, 4 bytes per channel) per memory operation.

    method    passes      frags/pass    bytes/frag       total
    L-M       $n$         $n^2$         $16$             $16 n^3$
    single    $n/b$       $n^2$         $(2b + 2)\,4$    $8 n^3 (1 + 1/b)$
    multi     $n/(2b)$    $(n/2)^2$     $(2b + 2)\,16$   $4 n^3 (1 + 1/b)$

Table 1: Bytes transferred internally by each GPU matrix multiplication method.

Table 1 summarizes these results and shows the total internal bandwidth in bytes transferred by each method, which is just the product of the number of passes, the fragments per pass and the bytes per fragment. The L-M (Larson-McAllister) figures assume an implementation based on a single floating-point channel. (Even though multiple channels were mentioned, [Larson and McAllister 2001] did not describe a multi-channel implementation.) The four-byte floats are accessed four times (two matrices, a temporary store and a result) for a total of 16 bytes transferred per fragment.

Our single-channel block matrix algorithm performs identically to L-M when $b$ is set to one (no blocking). As the blocking size grows, the bandwidth drops by nearly a factor of two compared to L-M. The multi-channel method further reduces the memory bandwidth by exactly one half over the single-channel method, reducing the bandwidth to nearly 25% of L-M.
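The totals in Table 1 are products of passes, fragments per pass and bytes per fragment, which can be evaluated directly. The Python sketch below assumes that $b$ divides $n/2$ so that the pass counts are whole numbers; the function names are hypothetical and chosen only for this example.

```python
def bytes_lm(n):
    """Larson-McAllister: n passes, n^2 fragments, 16 bytes per fragment."""
    return n * n**2 * 16


def bytes_single(n, b):
    """Single-channel blocked: n/b passes, n^2 fragments, (2b + 2) 4-byte operations."""
    return (n // b) * n**2 * (2 * b + 2) * 4


def bytes_multi(n, b):
    """Multi-channel blocked: (n/2)/b passes, (n/2)^2 fragments, (2b + 2) 16-byte operations."""
    return ((n // 2) // b) * (n // 2)**2 * (2 * b + 2) * 16


n, b = 1024, 32
ratio_multi_to_lm = bytes_multi(n, b) / bytes_lm(n)     # fraction of L-M traffic
gib_multi = bytes_multi(n, b) / 2**30                   # total traffic in binary gigabytes
```

For $n = 1024$ and $b = 32$ the multi-channel total is about a quarter of the L-M total, in line with the "nearly 25%" figure quoted above.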
5.2 Instructions

The GPU uses the fact that the same instructions are being executed for a large number of fragments to overlap the processing of different fragments. As no communication between executions of the fragment shaders is needed, a large amount of parallelism is available. This parallelism is used to hide the latency of memory operations and other causes of stalls. As a result, when enough fragments are available (as in our case), the running time of a fragment shader is approximately linear in the number of instructions executed (assuming computation is the limiting factor). Therefore, it makes sense to analyze the number of instructions used by our algorithm.

Each execution of the fragment shader needs four instructions for setup and adding in the result of the previous pass. The multi-channel algorithm also needs three instructions per iteration of the inner loop (two multiply-add instructions and one addition to update the texture indices). The single-channel algorithm removes one of the multiply-add instructions. Therefore, the instruction counts are $3b + 4$ for the multi-channel case and $2b + 4$ for the single-channel case. Note that in the multi-channel case, many of the instructions involve a 4-wide data issue, operating on the four color channels in parallel. Currently there is no performance penalty for this since modern GPUs are designed to natively work on multiple channels. Table 2 summarizes the analysis and the total GPU floating point instructions required by each method.

    method    passes      frags/pass    inst/frag    total inst
    L-M       $n$         $n^2$         $2$          $2 n^3$
    single    $n/b$       $n^2$         $2b + 4$     $n^3 (2b + 4)/b$
    multi     $n/(2b)$    $(n/2)^2$     $3b + 4$     $n^3 (3b + 4)/(8b)$

Table 2: Floating point operations required by each GPU matrix multiplication method.

The additional fragment program overhead makes the single-channel block-structured matrix multiplication longer than the L-M algorithm. As $b$ increases, the instruction count asymptotically approaches that of the L-M algorithm. The multi-channel method executes 3/16ths as many instructions as either of the single-channel methods as the block size grows.
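The instruction totals can be evaluated in the same way as the bandwidth totals. In the Python sketch below the per-fragment counts $2b + 4$ and $3b + 4$ come from the text; the L-M figure of two instructions per fragment is an assumption inferred from the statement that the single-channel count approaches L-M as $b$ grows. Function names are hypothetical, and $b$ is assumed to divide $n/2$.

```python
def inst_lm(n):
    """Assumed L-M cost: n passes, n^2 fragments, 2 instructions per fragment."""
    return n * n**2 * 2


def inst_single(n, b):
    """Single-channel blocked: n/b passes, n^2 fragments, 2b + 4 instructions each."""
    return (n // b) * n**2 * (2 * b + 4)


def inst_multi(n, b):
    """Multi-channel blocked: (n/2)/b passes, (n/2)^2 fragments, 3b + 4 instructions each."""
    return ((n // 2) // b) * (n // 2)**2 * (3 * b + 4)


n, b = 1024, 32
single_over_lm = inst_single(n, b) / inst_lm(n)     # overhead of blocking at b = 32
multi_over_lm = inst_multi(n, b) / inst_lm(n)       # fraction of L-M instruction count

# With a much larger block the ratio approaches the 3/16 limit from above.
large_n, large_b = 2**20, 2**18
limit_ratio = inst_multi(large_n, large_b) / inst_lm(large_n)
```

Under these assumptions the ratio of multi-channel to L-M instructions equals $(3b + 4)/(16b)$, which tends to 3/16 as the block size increases.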
5.3 Performance

ATLAS has demonstrated 4.0 GFLOPS for matrix multiplication on a 1.5 GHz Pentium 4 using Intel's SSE2 SIMD instructions [Dongarra 2001]. Is the GPU comparable to this? For a matrix size of $1024 \times 1024$ and a block size of 32 ($n = 1024$, $b = 32$), our multi-channel algorithm transfers 4.125 GB of data. In order to match the ATLAS P4 SSE numbers, we need to perform this multiplication in 0.5 seconds. This means we will need 8.25 GB/s of bandwidth. Current hardware has a theoretical bandwidth of 10.4 GB/s to the main memory; as with CPUs, it also has a cache between the GPU and memory which supports much higher bandwidth. Thus, existing hardware should be able to support our bandwidth needs. Future hardware is likely to improve both cache size and performance and memory bandwidth.

Performance of CPU implementations of matrix multiplication is typically limited by memory bandwidth, not CPU speed. Larson and McAllister [Larson and McAllister 2001] reported similar results with their GPU implementation. However, due to the much lower clock speeds of GPUs relative to CPUs, the move from fixed-point byte operations to floating point may increase the processing requirements enough to make the GPU speed the bottleneck.

6 Conclusion

We have presented a multi-channel block-based GPU matrix multiplication algorithm. The block-structured approach should yield greater cache coherence than previous methods. We also demonstrated that our implementation uses only about 25% of the memory bandwidth and instructions when compared to the previous method.

Our results are currently theoretical as we anticipate the implementation of graphics hardware that supports the DirectX 9.0 standard. We expect such hardware will be available before the final version of this paper is required. (For example, as of this writing, 3Dlabs has just announced a processor, the P10, that partially satisfies upcoming standards.)

We will then be able to provide actual implementation times comparing our cache and bandwidth aware algorithms to the previous work [Larson and McAllister 2001]. We have made numerous assumptions about the performance of the upcoming hardware. These assumptions are based on the speed of existing hardware, but with the generality to handle the upcoming standards. These simulations and analyses suggest that our method will be competitive with modern CPU implementations. The existence of hardware implementations will allow us to perform further tuning and validate our claims with empirical data.

Available hardware would allow us to automatically tune our algorithm for a given GPU in much the same manner as performed by ATLAS. Empirical tests may be run to search over the algorithm's parameter space and select the best algorithm for a target GPU. Our algorithm is currently parameterized by its block size, but we could also introduce additional parameters to control the order of rasterization and the blocking of memory layouts.

The increased memory bandwidth and SIMD organization of the GPU should make it a good choice for scientific applications. We have nonetheless found that the GPU remains about as powerful as the CPU on actual tasks like matrix multiplication and ray tracing [Carr et al. 2002]. This has been disappointing given the increased bandwidth and processing power of the GPU. The constraints of GPU programming, coupled with the low-bandwidth connection from the GPU back to the CPU, have been major obstacles to capitalizing on the GPU for scientific applications.

We are nonetheless encouraged by the potential of the GPU. Whereas future enhancements to the CPU explore parallelism through speculative execution and other probabilistic methods, the GPU can exploit parallelism across the frame buffer and across the geometric data. This has been partially responsible for the dominance of the GPU performance growth rate over that of the CPU. As GPU growth continues to outpace CPU growth, we expect the GPU will become the preferred platform for personal high-performance scientific computing.
Acknowledgments

This research was supported in part by the NSF under the ITR grant ACI […], and by NVidia Corp. Conversations with Jack Dongarra and Jim Demmel (and his students) were also quite helpful.

References

Aberdeen, D., and Baxter, J. 2000. General matrix-matrix multiplication using SIMD features of the PIII (research note). In European Conference on Parallel Processing.

Carr, N. A., Hall, J. D., and Hart, J. C. 2002. The ray engine. Tech. Rep. UIUCDCS-R […], University of Illinois at Urbana-Champaign, Mar.

Dongarra, J. 2001. An update of a couple of tools: ATLAS and PAPI. DOE Salishan Meeting (Available from SLIDES/salishan.ps), Apr.

Kilgard, M. J. 2001. GL_NV_texture_rectangle. content/nvopenglspecs/GL_NV_texture_rectangle.txt.

Larson, S. E., and McAllister, D. 2001. Fast matrix multiplies using graphics hardware. Super Computing (Nov.).

Marshall, B. 2001. DirectX graphics future. Microsoft DirectX Meltdown 2001 (Jul.).

Purcell, T. J., Buck, I., Mark, W. R., and Hanrahan, P. 2002. Ray tracing on programmable graphics hardware. In Proceedings of SIGGRAPH 2002, ACM Press / ACM SIGGRAPH, J. F. Hughes, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM.

Whaley, R. C., Petitet, A., and Dongarra, J. 2001. Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 1-2.

Woo, M., Neider, J., Davis, T., and Shreiner, D. 1997. OpenGL Programming Guide. Addison-Wesley, Reading, MA, USA.