OUT OF ORDER RENDERING ON VISUALIZATION CLUSTERS

Transcription

1 OUT OF ORDER RENDERING ON VISUALIZATION CLUSTERS Karl Rasche Department of Computer Science Clemson University Clemson, South Carolina, USA Robert Geist Department of Computer Science Clemson University Clemson, South Carolina, USA James Westall Department of Computer Science Clemson University Clemson, South Carolina, USA ABSTRACT A technique for workload balancing on distributed visualization clusters is suggested. The round-robin scheduling of partial object geometry chunks avoids render node starvation and balances the data backlog in the send buffers of the geometry nodes. Because frame synchronization requires that all send buffers drain, balancing these send buffer backlogs is seen to substantially improve frame rate performance for dynamic displays. KEY WORDS Chromium, distributed rendering, tiled displays, workload balancing processing including reading back the frame buffers of the rendering nodes for visualization on special display nodes. Most rendering applications make use of a standard graphics library such as OpenGL [4]. Chromium [5] is a generic software framework for distributing and processing streams of OpenGL commands. The rendering application context is described by a directed graph of stream processing units, or SPUs. Calls by the application to the OpenGL library are intercepted by Chromium and re-routed through the SPU graph. For a sort-first system, one uses a Tile sort SPU to sort and distribute geometry to a set of Render SPUs or Readback SPUs. The latter type renders and makes results available for downstream processing. 1 Introduction In a sort-first parallel rendering system the workload is divided based on a tiling of the two-dimensional display, i.e., a tiling of image space [1]. Image tiles are assigned to rendering tasks. Geometry is sent to a rendering task if that geometry would fall within the tile assigned to that renderer. The geometry must be projected from (3D) world coordinates into (2D) image space before this distribution can be made, as the tile assignment is not known a priori. To avoid transforming all geometry, bounding boxes for sections of geometry are computed in world coordinates. The bounding boxes are then projected into image space, and the corresponding geometry is sent to the appropriate rendering task as determined by the bounds. For a fixed image space resolution, increasing the number of image tiles decreases the average tile size, and thus more tiles are likely to be covered by any given bounding box. Since all geometry in a box must be sent to all rendering tasks controlling all tiles mapped by the box, this can lead to increased network transmission. This overlap problem is a classical hindrance to the scalability of sort-first systems. Recently, there has been an increased interest in parallel, real-time rendering using clusters of commodity, offthe-shelf components [2, 3]. In a sort-first configuration, one set of nodes (PCs) is usually assigned to transforming and sorting geometry. This set is connected, in turn, to other sets of nodes with graphics hardware that perform the rendering. Several options are available for downstream 2 Motivation In May of this year, we installed a 265-node distributed rendering system in Clemson s W. M. Keck Visualization Laboratory. Each node has a 1.6GHz Pentium IV CPU, 58GB disk, and dual Ethernet NICs. The 240 geometry/rendering nodes each have 512MB main memory, an Nvidia GeForce4 TI 4400 graphics card, and dual 100Mb NICs. The 24 display nodes, which drive a 6x4 projector array, each have 1GB main memory, a Matrox G450 graphics card, a Gigabit NIC, and a 100Mb NIC. All nodes are connected via a dedicated Gigabit Ethernet switch. We are interested in examining the network performance of the connections between the geometry sorting nodes and the rendering nodes. For dynamic displays, the overlap of the geometry and the image tiling is constantly changing, and it is important to understand this rich and dynamic behavior, as this is usually the limiting factor in the performance of visualization clusters [6]. The communication between the sorting nodes and the rendering nodes must be synchronous, at least at an image frame level. After sorting and sending all of the geometry for one frame, the sorting node must wait for all of the data to be received by the rendering node(s) before transmitting the next frame of geometry. Failure to synchronize can cause large backlogs of geometry to be queued on the rendering nodes. For interactive applications, this can yield large latencies when responding to user requests. An un-

2 synchronized connection could suffice for non-interactive, non-dynamic applications. We suggest that the order with which geometry is drawn is important for the performance of the sort-torender network. Consider the case in which two objects of complex geometry are each mapped to a separate tile with no overlap. From an application-level standpoint, the natural approach to drawing any scene, including this one, has an object level granularity. That is, the first object is drawn completely, and then the second object is drawn. This can yield an unfortunate interleaving of the connections to which geometry is sent. In the case at hand, a large amount of data is sent to the first rendering node while the second rendering node sits idle. When the sorting of the geometry in the first object completes, the situation reverses and the first rendering node is idled while the second rendering node is swamped. Geometry interleaving has two important and related effects on network performance. First, the rendering nodes should always have data queued in their receive buffers when they are ready to render. Starving a rendering node delays the end of the frame synchronization, which in turn decreases the performance of the system. Second, a poor interleaving can cause large backlogs of data in the send queues on the sorting node(s). This backlog must be fully drained prior to the synchronization step. The larger the backlog, the longer the synchronization will take to complete. To see the effect that geometry interleaving has on performance in an asymmetric network, we conducted a simple experiment using Chromium on nodes in the Keck system. Image space was divided into 6 tiles, each a horizontal strip running the width of the image. Geometry consisted of a set of points in the plane. The points were divided into a 10 x 10 array of bins, where the axes of the array were aligned with the image. We then walked through the array of bins, sending all of the geometry for each bin and flushing the sorting node after each bin. In the first test, we walked through the array in column-major order, and in the second, row-major order. Figure 1 illustrates the tile and geometry layout. The sorting node had a 1000 Mbps TCP/IP connection, and the rendering nodes each had 100 Mbps TCP/IP connections. Figure 2 shows the data received by one of the rendering nodes as a function of elapsed time. Similar results were found for the other rendering nodes in the system. Note the stair-step effect that occurs in the row-major walk. This is a result of biasing the distribution of data to one tile at a time. The column-major walk is much closer to an equal distribution. Note that it starts by sending the first bin of data to the first tile, then it sends a second bin of data that overlaps the first two tiles, then it sends a bin of data to the second tile, and so on. This approach keeps all the send queues for all connections relatively balanced and avoids the starvation that is seen in the row-major walk. Note that the stalling in the row-major Figure 1. Tiling setup for the first experiments walk is not due to processing by the sorting node, as this would yield large stalls in the column-major walk as well. The round-robin (Reordered) scheduling shown in the figure will be explained in section 3. We also recorded the total bytes in the send queues at the time of the post-frame synchronization under each transmission scheme. These are shown in Figure 3, where we can see that a large backlog occurs in the row-major walk, but not in the column major walk. The round-robin (Reordered) scheduling shown in the figure will be explained in section 3. 3 Out of Order Rendering From this experiment, we see that it is advantageous to sort geometry so that sequential sections of geometry are sent to different tiles. To achieve this advantage, the user application could be altered so that it draws small portions of objects in an interleaved fashion. This would solve the problem if the mapping of the geometry to the tiles were known to the application. For the general application, this is impossible. Simply interleaving sections of the objects will not guarantee a balanced interleaving over the tiles. We suggest that a two part solution to this problem will be required. The first part is a decomposition of the objects in the scene into small sections or chunks. The desired granularity of the chunks will depend upon the scene and system. Recent, distributed OpenGL systems [5, 6] have operated by intercepting calls with a fake dynamic library and re-routing them as the setup demands. Such an approach would be very difficult to use with this decomposition, as it would require retaining a considerable

3 Figure 2. Rendering node throughput (static points) Figure 3. Send queue data at synchronization (static points)

4 amount of state data prior to performing the decomposition. Samanta et al [7] propose using such a retained mode decomposition, but such is too restrictive for our purposes. We need a view-independent decomposition that can be performed as a preprocessing step for static geometry. We are investigating automated decomposition, but for now we perform this step on an application-specific basis. The second part of the solution requires scheduling the order in which the chunks should be sorted and transmitted for rendering. If we assume a fair amount of temporal coherence in our dynamic displays, we can use the overlap information from the previous frame to schedule the chunks for the next frame. 3.1 Tracking In order for the system to keep track of the chunks of geometry, an identifier is needed for each as well as a method of querying the scheduler for the next chunk of geometry to be scheduled. We propose two new OpenGL functions to fill this void. The first, glgentokens(), defines a set of identifiers. The second, glnexttoken(), returns the identifier of the next chunk to be drawn. An explicit scheduling call is not necessary, as it can be done at the time of Swap- Buffers(). Below is a sample rendering loop. /* generate a set of 100 chunk ids */ glgentokens(1,100); /* schedule and render */ while(1){ int next; glnexttoken(1,&next); if(next == -1) break; render_chunk(next); } SwapBuffers(); The function render chunk() is responsible for issuing the commands to draw a given chunk of the scene geometry. 3.2 Scheduling The goal of the scheduler is to provide a good, i.e., temporally balanced, interleaving of geometry to the various rendering nodes. Our scheduler operates in a round-robin manner. We maintain N + 1 lists of identifiers for a system with N tiles. The extra list is used for geometry that falls outside the viewport. In the case where geometry overlaps multiple tiles, a record is stored in each of the (covered) tile lists. The scheduler removes a chunk identifier from the first list and removes any instances of the same identifier from other lists. The chosen geometry chunk identifier is the next value returned by glnexttoken(). This procedure continues in a round-robin fashion until all lists are exhausted. This provides an interleaving but not a perfect balance. It ignores cases where a geometry chunk overlaps multiple tiles. If an identifier is at the head of one list and also appears in the neighboring list, the neighboring list, which is likely to receive the overlapping identifier when this chunk is rendered, probably should be skipped in this scheduling pass. We also note that our testing procedure assumes a uniform granularity of geometry sections. If this is not the case, the lists can be sorted by size before scheduling. 4 Measurement We repeated the experiment of Figure 1 with the roundrobin reordering system, and the resulting throughput is shown as the Reordered line in Figure 2. Note that the throughput is approximately the same as that for the column-major walk. This is to be expected, as the roundrobin scheduling should produce an order similar to the column-major walk. Near the beginning of the transfer, the reordered walk appears to behave similarly to the rowmajor walk, but, as time progresses, the behavior aligns itself with that of the column-major walk. At this stage, it would appear that there is no benefit in reordering. Similar results were found when measuring the amount of data queued in send buffers on the sorting node, as seen in Figure 3. The round-robin scheduling yielded the minimum amount of data queued, but the advantage over the columnmajor walk was not large. In real visualization applications, it is likely that the geometry will be moving with respect to the image plane. To account for this in our test program, we performed a second experiment in which the points were rotated a few degrees every frame. The results are shown in Figure 4. In this case, the row-major and column-major walks oscillate as the bins line up with the image tiles. The round-robin reordered walk displays little of this behavior and maintains a fairly constant, higher throughput. Looking at the send queue data at synchronization instants, Figure 5, we see similar results. There are occasional steps in the roundrobin walk, likely due to multiple overlaps, but it is clear that reordering is advantageous. For a less contrived test, we used our reordering technique on a cloud of 70,000 points scanned from the recently salvaged civil-war submarine, the H. L. Hunley [8]. The complete Hunley scan is shown in Figure 6. This data set has a significantly different spatial distribution from the data set used in the previous tests. Figure 7 shows the data queued at synchronization instants as the point cloud slowly rotates. We can see that while the difference is somewhat less than in the previous experiments, the reordering remains beneficial.

5 Figure 4. Render node throughput with a rotating set of points Figure 5. Data in send queues at synchronization for a rotating set of points

6 5 Discussion While reordering data has benefits, it is not without potential drawbacks. The first difficulty is the code structure demand placed on the user application. It is not easy to take existing OpenGL code and reorder the geometry without inducing unwanted modifications in the resulting image(s). Automated decomposition of the geometry using spatially non-uniform octrees (KD trees) should be a productive avenue for exploration. It may be possible to take a sequential geometry drawing loop and preprocess it into sections, perhaps with the aid of a profiler. Another potential problem is thrashing. There may be large changes in (OpenGL) state that must take place before a section can be sorted and transmitted for rendering. Perhaps even more serious is texture thrashing. Reordering may destroy texture coherence, requiring frequent uploading of textures to the graphics hardware. This is part of the motivation for dividing geometry sections into sets. To attempt to minimize this thrashing, sets with like state and textures can be grouped together. Depending on the spatial locality of the sections, this may increase rendering speed, at some expense in throughput. Unfortunately, since network throughput is generally the bottleneck [6], this may not cause any improvement in overall performance. In our scheduling scheme, there are no provisions for multiple tiles that map to the same rendering node. This problem is easily solved by concatenating hit lists for tiles for which this occurs. Transparency is also troublesome. Traditionally, rendering transparent objects requires control over the order with which objects are drawn [4]. One potential solution would be to allow only part of the scene to be reordered by inserting sequential drawing instructions between sets of reorderable geometry. Another potential solution would be to use a different approach to rendering transparent objects. Everitt describes an order-independent method for rendering transparent objects [9]. This would allow for all objects in the scene to be reorderable, but this solution resembles a sort-last parallel system more than a sort-first system. Finally, while it is relatively straightforward to decompose static geometry into chunks, it may be more difficult to maintain a good decomposition for dynamic data. 6 Conclusions and Future Work We have suggested a technique for workload balancing on distributed visualization clusters. The round-robin scheduling of partial object geometry chunks avoids rendering node starvation and balances the data backlog in the send buffers of the geometry nodes. Because frame synchronization requires that all send buffers drain, balancing these send buffer backlogs is important to frame rate perfor- Figure 6. Complete scan data from H. L. Hunley mance for dynamic displays. An automated geometry decomposition system could be quite useful in conjunction with our scheduler. Decomposition levels might be controlled by feedback from previous frame rendering performance. A scheduling algorithm that takes into account the interleaving of geometry chunks that overlap tiles may provide further improved throughput. This may become more apparent when using scenes with high overlap. Nevertheless, load balancing through re-tiling of the image strives to decrease overlap [3], and, if overlap is kept to a minimum, improving our scheduler to better handle these cases may yield little performance improvement. In some cases, an intercept-and-stream model of distribution [5, 6] may not be optimal. Take, for example, the case of normal vectors. With many surface representations, normals can be recovered from the local surface properties. We could therefore reduce the bandwidth necessary to distribute the surface if the normal computation were deferred until after transmission. In a stream processing environment such as Chromium, this would be possible by inserting a processing element into the processing graph after the transmission stage. Unfortunately, as geometry is distributed, topological information is generally lost. While some of this information can be recovered with additional processing, this may introduce unwanted latency or retransmission. A system with partial data replication on the rendering nodes may offer substantial improvements. Each chunk identifier could be associated with a set of memory pages. From these pages, the stream of GPU commands could be generated. If a chunk sent to a rendering node causes a fault on its required set of pages, the rendering node could simply enqueue the identifier and proceed with execution

7 Figure 7. Data in send queues at synchronization instants while rotating the H. L. Hunley data set of the next ready chunk. A similar enqueue-and-move-on strategy could be employed in a distributed texture system, where the tracking of chunks could be used for preloading of the textures. 7 Acknowledgments This work was supported in part by the ERC Program of the U.S. National Science Foundation under award EEC , the ITR Program of the National Science Foundation under award ACI , and a grant from the W. M. Keck Foundation. References [1] C. Mueller. The Sort-First Rendering Architecture for High-Performance Graphics. In Proc. ACM SIG- GRAPH Symposium on Interactive 3-D Graphics, pages 75 84, [2] G. Humphreys and P. Hanrahan. A Distributed Graphics System for Large Tiled Displays. In Proc. IEEE Visualization, pages , [3] R. Samanta, J. Zheng, T. Funkhouser, K. Li, and J. P. Singh. Load Balancing for Multi-Projector Rendering Systems. In Proc. SIGGRAPH/Eurographics Workshop on Graphics Hardware, pages , [4] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL Programming Guide. Addison-Wesley, 3rd edition, [5] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. Kirchner, and J. T. Klosowski. Chromium: A Stream Processing Framework for Interactive Rendering on Clusters of Workstations. ACM Trans. on Graphics (Proc. SIGGRAPH 2002), 21(3): , July [6] G. Humphreys, M. Eldridge, I. Buck, G. Stoll, M. Everett, and P. Hanrahan. WireGL: A Scalable Graphics System for Clusters. In Computer Graphics (SIG- GRAPH 2001 Proceedings), pages , [7] R. Samanta, T. Funkhouser, K. Li, and J. P. Singh. Hybrid Sort-First and Sort-Last Parallel Rendering with a Cluster of PCs. In Proc. SIGGRAPH/Eurographics Workshop on Graphics Hardware, pages , [8] G. Oeland and I. Block. The H.L. Hunley. National Geographic, pages , July [9] C. Everitt. Interactive order-independent transparency. Technical report, Nvidia, developer.nvidia.com.