The field of the present invention pertains to computer implemented graphics. More particularly, the present invention relates to a system and method for implementing variable texture replication in a graphics subsystem.
Computer graphics are being used today to perform a wide variety of tasks. Many different areas of business, industry, government, education, entertainment, and most recently, the home, are tapping into the enormous and rapidly growing list of applications developed for today's increasingly powerful computer devices.
Graphics have also become a key technology for communicating ideas, data, and trends in most areas of commerce, science, and education. Modern graphics workstations often implement real time user interaction with three dimensional (3D) models and pseudo-realistic images. These workstations typically contain dedicated, special purpose graphics hardware. The progress of semiconductor fabrication technology has made it possible to do real time 3D animation, with color shaded images of complex objects, described by thousands of polygons, on powerful dedicated rendering subsystems. The most recent and most powerful workstations are capable of rendering completely life-like, realistically lighted, 3D objects and structures.
In a typical 3D computer generated object, the surfaces of the 3D object are described by data models. These data models store "primitives" (usually mathematically described polygons and polyhedra) that define the shape of the object, the object attributes, and the connectivity and positioning data describing how the objects fit together. The component polygons and polyhedra connect at common edges defined in terms of common vertices and enclosed volumes. The polygons are textured, Z-buffered, and shaded onto an array of pixels, creating a realistic 3D image.
In a typical graphics computer, most of the actual rendering computation is performed by a graphics subsystem included in the graphics computer. The 3D object data models are "traversed" by a software program (e.g., in response to user input) running on one or more processors in a processor subsystem within the graphics computer. The primitives describing the 3D object are processed by the processor subsystem and sent to the graphics subsystem for rendering. For example, a 3D polyhedral model of an object is sent to the graphics subsystem as contiguous strips of polygons within a graphics data stream (e.g., primitives, rendering commands, instructions, etc.). This graphics data stream provides the graphics subsystem with all the information required to render the 3D object and the resulting scene. Such information includes, for example, specular highlighting, anti-aliasing, depth, transparency, and the like. Using this information, the graphics subsystem performs all the computational processing required to realistically render the 3D object. The hardware of the graphics subsystem is specially tuned to perform such processing quickly and efficiently in comparison to the processor subsystem.
Texture mapping is an important part of the 3D rendering process. In order to portray a more realistic real-world representation, texture mapping is usually applied to the 3D objects of the scene during rendering. Texture mapping refers to techniques for using multi-dimensional (e.g., 2D, 3D, etc.) texture images, or texture maps, for adding surface details to areas or surfaces of these 3D graphical objects. For example, given a featureless solid cube and a texture map defining a wood grain pattern, texture mapping techniques may be used to map the wood grain pattern onto the cube. The resulting image is that of a cube that appears to be made of wood. In another example, vegetation and trees can be added by texture mapping to an otherwise barren terrain model in order to portray a landscape filled with vegetation and trees.
Texture mapping is typically implemented during rasterization steps of the rendering process. For example, during rasterization, a texture element, or texel, is generated from a stored texture image (e.g., within a texture memory) and applied to each fragment of a particular surface. The individual texels represent the color of the texture image to be applied to respective corresponding fragments. A texture mapping process maps a portion of the specified texture image onto each primitive. Texture mapping is accomplished by using the color of the texture image at the corresponding location, for example, to overwrite or modify the fragment's RGBA (Red, Green, Blue, Alpha) color.
The performance of the texture mapping process is highly dependent upon the performance of the underlying hardware. High performance texture mapping requires high power, high bandwidth rendering hardware within the graphics subsystems. One technique for accomplishing this is "pipelining" the graphics subsystem.
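The overwrite-or-modify step described above can be illustrated with a minimal sketch. The function names `sample_nearest` and `apply_texel`, the nearest-texel lookup, and the "replace" and "modulate" mode names are assumptions made for illustration only; the text does not prescribe a particular filtering or blending method.

```python
def sample_nearest(texture, s, t):
    """Return the texel nearest to normalized coordinates (s, t) in [0, 1].

    `texture` is a list of rows, each row a list of RGBA tuples.
    Nearest-texel lookup is an assumption; real hardware usually filters.
    """
    height = len(texture)
    width = len(texture[0])
    x = min(int(s * width), width - 1)
    y = min(int(t * height), height - 1)
    return texture[y][x]


def apply_texel(fragment_rgba, texel_rgba, mode="replace"):
    """Combine a texel with a fragment color (hypothetical mode names)."""
    if mode == "replace":
        # Overwrite the fragment color with the texel color.
        return texel_rgba
    if mode == "modulate":
        # Modify the fragment color by multiplying channel by channel.
        return tuple(f * t for f, t in zip(fragment_rgba, texel_rgba))
    raise ValueError("unknown mode: " + mode)


# A 2x2 texture: red, green on the first row; blue, white on the second.
texture = [
    [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0)],
    [(0.0, 0.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0)],
]
fragment = (0.5, 0.5, 0.5, 1.0)

texel = sample_nearest(texture, 0.75, 0.25)            # the green texel
replaced = apply_texel(fragment, texel, "replace")     # fragment becomes green
modulated = apply_texel(fragment, texel, "modulate")   # fragment becomes dark green
```

In the "replace" case the fragment simply takes the texel color; in the "modulate" case the fragment's existing shading is preserved and tinted by the texture.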
In a pipelined architecture, the graphics subsystem is configured as a series of interconnected stages used to render an image. Each stage performs a unique task during each clock cycle. For example, one stage might be used to scan-convert a pixel, a subsequent stage may be used for color conversion, another stage could be used to perform depth comparisons, followed by a texture stage for texturing, and so on. The advantage of using a pipelined architecture is that as soon as one stage has completed its task on a pixel, that stage can immediately proceed to work on the next pixel without having to wait for the processing of a prior pixel to complete. Accordingly, pixels flow through the pipeline at a rapid rate.
However, one drawback of a pipelined architecture is that since each stage performs a unique function, the stages are typically constructed from specialized circuit designs. And even though a single pipeline architecture often entails the use of hundreds of such stages, there still exists a finite limit to the speed at which graphics data can proceed through the pipeline.
A more modern architecture involves the use of parallel rendering hardware within the graphics subsystem. To increase performance (e.g., texture mapping speed), multiple rendering components (i.e., multiple sub-pipelines) are implemented to process graphics data in parallel, raising the aggregate speed of the graphics subsystem. Such a parallel processing environment allows the rendering process to be apportioned among a series of parallel rendering components to achieve a much faster peak performance than possible in the more conventional pipelined architecture. Hence, the most modern graphics subsystem architectures are implemented as parallel processing environments.
With respect to high performance texture mapping, in such a parallel processing environment, a graphics subsystem would typically include multiple parallel "geometry engines" coupled to multiple parallel raster engines. Each geometry engine performs geometry processing on, for example, a specific portion of an image, and sends the resulting graphics data to a corresponding raster engine for fragment processing (e.g., texture mapping, antialiasing, rasterization, etc.). The texture mapping processing is performed in parallel.
However, parallel processing within the graphics subsystem leads to other types of problems. To perform texture mapping in such a parallel environment, each raster engine may need to maintain a copy of the texture (e.g., texels of a texture image). This is required in order to ensure the raster engine is not starved for texture data, as, for example, in a case where raster engines contend for access to a texture map stored in a single shared memory.
One solution is to have large texture memories for every raster engine to accommodate very large textures. The problem with this solution is that it is very expensive. A high performance parallel rendering subsystem requires large texture memories for each raster engine, so that each raster engine has fast access to its own complete copy of the texture image. This implementation is referred to as "statically replicated" or "fully replicated" textures, which refers to keeping a complete copy of the entire texture image with each raster engine's texture memory.
The problem with statically replicated textures is that they waste memory. The larger the texture (e.g., size of the array of texels), the larger the dedicated texture memory for each raster engine needs to be. Hence, to efficiently handle applications requiring large textures, the dedicated texture memory needs to be appropriately sized. However, for those applications using medium or small textures, the large texture memories are mostly wasted.
One solution to this problem involves the partitioning of a large texture into respective portions and storing these portions into corresponding raster engines (e.g., the dedicated texture memory coupled thereto). This solution is often referred to as "fully apportioned" texture storage. In such an architecture, the raster engines access their respective coupled texture memory for texture data supporting their respective portion of the texture mapping process. However, if the texture is very large, raster engines usually have to access texels stored in other raster engines' texture memories. This can lead to large amounts of bus traffic.
An important factor determining the amount of bandwidth consumed by texture transactions is texel size. Texel size refers to the number of bits required to represent each texture element in a texture. The texel size is determined by the needs of graphics applications, such as greyscale vs. color, precision needed, tolerability of compression, etc. The smallest texels can be generated using compressed texture algorithms (e.g., 4-bit texels); large texels can be 64 bits or larger.
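The effect of texel size on storage, and therefore on the data moved per fetch, can be seen with simple arithmetic. The following is a minimal sketch; the function name `texture_bytes` and the 1024 by 1024 texture dimensions are assumptions chosen for illustration, and mipmaps and padding are ignored.

```python
def texture_bytes(width, height, bits_per_texel):
    """Storage for a 2D texture in bytes (no mipmaps, no padding)."""
    return width * height * bits_per_texel // 8


# The same 1024 x 1024 texel array at the two texel sizes named in the text.
compressed = texture_bytes(1024, 1024, 4)   # 4-bit texels:  524288 bytes (0.5 MiB)
large = texture_bytes(1024, 1024, 64)       # 64-bit texels: 8388608 bytes (8 MiB)

# Every texel fetched across the interconnect costs 16 times more data
# with 64-bit texels than with 4-bit texels.
ratio = large // compressed
```

For an identical number of texel fetches, the 64-bit texture therefore places sixteen times the load on the interconnect between raster engines.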
Multiple texture fetches over the bus/interconnect coupling the raster engines can consume excessive amounts of bandwidth. The problem is worse with large texels. Hence, large textures most suited to use with high performance graphics subsystems typically cause the most bus traffic amongst the parallel raster engines. The large number of fetches can saturate the bus. Even in those architectures which implement cross bar switching for increased data transfer bandwidth, the large number of fetches (especially with uncompressed textures) can saturate the cross bar network. Thus, the more saturated the network becomes, the slower the performance.
Thus, what is required is a method for efficiently handling large textures (e.g., textures with large texels of 64 bits) in a parallel processing environment. What is required is a high performance method of supplying texture data to multiple parallel raster engines that does not incur the cost penalties of a full statically replicated texture environment. The required solution should further provide the efficiency of apportioned texture storage amongst parallel raster engines without incurring the data transfer saturation penalties of texture mapping with large textures. Additionally, because texel size requirements vary from application to application, the solution should be configurable to strike the proper balance between performance and memory utilization. The present invention provides a novel solution to the above requirements.
The present invention is a method and system for variable texture replication in a parallel graphics subsystem. The method and system of the present invention provide a method for efficiently handling large textures (e.g., textures with large texels of 64 bits) in a parallel processing environment. The variable texture replication process of the present invention provides a high performance method of supplying texture data to multiple parallel raster engines that does not incur the cost penalties of a full statically replicated texture environment. In addition, the present invention retains the efficiency aspects of apportioned texture storage amongst parallel raster engines and does not incur the data transfer saturation penalties of texture mapping with large textures/texels.
In one embodiment, the present invention is implemented as an adjustable texture replication process within a parallel processing environment of a graphics subsystem. The graphics subsystem performs the rendering processing for a digital computer system (e.g., a graphics workstation). The process is implemented within the graphics subsystem, and includes the step of configuring a plurality of raster engines (e.g., four parallel raster engines) into at least a first cluster and a second cluster (although a larger number of clusters can be implemented in more highly parallel environments). The raster engines of the first cluster and the raster engines of the second cluster are each communicatively coupled to respective texture memories. A first texture image copy is stored among the texture memories of the first cluster such that each respective texture memory stores a respective portion of the first texture image copy. A second texture image copy is stored among the texture memories of the second cluster such that each respective texture memory stores a respective portion of the second texture image copy. A parallel texture mapping process is performed on a surface using the first cluster and the second cluster. The first cluster texture maps the first texture image copy, wherein the plurality of raster engines of the first cluster share access to each respective texture memory storing the first texture image copy. The second cluster texture maps the second texture image copy, wherein the plurality of raster engines of the second cluster share access to each respective texture memory storing the second texture image copy.
In this manner, most of the communications traffic between raster engines occurs "within cluster", meaning that fetches of texture data occur amongst raster engines in the same cluster since each cluster stores a complete texture image copy. This aspect greatly reduces the aggregate amount of communications traffic in comparison to that of a prior art "fully apportioned" texture storage scheme. Additionally, a complete texture image copy remains readily available within cluster, without requiring the memory hardware expense associated with prior art "full replication" schemes.
In accordance with the present invention, the number of raster engines included in the first cluster and the number of raster engines included in the second cluster, and the number of clusters themselves, are adjustable to implement variable texture replication. For example, each additional raster engine included in a cluster causes an additional "per-engine" apportionment of the texture image copy, such that each engine in the cluster stores a respective portion of the texture image copy. In addition to increasing or decreasing the number of raster engines included in the first and second clusters, the plurality of raster engines of the graphics subsystem can be further divided into a larger number of clusters, such as, for example, dividing an eight way parallel subsystem into four clusters of two, or dividing a sixteen way parallel subsystem into eight clusters of two or alternatively two clusters of eight. In each case, a complete copy of the texture image is maintained within each cluster (e.g., apportioned among the raster engines of the cluster). This division of the rasterization hardware can be specified by an application in such a way as to best meet that application's needs.
In most graphics systems capable of hardware-accelerated texture mapping, multiple different texture maps may be stored in the texture memory by the application. In a graphics system implementing variable texture replication, not all of the texture memory need be allocated to a single cluster. The addressable texture memory may be divided up into one or more segments, each allocated to a different supercluster (a supercluster being the set of all rasterizers, divided up into a particular clustering topology). For example, if the available texture memory is divided into two parts, the first part would map into a fully apportioned supercluster (i.e., textures stored in the first memory segment are divided among all rasterizers), and the second part would map to a half-and-half supercluster (i.e., two copies of each texture). This added flexibility allows for finer configuration of the balance between performance and memory usage, as not all textures used by a particular application will have the same texel size.
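The clustering described in this embodiment can be sketched as follows. This is an illustrative model only: the function names `build_clusters` and `owner_engine`, the division of the texture into tiles, and the round-robin assignment of tiles to the engines of a cluster are all assumptions, since the text does not specify how a texture image copy is apportioned within a cluster.

```python
def build_clusters(num_engines, engines_per_cluster):
    """Group raster engine ids 0..num_engines-1 into equal-sized clusters."""
    if num_engines % engines_per_cluster != 0:
        raise ValueError("engines must divide evenly into clusters")
    return [
        list(range(start, start + engines_per_cluster))
        for start in range(0, num_engines, engines_per_cluster)
    ]


def owner_engine(requesting_engine, tile_x, tile_y,
                 engines_per_cluster, tiles_per_row):
    """Return the engine, in the requester's own cluster, holding a tile.

    Assumes (hypothetically) that tiles are dealt round-robin to the
    texture memories of the engines in each cluster.
    """
    cluster_index = requesting_engine // engines_per_cluster
    tile_index = tile_y * tiles_per_row + tile_x
    slot = tile_index % engines_per_cluster
    return cluster_index * engines_per_cluster + slot


# Four parallel raster engines configured as a first and a second cluster.
clusters = build_clusters(4, 2)   # [[0, 1], [2, 3]]

# Engine 0 (first cluster) and engine 3 (second cluster) fetch the same tile.
# Each request is served from the requester's own cluster.
owner_for_0 = owner_engine(0, 3, 0, engines_per_cluster=2, tiles_per_row=8)  # 1
owner_for_3 = owner_engine(3, 3, 0, engines_per_cluster=2, tiles_per_row=8)  # 3
```

Under these assumptions, a fetch never crosses a cluster boundary, because each cluster holds its own complete copy of the texture image.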
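The trade-off between cluster size and memory use can be made concrete with a small calculation. The sketch below is illustrative: the function name `replication_summary`, the 64 MiB texture size, and the assumption that texel accesses are spread uniformly over the texture (used for the local-fetch fraction) are not taken from the text.

```python
def replication_summary(num_engines, engines_per_cluster, texture_mib):
    """Summarize memory cost for one clustering of the raster engines."""
    if num_engines % engines_per_cluster != 0:
        raise ValueError("engines must divide evenly into clusters")
    num_clusters = num_engines // engines_per_cluster
    return {
        "clusters": num_clusters,
        # One complete texture image copy is kept per cluster.
        "copies": num_clusters,
        # Each engine stores its portion of its cluster's copy.
        "per_engine_mib": texture_mib / engines_per_cluster,
        "total_mib": texture_mib * num_clusters,
        # Assumes uniform access: share of fetches served from the
        # requesting engine's own texture memory.
        "local_fraction": 1 / engines_per_cluster,
    }


# An eight way parallel subsystem and a 64 MiB texture (assumed size).
fully_replicated = replication_summary(8, 1, 64)    # eight clusters of one
four_of_two = replication_summary(8, 2, 64)         # four clusters of two
fully_apportioned = replication_summary(8, 8, 64)   # one cluster of eight
```

A cluster size of one corresponds to full replication (64 MiB per engine, 512 MiB in total), a single cluster of eight corresponds to fully apportioned storage (8 MiB per engine, 64 MiB in total), and four clusters of two fall in between (32 MiB per engine, 256 MiB in total).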
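The two-segment example above can be sketched as a capacity calculation. The `Segment` class, the function `unique_capacity_mib`, the choice of four rasterizers, and the 64 MiB of texture memory per rasterizer are hypothetical values introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    fraction: float  # share of every rasterizer's texture memory
    copies: int      # texture copies kept, i.e. clusters in the supercluster


def unique_capacity_mib(segment, num_rasterizers, per_rasterizer_mib):
    """Distinct texture data (in MiB) that fits in a segment.

    Raw capacity is the segment's share of all texture memories;
    storing several copies divides the usable capacity accordingly.
    """
    raw = segment.fraction * per_rasterizer_mib * num_rasterizers
    return raw / segment.copies


# Texture memory split into two equal parts across four rasterizers.
apportioned = Segment("fully apportioned", fraction=0.5, copies=1)
half_and_half = Segment("half-and-half", fraction=0.5, copies=2)

cap_apportioned = unique_capacity_mib(apportioned, 4, 64)      # 128.0 MiB
cap_half_and_half = unique_capacity_mib(half_and_half, 4, 64)  # 64.0 MiB
```

Under these assumptions an application could place textures with small texels in the first segment, where capacity is greatest, and textures with large texels in the second, where fetches stay within a smaller cluster.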