The present invention relates in general to memory interface devices and in particular to a memory interface that dynamically selects among mirrored storage locations when stored data is being accessed.
Computer-based techniques for rendering and displaying animated images are known in the art. Typically, a three-dimensional (3-D) rendering process begins by modeling each object in the scene using vertices that define primitives (lines, triangles, and other simple shapes) corresponding approximately to the 3-D shape of the object. Each vertex has various attributes (color, surface normal, texture, etc.) representing the appearance of that portion of the object. Computations are performed to model the interaction of light with the objects. A viewpoint and view direction within the scene are specified, and coordinates of the various primitives in the scene are transformed into viewing coordinates and projected onto a viewing plane to determine which objects are visible at each of an array of pixels. Each pixel is shaded based on the object or objects visible at that location. The color value for each pixel is delivered to a display device, and the image appears. To achieve smooth animation, new images must be rendered and displayed at a rate of about 30 frames per second (or in some instances even faster).
Rendering realistic images at this rate requires substantial processing power. Typically, the processing power is supplied by a dedicated graphics processor that has a number of parallel processing cores optimized for performing the computations associated with rendering. Modern graphics processors are powerful enough to operate on hundreds of millions of vertices per second.
To operate at such rates, a graphics processor must be able to receive and store massive amounts of data at suitably high rates. In the course of rendering, the graphics processor requires access to vertex data describing objects to be rendered (including their attributes), to various data surfaces (i.e., data blocks that store a value of some quantity corresponding to each location in some coordinate space, such as textures, pixel buffers, depth buffers, stencil buffers, etc.), and to other data such as command sequences that control operations of the graphics processor. Typical rendering applications require many megabytes of data; for instance, just to store a pixel buffer of 1024×768 pixels using 32-bit color requires about 3 megabytes (MB), or 6 MB if the buffer is double buffered. Textures can be just as large or larger. These large quantities of data are generally stored in memory accessible to the graphics processor, such as dedicated graphics memory accessible via a “local” bus and/or main system memory, which is accessible to the graphics processor via a “system” bus. To render images at animation speed, data needs to be moved quickly between the processor and memory. Accordingly, graphics processors are designed to support very high input/output (I/O) rates, e.g., in excess of 30 gigabytes per second (GB/s).
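The buffer-size arithmetic above can be sketched in a few lines: one 32-bit pixel occupies 4 bytes, so a single 1024×768 buffer is 3 MB, and a double-buffered pair is 6 MB. The function name and the `num_buffers` parameter are illustrative only, not part of any actual graphics API.

```python
def buffer_size_mb(width, height, bits_per_pixel, num_buffers=1):
    """Storage required for a pixel buffer, in megabytes (1 MB = 2**20 bytes)."""
    bytes_total = width * height * (bits_per_pixel // 8) * num_buffers
    return bytes_total / (1024 * 1024)

print(buffer_size_mb(1024, 768, 32))                  # 3.0
print(buffer_size_mb(1024, 768, 32, num_buffers=2))   # 6.0
```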
In some systems, to optimize cost versus performance, it is desirable to exploit a combination of local and system memory to obtain acceptable performance. In one such configuration, the local graphics bus provides a throughput of about 5.8 GB/s, and a high-speed system bus, such as Peripheral Component Interconnect (PCI) Express (referred to herein as “PCI-E”), provides up to about 3 GB/s for memory read operations and about 2.3 GB/s for memory write operations. Thus, the performance of modern graphics processing systems is often limited by bus bandwidth rather than internal processing power.
To increase available bandwidth, some graphics subsystems use both local and system memory for data storage. By storing some data in local graphics memory while other data is stored in system memory, the combined bandwidth of both local and system buses (about 10 GB/s in one implementation) can be exploited. For example, suppose that in a system with the bus configuration described above, a graphics processor requires a color buffer, a depth buffer and a texture surface. Suppose further that to achieve a desired level of performance, the color buffer and the depth buffer each need to deliver data at 3 GB/s and the texture surface needs to deliver data at 4.5 GB/s. If the color and depth buffers are stored in local memory while the texture surface (which is read but not written by the graphics processor) is stored in system memory, then 4.5 GB/s of read data would be requested on the system bus while only 3 GB/s could be supplied. Thus, the system bus would deliver texture data at only about 67% of the desired rate, compromising performance. If, instead, the color and texture surfaces are stored in local memory while the depth buffer is stored in system memory, then the local bus would be overloaded, delivering data at about 77% of the desired rate and again compromising performance.
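The two placement scenarios above can be checked with a short calculation. The bandwidth and demand figures below are the ones assumed in the text (local bus 5.8 GB/s; system bus 3 GB/s for reads; color and depth buffers at 3 GB/s each; texture at 4.5 GB/s); the function name is illustrative only.

```python
LOCAL_BW = 5.8        # GB/s, local graphics bus throughput (from the text)
SYSTEM_READ_BW = 3.0  # GB/s, system bus read capacity (from the text)

def delivered_fraction(demand, capacity):
    """Fraction of the requested data rate a bus can actually deliver."""
    return min(1.0, capacity / demand)

# Scenario 1: color + depth local (3 + 3 GB/s), texture (4.5 GB/s) in system memory.
# The system bus becomes the bottleneck.
print(round(delivered_fraction(4.5, SYSTEM_READ_BW), 2))   # 0.67

# Scenario 2: color + texture local (3 + 4.5 GB/s), depth in system memory.
# Now the local bus is the bottleneck.
print(round(delivered_fraction(3.0 + 4.5, LOCAL_BW), 2))   # 0.77
```

Either placement leaves one bus saturated while capacity on the other goes unused, which is what motivates splitting a surface across the two memories.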
Another option is to split one or more of the surfaces between local and system memories. For instance, in the example above, with the color buffer in local memory and the depth buffer in system memory, the texture surface could be split, with two thirds of the texture data in local memory and the other third in system memory. Then, on average the local bus would be asked to carry 6 GB/s and would be just slightly overloaded, while the system bus would not be overloaded at all.
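The per-bus demand under the two-thirds/one-third split can be sketched as follows, using the same figures as the text (color buffer local, depth buffer in system memory, 4.5 GB/s of texture traffic split between them). The function name and parameters are illustrative; how the 4.5 GB/s of system-bus traffic divides between the separate read and write channels is glossed over here, as it is in the text.

```python
LOCAL_BW = 5.8  # GB/s, local graphics bus throughput (from the text)

def texture_split_loads(local_fraction, color=3.0, depth=3.0, texture=4.5):
    """Average per-bus demand (GB/s) when the texture surface is split.

    `local_fraction` of texture reads are served from local memory;
    the remainder, plus all depth-buffer traffic, goes to system memory.
    """
    local = color + local_fraction * texture
    system = depth + (1.0 - local_fraction) * texture
    return local, system

local, system = texture_split_loads(2.0 / 3.0)
print(round(local, 1), round(system, 1))   # 6.0 4.5
print(round(LOCAL_BW / local, 2))          # 0.97 -> local bus just slightly overloaded
```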
In practice, however, there are considerable fluctuations in the graphics processor's demand for data from one frame to the next and at different times during the rendering of each frame. For instance, different frames might require data from different parts of a texture surface, or different pixels within the frame might be mapped to different numbers of textures. Thus, even where the average bandwidth demand can be met, fluctuations can still saturate one or the other (or both) of the buses, causing delay in the rendering process, which stalls while waiting for needed data.
It would therefore be desirable to provide improved techniques for making efficient use of the available bus bandwidth under changing conditions.