1. Field of the Invention
Embodiments of the present invention generally relate to mapping virtual memory pages to physical memory pages in a local memory system based on dynamic random access memory and, more specifically, to memory addressing controlled by page table entry fields.
2. Description of the Related Art
Conventional graphics processing units use off-chip memory to store different kinds of data, such as image data and texture map data. Because the memory is off-chip, accessing the memory is inherently slower than accessing or processing on-chip data, thereby creating a performance bottleneck. The data stored within the external memory is used to generate and display sequential frames of graphics images, and any input/output bottleneck associated with accessing the external memory therefore impacts the overall performance of the graphics processing unit. The off-chip memory is typically constructed using multiple DRAM (dynamic random access memory) devices. Sequential bytes in memory are commonly interleaved across the multiple DRAM devices to enable highly efficient access to multiples of a fixed quantity of data. Each such memory access usually includes at least one complete interleave of data across the DRAM devices. This data interleave strategy enables a graphics processing unit to access the locally attached DRAM devices in parallel and potentially achieve a peak memory bandwidth corresponding to the combined read or write bandwidth of all of the locally attached DRAM devices.
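By way of illustration only, the interleaving of sequential bytes across DRAM devices may be sketched as follows. The function names, the device count of four, and the sixteen bytes contributed by each device per interleave are assumptions chosen for the example and are not taken from any particular memory system.

```python
def device_for_address(addr, num_devices=4, bytes_per_device=16):
    """Map a byte address to (device index, byte offset inside that device).

    Assumed layout: consecutive chunks of `bytes_per_device` bytes rotate
    across the devices, so one complete interleave spans
    num_devices * bytes_per_device sequential bytes.
    """
    chunk = addr // bytes_per_device        # index of the chunk holding addr
    device = chunk % num_devices            # chunks rotate across the devices
    local_chunk = chunk // num_devices      # chunk index inside that device
    offset = local_chunk * bytes_per_device + addr % bytes_per_device
    return device, offset


def devices_touched(start, length, num_devices=4, bytes_per_device=16):
    """Return the sorted device indices touched by a sequential access."""
    return sorted({
        device_for_address(a, num_devices, bytes_per_device)[0]
        for a in range(start, start + length)
    })
```

Under these assumptions, an access of 64 sequential bytes covers one complete interleave and therefore touches all four devices, which is the condition under which the devices can be accessed in parallel.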
The storage elements internal to a modern DRAM device are typically organized into two or more banks. Each bank includes a two-dimensional (2D) array of storage cells organized by row and column number. Each bank also includes a row buffer, which is used to temporarily store data that is read from or written to the array by an input/output interface in the DRAM device. Data within the array is accessed through the row buffer one complete row of data at a time. Thus, a read from the array causes one complete row to be read from the array and stored within the row buffer. A write to the array causes one complete row of data from the row buffer to be written to the array. Prior to any external access to the row of data within the array, a correspondence is first established between the row of data and the row buffer. Establishing this correspondence is commonly referred to as “activating” the row of data and includes the steps needed to transfer data from the row of data within the array into the row buffer. Subsequent reads or writes to the active row are implemented as reads or writes to selected bits within the row buffer. If a read or write is requested on a different, new row of data within the bank, then the current row of data being stored in the row buffer is first written back to the corresponding current row of data within the array. The new row of data is then activated by reading the new row of data within the array into the row buffer. Once activated, the new row of data may be accessed through the row buffer.
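The row buffer behavior described above may be modeled, purely as a sketch, by the following class. The class name `Bank`, its attributes, and the activation counter are hypothetical and exist only to make the sequence of write-back and activation steps concrete; timing is not modeled.

```python
class Bank:
    """Toy model of one DRAM bank: a 2D cell array plus a single row buffer."""

    def __init__(self, num_rows, row_bytes):
        self.array = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = bytearray(row_bytes)
        self.active_row = None      # no row is active initially
        self.activations = 0        # number of activate operations performed

    def _activate(self, row):
        """Establish the correspondence between `row` and the row buffer."""
        if self.active_row == row:
            return                  # row already active, nothing to do
        if self.active_row is not None:
            # Write the current row buffer contents back to the array first.
            self.array[self.active_row][:] = self.row_buffer
        # Read the complete new row from the array into the row buffer.
        self.row_buffer[:] = self.array[row]
        self.active_row = row
        self.activations += 1

    def read(self, row, col):
        self._activate(row)
        return self.row_buffer[col]   # reads are served from the row buffer

    def write(self, row, col, value):
        self._activate(row)
        self.row_buffer[col] = value  # writes land in the row buffer
```

In this model, repeated accesses to the active row cost no further activations, whereas alternating between two rows of the same bank forces a write-back and an activation on every access.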
The process of activating a row of data within a DRAM device takes time and incurs overhead. Fortunately, modern DRAM devices are able to process an activate command to one bank while simultaneously performing a pending read or write request on a different bank. In this way, the latency associated with an activate operation may be “hidden” within the latency of performing the pending access request to a different bank. However, if sequential requests access different rows within the same bank, then the activate operations become serialized, resulting in significantly diminished performance.
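The effect of hiding activation latency may be illustrated with a deliberately simplified timing model. The model below is an assumption made for illustration, not a description of any actual DRAM device: it assumes a single shared data bus, an idealized controller that issues each activate as soon as the target bank is idle, and arbitrary time units of 10 for an activation and 4 for a data transfer.

```python
def total_time(requests, t_activate=10, t_transfer=4):
    """Return the completion time of the last request.

    `requests` is a list of (bank, row) pairs served in order.
    Assumptions: one shared data bus; an activate may start as soon as its
    bank has finished its previous transfer, overlapping with transfers
    taking place on other banks.
    """
    open_row = {}    # bank -> row currently held in that bank's row buffer
    bank_free = {}   # bank -> time at which the bank finished its last transfer
    bus_free = 0     # time at which the shared data bus becomes free
    for bank, row in requests:
        ready = bank_free.get(bank, 0)
        if open_row.get(bank) != row:
            ready += t_activate          # a different row must be activated first
            open_row[bank] = row
        start = max(ready, bus_free)     # wait for both the bank and the bus
        bus_free = start + t_transfer
        bank_free[bank] = bus_free
    return bus_free


# Four requests alternating between two rows of the same bank.
same_bank = [(0, 0), (0, 1), (0, 0), (0, 1)]
# Four requests, each directed to a different bank.
spread = [(0, 0), (1, 0), (2, 0), (3, 0)]
```

With these assumed parameters, the same-bank sequence pays the full activation latency on every request, while the sequence spread across banks pays it only once, the remaining activations overlapping with earlier transfers.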
The ability to simultaneously access and activate different banks within a DRAM enables a second interleave strategy, referred to herein as “bank interleaving,” which may be employed by a graphics processing unit to improve memory efficiency by interleaving sequential groups of bytes across the different banks within a DRAM device. The typical group of bytes interleaved across banks is on the order of the number of bytes within the row buffer or a fraction thereof. Interleaving groups of bytes across multiple banks during sequential accesses decreases the number of bank activation operations that cannot be hidden, resulting in higher overall performance.
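A minimal sketch of bank interleaving follows. The mapping functions, the bank count of four, the 1024-byte row buffer, and the choice of a group size equal to the row buffer size are all assumptions for illustration. The helper `unhidden_activations` uses a simplified criterion, likewise an assumption: an activation is counted as not hidden when the immediately preceding request targeted the same bank.

```python
def interleaved_map(addr, num_banks=4, row_bytes=1024, group_bytes=1024):
    """Bank-interleaved mapping: sequential groups of bytes rotate across banks."""
    group, within = divmod(addr, group_bytes)
    bank = group % num_banks
    bank_group = group // num_banks            # group index inside the bank
    groups_per_row = row_bytes // group_bytes  # group may be a fraction of a row
    row, slot = divmod(bank_group, groups_per_row)
    return bank, row, slot * group_bytes + within


def linear_map(addr, rows_per_bank=64, row_bytes=1024):
    """Non-interleaved mapping: each bank holds one contiguous address range."""
    bank, within_bank = divmod(addr, rows_per_bank * row_bytes)
    row, col = divmod(within_bank, row_bytes)
    return bank, row, col


def unhidden_activations(addresses, mapper):
    """Count activations that directly follow a request to the same bank."""
    count = 0
    prev_bank = None
    open_row = {}
    for addr in addresses:
        bank, row, _ = mapper(addr)
        if open_row.get(bank) != row:
            if bank == prev_bank:
                count += 1      # no other bank's access is available to hide it
            open_row[bank] = row
        prev_bank = bank
    return count


# Sequential sweep over eight rows' worth of data in 256-byte steps.
sweep = range(0, 8 * 1024, 256)
```

For the sequential sweep shown, the non-interleaved mapping produces seven back-to-back row changes within a single bank, whereas the interleaved mapping produces none under the simplified criterion, because every row change occurs in a bank other than the one just accessed.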
The combined strategies of DRAM device and bank interleaving enable very efficient sequential access to DRAM memory for certain types of data. For example, image data that is sequentially refreshed to a display device is accessed with good efficiency most of the time using the two interleaving strategies. However, other kinds of data, such as texture map data, tend to be accessed predominantly in localized 2D regions that move in closed patterns within a larger 2D region. To accommodate data that exhibits 2D access locality, sequential bytes within memory are organized as 2D tiles within a 2D surface of a predefined size. Each tile includes sequential bytes that are interleaved across multiple DRAM devices and interleaved across parallel sets of banks within the DRAM devices. The tiles are laid out in memory in a block-linear organization to better accommodate 2D access locality. When the interleaving strategies described above are combined with the block-linear organization of a 2D surface within the DRAM devices, access “hot spots” naturally develop, whereby a disproportionate number of requests to the same bank within a given DRAM device are generated, leading to an increase in bank conflicts. Further, access hot spots can concentrate access requests within a reduced set of DRAM devices, undermining the efficiency of DRAM device interleaving (also referred to as “data interleaving”). These combined effects negatively impact overall system performance.
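For illustration, one possible block-linear addressing scheme is sketched below and contrasted with a conventional pitch-linear layout. The tile dimensions of 64 bytes by 16 rows, the function names, and the requirement that the surface width be a multiple of the tile width are assumptions of the example; actual tile sizes and orderings vary.

```python
def pitch_linear_offset(x, y, surface_width):
    """Byte offset of (x, y) when the surface is stored row after row."""
    return y * surface_width + x


def block_linear_offset(x, y, surface_width, tile_w=64, tile_h=16):
    """Byte offset of (x, y) when the surface is stored tile after tile.

    x and surface_width are in bytes, y is in rows. Assumes surface_width
    is a multiple of tile_w. Tiles are ordered left to right, then top to
    bottom; bytes inside a tile are stored row by row.
    """
    tiles_per_row = surface_width // tile_w
    tile_x, in_x = divmod(x, tile_w)
    tile_y, in_y = divmod(y, tile_h)
    tile_index = tile_y * tiles_per_row + tile_x
    return tile_index * tile_w * tile_h + in_y * tile_w + in_x
```

In this sketch, all bytes of a 64 by 16 region occupy one contiguous 1024-byte span, so a small 2D footprint that stays within a tile maps to a compact address range, which is the property that accommodates 2D access locality.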
For example, when the number of physical memory tiles on a given surface within the local memory is an integral multiple of the number of banks in the local memory DRAM devices, then vertically adjacent tiles are typically mapped to the same DRAM bank number. Bank conflicts arise along the vertical boundary between two vertically adjacent tiles when the access sequence involves alternately accessing data within the two tiles, as would be the case when performing bilinear sampling along a texture space trajectory that straddles the two tiles. While the kind of data and size of the 2D surface determine the vertical alignment of banks and potential layout within the surface, there is no efficient way for the graphics processing unit to exploit this information in order to avoid the performance degradation associated with memory access hot spots and bank conflicts.
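The alignment problem can be made concrete with a short sketch. It assumes, for illustration only, that consecutive tiles in block-linear order rotate across the banks one tile at a time and that the relevant tile count is the number of tiles spanning the width of the surface; both function names are hypothetical.

```python
def tile_bank(tile_x, tile_y, tiles_per_row, num_banks):
    """Bank holding a tile, assuming consecutive tiles rotate across banks."""
    return (tile_y * tiles_per_row + tile_x) % num_banks


def vertical_neighbors_conflict(tiles_per_row, num_banks):
    """True when every tile maps to the same bank as the tile directly below it."""
    return all(
        tile_bank(tx, 0, tiles_per_row, num_banks)
        == tile_bank(tx, 1, tiles_per_row, num_banks)
        for tx in range(tiles_per_row)
    )
```

Under these assumptions, a surface eight tiles wide with four banks places each tile and the tile below it in the same bank, so an access sequence that alternates between the two tiles repeatedly switches rows within one bank. A surface six tiles wide with the same four banks does not exhibit this alignment.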
As the foregoing illustrates, what is needed in the art is a technique for efficiently accessing the local memory of a graphics processing unit while avoiding memory access hot spots and bank conflicts.