1. Field of the Invention
The present invention relates generally to the field of processors, and specifically, to a method and apparatus for implementing non-temporal stores.
2. Background Information
The use of a cache memory with a processor is well known in the computer art. A primary purpose of utilizing cache memory is to bring the data closer to the processor in order for the processor to operate on that data. It is generally understood that memory devices closer to the processor operate faster than memory devices farther away on the data path from the processor. However, there is a cost trade-off in utilizing faster memory devices. The faster the data access, the higher the cost to store a bit of data. Accordingly, a cache memory tends to be much smaller in storage capacity than main memory, but is faster in accessing the data.
A computer system may utilize one or more levels of cache memory. The cache allocation and de-allocation schemes implemented in various known computer systems are generally similar in practice. That is, data that is required by the processor is cached in the cache memory (or memories). If a cache miss occurs, then an allocation is made at the entry indexed by the access. The access can be for loading data to the processor or storing data from the processor to memory. The cached information is retained by the cache memory until it is no longer needed, made invalid, or replaced by other data, in which case the cache entry is de-allocated.
Recently, there has been an increase in demand on processors to provide high performance for graphics applications, especially three-dimensional graphics applications. This demand arises mainly because graphics applications tend to cause the processor to move large amounts of data (e.g., display data) from cache and/or system memory to a display device. This data, for the most part, is used once or at most only a few times (referred to as "non-reusable data").
For example, assume a cache set with two ways, one with data A and another with data B. Assume further that data A, data B, and data C target the same cache set, and also assume that a program reads and writes data A and data B multiple times. In the middle of the reads and writes of data A and data B, if the program performs an access of non-reusable data C, the cache will have to evict, for example, data A from line one and replace it with data C. If the program then tries to access data A again, a cache "miss" occurs, in which case data A is retrieved from external memory and data B is evicted from line two and replaced with data A. If the program then tries to access data B again, another cache "miss" occurs, in which case data B is retrieved from external memory and data C is evicted from line one and replaced with data B. Since data C is non-reusable by the program, this procedure wastes a considerable number of clock cycles, decreases efficiency, and pollutes the cache.
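The eviction sequence in the preceding example can be reproduced with a small simulation. The following is a minimal sketch, not part of the described apparatus: it assumes a least-recently-used (LRU) replacement policy, which the example does not specify, and the names CacheSet, run, and the allocate and bypass parameters are hypothetical. It compares the miss count when data C is allocated in the set with the miss count when the access to data C is not allocated.

```python
from collections import OrderedDict


class CacheSet:
    """One set of a set-associative cache. LRU replacement is an assumption."""

    def __init__(self, ways=2):
        self.ways = ways
        self.lines = OrderedDict()  # tags ordered least- to most-recently used
        self.misses = 0
        self.evicted = []           # victims, in eviction order

    def access(self, tag, allocate=True):
        """Return True on a hit. On a miss, allocate a line only if requested."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        self.misses += 1
        if allocate:
            if len(self.lines) >= self.ways:
                victim, _ = self.lines.popitem(last=False)  # evict the LRU line
                self.evicted.append(victim)
            self.lines[tag] = None
        return False


def run(trace, bypass=()):
    """Replay a trace of tags; tags listed in `bypass` never allocate a line."""
    cache = CacheSet(ways=2)
    for tag in trace:
        cache.access(tag, allocate=tag not in bypass)
    return cache


# A and B are reused; C is the non-reusable data that maps to the same set.
trace = ["A", "B", "C", "A", "B"]
polluted = run(trace)                # C is allocated like any other data
bypassed = run(trace, bypass={"C"})  # C is handled without allocating a line

print(polluted.misses, polluted.evicted)  # 5 ['A', 'B', 'C']
print(bypassed.misses, bypassed.evicted)  # 3 []
```

Under these assumptions, allocating data C turns the two later accesses to data A and data B into misses, whereas leaving data C out of the set keeps both of them as hits.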
Therefore, there is a need in the technology for a method and apparatus to efficiently write non-reusable data to external memory without polluting cache memory.
In addition to the processor, memory and bus bandwidth are a further bottleneck in data intensive applications such as three-dimensional applications. That is, data intensive applications require a considerable number of bus transactions to and from system memory.
Therefore, there is an additional need in the technology for a method and apparatus to efficiently write non-reusable data to external memory without polluting cache memory while minimizing bus transactions.