1. Technical Field
The present invention relates to a system and method for efficient implementation of a software-managed cache. More particularly, the present invention relates to a system and method for using a conditional data select instruction and a zero-length data transfer operation to eliminate a conditional branch instruction in a software-managed cache.
2. Description of the Related Art
Certain processing tasks, such as encoding or decoding video information, involve multiple references to data elements that are confined to a relatively small data block (e.g., a cache line). For example, video software may repeatedly reference incoming color information from a “macro-block” and, in this case, the references are considered to have “strong spatial locality.”
In addition, certain processing tasks, such as rendering software generating a texture image, involve a series of read-only references to a particular data block. For example, the rendering software may sample and filter (average) multiple nearby “texels.” Again, such references are considered to have strong spatial and temporal locality, since multiple nearby texels are referenced one after another in order to perform the filtering operation.
One approach to handling the above processing tasks is to use a processor that 1) executes software that supports logically complex tasks, and 2) is fast and capable enough to process significant amounts of data. Such a processor, however, may be large and complex, include a coherent data cache, and operate at a high frequency. Unfortunately, such processors are typically neither power efficient nor cost effective.
Another approach to handling such tasks is to divide the processing workload among one or more “simple processing elements,” each of which has a small but high speed local memory, coherent asynchronous DMA capability, a large register file, and a SIMD ISA in order to allow high compute performance at improved size/power efficiency. To achieve this higher performance and efficiency, however, simple processors discard much of the complexity of a larger processing core by eliminating hardware caches, load/store capabilities, and branch prediction logic. As such, simple processors may use a “software-managed cache” that uses a set of data blocks to reduce the latency of moving data from main memory to local memory. The software-managed cache may be implemented as direct mapped, n-way set associative, or fully associative, depending upon requirements.
Data references to a software-managed cache occasionally miss, for example when the cache is in its initial state. When a cache miss occurs, existing art “conditionally branches” to a “cache miss handler,” which updates the cache with the requested data block. A challenge found, however, is that conditionally branching to a cache miss handler creates performance bottlenecks for a simple processing element, because the simple processor does not have the advanced branch prediction logic that is often part of a more complex processor. As a result, the simple processor stalls while it fetches the next instruction, because the branch target is typically not in line with the current instruction stream.
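The conditional-branch approach described above can be illustrated with a minimal sketch in C. It is not taken from any particular implementation: the direct-mapped organization, the line size, the line count, and all names (sw_cache_read, cache_miss_handler, main_memory, local_store) are assumptions made for illustration, and memcpy merely stands in for the DMA transfer from main memory to local memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All sizes and names below are hypothetical, chosen only for illustration. */
#define LINE_SIZE 128u            /* bytes per cache line */
#define NUM_LINES 64u             /* lines in the direct-mapped cache */
#define MAIN_SIZE (64u * 1024u)   /* size of the simulated main memory */

static uint8_t  main_memory[MAIN_SIZE];            /* stands in for system memory */
static uint8_t  local_store[NUM_LINES][LINE_SIZE]; /* cached data blocks */
static uint32_t tags[NUM_LINES];                   /* line address held by each slot */
static uint8_t  valid[NUM_LINES];                  /* 0 = slot is empty */
static unsigned miss_count;                        /* number of misses handled */

/* Return the cache to its initial state, in which every reference misses. */
static void sw_cache_reset(void)
{
    memset(valid, 0, sizeof valid);
    miss_count = 0;
}

/* Cache miss handler: update the cache with the requested data block.
 * memcpy is a placeholder for a DMA transfer into local memory. */
static void cache_miss_handler(uint32_t index, uint32_t line_addr)
{
    memcpy(local_store[index], &main_memory[line_addr], LINE_SIZE);
    tags[index]  = line_addr;
    valid[index] = 1;
    miss_count++;
}

/* Read one byte through the cache. */
static uint8_t sw_cache_read(uint32_t ea)
{
    assert(ea < MAIN_SIZE);                           /* stay inside simulated memory */

    uint32_t line_addr = ea & ~(LINE_SIZE - 1u);      /* start address of the line */
    uint32_t index     = (ea / LINE_SIZE) % NUM_LINES; /* direct-mapped slot */
    uint32_t offset    = ea & (LINE_SIZE - 1u);       /* byte within the line */

    /* The conditional branch discussed above: on a miss, control leaves the
     * straight-line instruction stream to run the miss handler. */
    if (!valid[index] || tags[index] != line_addr) {
        cache_miss_handler(index, line_addr);
    }
    return local_store[index][offset];
}
```

In this sketch every call to sw_cache_read contains a data-dependent branch, so a sequence of closely spaced cache queries produces a sequence of closely spaced conditional branches.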
Existing art may insert branch hint instructions to instruct the simple processor to prefetch instructions. A challenge found, however, is that this approach is not effective when multiple conditional branch instructions follow closely one after another, such as in the case of successive cache queries (e.g., video processing and texture mapping).
What is needed, therefore, is a system and method that effectively handles cache misses in a simple processing element.