Accessing computer memory for image and video processing functions imposes different requirements than accessing computer memory for carrying out general-purpose computing functions. In particular, many image/video processing functions are characterized by high spatial locality, meaning that the functions require access to pieces of data that are stored in close proximity to each other within memory. Typically, image data are stored in consecutive blocks of memory, and image functions, such as frame averaging and two-dimensional transposition, generally require sequential access to the consecutive blocks of data. However, image/video processing functions characteristically have little temporal locality, meaning that these functions typically do not need to reuse the same pieces of data after a short period of time. For example, functions such as frame averaging and two-dimensional transposition generally do not reuse the same blocks of data after a short period of time.
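The access pattern described above can be illustrated with a minimal sketch of frame averaging in C. The function name frame_average and the 8-bit pixel format are assumptions made for illustration only. Each input pixel is read exactly once, in address order, which is high spatial locality with essentially no temporal locality.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: average two 8-bit frames of n pixels each into out.
   Every element of a and b is touched once, in increasing address order,
   and is never revisited. */
void frame_average(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Widen before adding so the sum cannot overflow 8 bits;
           the +1 rounds to nearest instead of truncating. */
        out[i] = (uint8_t)(((unsigned)a[i] + (unsigned)b[i] + 1u) >> 1);
    }
}
```

Because no pixel is read a second time, a cache holding this data gains nothing from reuse; any benefit comes only from neighboring pixels sharing a cache line.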
Cache memories are well suited to temporarily store data for repeated access by a processor. Thus, cache memories are best employed when functions are executed that have sufficient temporal locality, so that the data stored in the cache can be reused often. However, caches are not well suited for functions having primarily spatial locality. The ability of caches to exploit spatial locality is limited due to the relatively small size of cache lines, where a cache line is the smallest unit of memory that can be transferred between main memory and the cache. (Cache lines are also sometimes referred to as cache blocks.)
Many media processors try to overcome the limitations of caches by replacing or supplementing them with direct memory access (DMA) controllers. Double buffering has become a popular programming technique when utilizing DMA controllers and takes advantage of the static and simple memory references in most image/video computing functions. With double buffering, the DMA controller transfers data to an on-chip buffer while the processor uses data stored in another on-chip buffer as its input. The roles of the two buffers are switched when the DMA controller and the processor are finished with their respective buffers.
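The buffer-swapping scheme described above can be sketched in C as follows. This is a simplified illustration, not the programming interface of any particular media processor: dma_start and dma_wait are hypothetical stand-ins (implemented here with a synchronous memcpy), BLOCK is an assumed on-chip buffer size, and summing the bytes stands in for the actual image computation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 256  /* assumed on-chip buffer size in bytes */

/* Hypothetical DMA stand-ins. Here the copy completes immediately; on real
   hardware dma_start would return at once and dma_wait would block until
   the transfer finished. */
static void dma_start(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}
static void dma_wait(void) { }

/* Sum n bytes of src ("off-chip" memory) using two "on-chip" buffers. */
uint64_t sum_double_buffered(const uint8_t *src, size_t n)
{
    static uint8_t buf[2][BLOCK];
    uint64_t total = 0;
    int cur = 0;                              /* buffer the processor reads */
    size_t cur_len = n < BLOCK ? n : BLOCK;
    size_t offset;

    if (n == 0)
        return 0;

    dma_start(buf[cur], src, cur_len);        /* prime the first buffer */
    dma_wait();
    offset = cur_len;

    while (cur_len > 0) {
        size_t remaining = n - offset;
        size_t next_len = remaining < BLOCK ? remaining : BLOCK;

        if (next_len > 0)                     /* fetch the next block ... */
            dma_start(buf[cur ^ 1], src + offset, next_len);

        for (size_t i = 0; i < cur_len; i++)  /* ... while computing on this one */
            total += buf[cur][i];

        if (next_len > 0)
            dma_wait();                       /* both sides finished */
        offset += next_len;
        cur ^= 1;                             /* switch buffer roles */
        cur_len = next_len;
    }
    return total;
}
```

Even in this small sketch, the transfer bookkeeping (offsets, lengths, buffer roles, synchronization points) takes more code than the computation itself.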
Double buffering overlaps computation and memory transfers. This overlap hides memory latency very effectively. In addition, the memory bandwidth obtained is typically higher with DMA transfers than that obtained when fetching cache lines. There are two reasons for this. First, most modern main memory designs enable the address and data phases to be decoupled, so that addressing and data access periods can be overlapped. An example of this type of memory is RAMBUS™ dynamic random access memory (RDRAM). These main memories typically operate most efficiently when the supply of read addresses is uninterrupted and pipelined, which is possible with DMA data transfers. A continuous supply of addresses is more difficult to guarantee when using a cache, because a cache miss only results in a few words of data being loaded from main memory. In fact, a continuous supply of addresses is impossible unless the cache is non-blocking, meaning that the processor is not blocked (stalled) from continuing to execute subsequent instructions during a cache miss. Of course, the processor is allowed to execute subsequent instructions only if the subsequent instructions do not use the data being loaded by a cache miss service. Second, in double buffering, a block of data is typically large enough that the DMA controller will fetch a longer portion of a dynamic random access memory (DRAM) page than would be fetched during a cache miss. Since DRAMs are most efficient when accessing data within a page, double buffering also improves the data transfer bandwidth.
The use of double buffering enables computation-bound functions to minimize memory stalls, since it effectively hides the memory latency behind continued computing time. For memory-bound functions, efficient bandwidth utilization directly translates into better performance, because execution time is highly correlated with the memory bandwidth obtained.
The disadvantage of using DMA controllers for double buffering is that they make programming significantly more difficult. A DMA controller must be programmed separately from the main data processing. The DMA controller must also be properly synchronized to the program running on the functional units. The programmer must keep track of where the data are stored and explicitly perform transfers between on-chip and off-chip memories. Current compiler technologies are unable to simplify most of these tasks. Thus, a substantial part of the programming effort expended in developing an image computing function is directed to establishing correct and efficient DMA data transfers.
It would be desirable for a cache to mimic the efficient memory addressing characteristics of functions running on a DMA controller to ensure that memory bandwidth utilization is high, while avoiding the need for difficult and time-consuming DMA programming. It would also be desirable to prefetch blocks of data larger than a cache line sufficiently early to reduce cache miss penalties.
A particular concern with prefetching large blocks of memory is that a misprediction of the data that are needed will result in a large amount of useless data being transferred to the processor, since a prefetch is useful only when the prefetched data are employed by the processor before the data are replaced. High prefetching accuracy is therefore needed to avoid useless prefetches. Achieving a high accuracy in this task by using suitable hardware would require significant on-chip space, and it might take a significant amount of time for the hardware to collect the necessary information, such as memory addresses, from run-time information. Any delay in this decision-making process will incur cache misses early in the execution.
For these reasons, it would be desirable to use compile-time information to aid in prefetching. Preferably, such compile-time information would be determined indirectly from instructions (hints) provided by a programmer or compiler. For example, hints provided by the programmer or compiler could identify the region of data and a general direction in which to prefetch the data. This concept of providing programmed hints is referred to herein as program-directed prefetching (PDP). Although PDP requires the programmer's active role in creating the hints, the programming effort can be significantly reduced since the programmer does not have to deal with the complicated data transfer synchronization problem. Furthermore, since no DMA programming interface, which is architecture dependent, would be required, the portability of functions would be improved by providing a cache prefetcher mechanism such as PDP.
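To make concrete the kind of information such a hint might carry, the following C sketch encodes a data region and a general prefetch direction, and computes which neighboring block to prefetch for a given access. All names here (pdp_hint, pdp_direction, pdp_next_block) and the block-granular address arithmetic are assumptions introduced for illustration; the text above states only that a hint could identify a region and a direction.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical direction encoding; the text only says "a general direction". */
typedef enum { PDP_FORWARD, PDP_BACKWARD } pdp_direction;

/* Hypothetical hint: a region of data plus the direction to prefetch in. */
typedef struct {
    uintptr_t     base;    /* first byte of the hinted region */
    size_t        length;  /* size of the region in bytes */
    pdp_direction dir;
} pdp_hint;

/* For an access at addr, compute the start address of the adjacent block
   (of the given size) in the hinted direction. Returns false when addr lies
   outside the region or the neighboring block would fall outside it, so no
   prefetch is issued beyond the data the hint covers. */
bool pdp_next_block(const pdp_hint *h, uintptr_t addr, size_t block,
                    uintptr_t *next)
{
    if (block == 0 || addr < h->base || addr - h->base >= h->length)
        return false;

    size_t index = (addr - h->base) / block;         /* block holding addr */
    size_t count = (h->length + block - 1) / block;  /* blocks in region */

    if (h->dir == PDP_FORWARD) {
        if (index + 1 >= count)
            return false;
        *next = h->base + (index + 1) * block;
    } else {
        if (index == 0)
            return false;
        *next = h->base + (index - 1) * block;
    }
    return true;
}
```

Under these assumptions the programmer supplies only the region and direction once; no transfer scheduling or buffer synchronization appears in the program.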