Although the computational power of programmable processors is still increasing, in that operations are executed at higher rates by increasing clock frequencies, or in parallel with other operations by using parallel architectures such as Very Long Instruction Word (VLIW) processors or superscalar processors, the overall performance of systems based on these processors is often hampered by limitations in the bandwidth of peripheral devices, such as an Input/Output (I/O) device, a physical memory or a data bus. In order to alleviate these limitations, (multilevel) caches may be incorporated in these systems to keep data local to the processor as much as possible, thereby decreasing the data bandwidth required for retrieving data from more distant parts of the system, as disclosed in U.S. Pat. No. 6,574,707 B2. Furthermore, whenever bus or memory bandwidth is needed, e.g. on a cache miss, this bandwidth is used efficiently by means of a so-called burst operation, in which multiple data elements are packed into a single atomic operation that requires less control overhead than separate single element operations. Typically, programmable processors generate I/O requests via read operations and write operations that each work on a single data element. A cache automatically converts these single data element operations into burst operations, since it provides an interface through which the processor is serviced using single element operations, while the other parts of the system are typically accessed using burst operations. This holds in particular for read operations: in case of a cache miss, the cache fetches the entire cache line containing the requested data from the system using one or more burst operations. Processor writes in the presence of a cache may result in either burst behavior or single element accesses, depending on the cache write policy used.
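The read-side conversion described above, in which single element reads from the processor are turned into whole-line burst fetches from the system, can be illustrated with a minimal sketch. The class name `ReadCache`, the line size `LINE_WORDS` and the unbounded cache capacity are assumptions made purely for illustration and are not taken from the cited patent or from any particular cache design.

```python
LINE_WORDS = 8  # hypothetical cache line size, in data elements


class ReadCache:
    """Toy read cache: single element reads in, whole-line bursts out."""

    def __init__(self, memory):
        self.memory = memory        # backing store, modelled as a list of words
        self.lines = {}             # line number -> list of cached words
        self.burst_fetches = 0      # number of burst operations issued

    def read(self, word_addr):
        line, offset = divmod(word_addr, LINE_WORDS)
        if line not in self.lines:
            # Read miss: fetch the entire line in one burst operation.
            start = line * LINE_WORDS
            self.lines[line] = self.memory[start:start + LINE_WORDS]
            self.burst_fetches += 1
        # Hit (possibly after the fill): serve a single element locally.
        return self.lines[line][offset]
```

In this sketch, reading 32 consecutive words one at a time causes only four burst fetches, because each miss brings in the seven neighbouring words as well.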
While a cache using a “write-through” policy will write single data elements to the system, a cache using the so-called “write-back” policy will predominantly write complete cache blocks to the system in burst mode. As long as a write hits in the cache, only the data in the cache are updated. Only when a cache block that has been changed by such a write hit (i.e. has become “dirty”) has to be evicted from the cache, to make room for a newly fetched block to be stored at the same cache location, is the dirty block written back to the system. In the case of a write miss, the cache will either fetch the missing block of data and subsequently write to the fetched block in the cache (“write-back” with “write-allocate” policy), or the write will bypass the cache and a single data element will be written directly to the system (“write-back” with “no-write-allocate” policy).
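The write-back behavior described above can be sketched as a small model that only counts the traffic seen by the system. The direct-mapped organization, the names `WriteBackCache` and `LINE_WORDS`, and the sizes used are hypothetical choices made for illustration; the sketch tracks no data values, only which blocks are present and dirty.

```python
from dataclasses import dataclass, field

LINE_WORDS = 4  # hypothetical block size, in data elements


@dataclass
class WriteBackCache:
    """Toy direct-mapped write-back cache that counts system traffic."""
    num_lines: int = 2
    write_allocate: bool = True
    tags: dict = field(default_factory=dict)   # line index -> block number held
    dirty: set = field(default_factory=set)    # line indices holding dirty blocks
    burst_reads: int = 0     # whole-block fetches from the system
    burst_writes: int = 0    # whole-block write-backs to the system
    single_writes: int = 0   # single element writes that bypass the cache

    def _fill(self, block: int) -> None:
        index = block % self.num_lines
        if index in self.dirty:
            # The block being replaced is dirty: write it back in burst mode.
            self.burst_writes += 1
            self.dirty.discard(index)
        self.tags[index] = block
        self.burst_reads += 1

    def write(self, word_addr: int) -> None:
        block = word_addr // LINE_WORDS
        index = block % self.num_lines
        if self.tags.get(index) != block:      # write miss
            if not self.write_allocate:
                self.single_writes += 1        # bypass the cache
                return
            self._fill(block)                  # fetch the missing block first
        self.dirty.add(index)                  # the write updates only the cache
```

With these assumed sizes, writing 16 consecutive words under write-allocate produces four block fetches and two burst write-backs (the last two dirty blocks are still in the cache), whereas the same writes under no-write-allocate reach the system as 16 single element writes.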
Especially in cost-sensitive and low-power applications, the use of a cache has major drawbacks in terms of area and power dissipation overhead. Furthermore, for real-time systems, the dynamic behavior of a cache makes it difficult to predict guaranteed real-time performance. For signal processing applications that process data streams, a conventional cache usually offers little performance benefit, since data items are often read and written only once, so that no temporal data locality can be exploited. For these reasons, embedded systems may use only a relatively simple cache, or no cache at all. A relatively simple cache uses a write-through policy with no-write-allocate, i.e. in case data are written to a memory address present in the cache, the data are written both to the cache and to the memory, whereas in case data are written to a memory address not present in the cache, the data are written only to the memory, without being retrieved from the memory into the cache. In such embedded systems, hardwired accelerators are often designed such that they perform system I/O in a burst manner. Since these accelerators are tuned to a specific application, it is usually feasible to tune the accelerator to the system environments in which it will be applied, so that this burst behavior is ensured. The demand for more flexible systems-on-chip leads to a situation where programmable accelerators are increasingly used. Such programmable processors are often based on load/store architectures, in which the processor communicates with the system using read and write operations that work on single data elements. That is, under the control of a software program, each read operation specifies a single address from which a single data element matching the processor's data path width (e.g. a 32-bit word) is to be read, and each write operation specifies a single address at which a single data element is to be written.
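The write-through policy with no-write-allocate used by the relatively simple cache described above can be sketched as follows. For brevity the sketch caches individual words rather than whole lines and has unbounded capacity; the class name `WriteThroughCache` and the dictionary-backed memory are illustrative assumptions only.

```python
class WriteThroughCache:
    """Toy write-through, no-write-allocate cache over a dict-backed memory."""

    def __init__(self, memory):
        self.memory = memory       # backing store: address -> value
        self.lines = {}            # cached copies: address -> value
        self.memory_writes = 0     # single element writes sent to the memory

    def read(self, addr):
        if addr not in self.lines:
            # Read miss: reads do allocate an entry in the cache.
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        if addr in self.lines:
            # Write hit: update the cached copy as well.
            self.lines[addr] = value
        # Hit or miss, the data element is always written to the memory;
        # on a miss nothing is retrieved into the cache.
        self.memory[addr] = value
        self.memory_writes += 1
```

Every processor write in this model reaches the memory as a single element write, which is why such a cache does not by itself produce burst behavior on the write side.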
Although the programmer may map (signal processing) applications that access system data in a streaming manner onto such a processor, there is usually no way for the programmer to control how the processor accesses system data. If no cache is present in the system, the single data element operations go straight to, for example, the system bus or memory, leading to inefficient use of the available bandwidth due to the overhead of setting up a new transfer for each individual data item. This results in poor bus/memory bandwidth usage and may severely impact the overall system performance.
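The cost of setting up a new transfer for each individual data item can be made concrete with a simple cycle-count model. The function names and the figures used (four setup cycles per transfer, one cycle per data element) are hypothetical values chosen for illustration, not properties of any particular bus or memory.

```python
import math


def bus_cycles(n_elements, burst_length, setup_cycles=4, cycles_per_element=1):
    """Total bus cycles to move n_elements in bursts of burst_length elements."""
    transfers = math.ceil(n_elements / burst_length)
    return transfers * setup_cycles + n_elements * cycles_per_element


def efficiency(n_elements, burst_length, setup_cycles=4, cycles_per_element=1):
    """Fraction of bus cycles that actually carry data."""
    useful = n_elements * cycles_per_element
    return useful / bus_cycles(n_elements, burst_length,
                               setup_cycles, cycles_per_element)
```

Under these assumed figures, moving 64 elements one at a time takes 320 cycles, of which only 20% carry data, whereas bursts of eight elements take 96 cycles, raising the useful fraction to about 67%.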