In the context of a processor (CPU), streaming data is generally a sequence of store or write instructions that write data to consecutive or contiguous memory locations in virtual address space. Often, a large block of data is moved or stored to memory via a series of write or store operations. A typical example of streaming data, or “store streaming,” is a “memory copy,” a commonly used method that copies a block of memory from a source location to a destination location. In hardware, this method translates to a stream of load or read operations that fetch data from the source location, followed by a stream of store or write operations that copy the loaded data to the destination location. Some applications may simply utilize store streaming to initialize a large block of memory.
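As an illustration only, the memory copy and memory initialization cases described above can be sketched in C. The function names copy_block and fill_block are hypothetical and chosen for this example; a real library routine such as memcpy is typically far more optimized. The sketch assumes the source and destination regions do not overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies n bytes from src to dst.
 * Each iteration performs one load from the source and one store to the
 * next consecutive destination address, so the destination sees a
 * contiguous stream of stores. Assumes the regions do not overlap. */
void copy_block(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t value = src[i]; /* load (read) from the source location */
        dst[i] = value;         /* store (write) to the destination location */
    }
}

/* Hypothetical helper: initializes n bytes at dst to a single value.
 * This is a pure store stream, with no loads from a source block. */
void fill_block(uint8_t *dst, uint8_t value, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = value;         /* store to consecutive addresses */
    }
}
```

In both helpers the stores walk through consecutive addresses, which is the access pattern that the term “store streaming” refers to here.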
At times, these store streams are non-temporal. That is, the data is often referenced only once and then not reused in the immediate future. For example, a typical memory copy operation may involve moving several kilobytes or megabytes of data that may only be referenced once during program execution. Caching the store data within the processor's caches (e.g., a level 1 (L1) cache, a level 2 (L2) cache, or a level 3 (L3) cache) can displace other useful cache-resident data and thus be detrimental to performance.
Often, to avoid cache pollution, applications may attempt to provide an indication (e.g., through an instruction operation code or a memory type) that lets the hardware know that the streamed data is not to be cached. However, there may be instances when such an indication is not available within the instruction set. To address such concerns, many hardware designs incorporate a mechanism that dynamically detects the pattern of stores and looks for store streaming patterns of large sizes, in order to stream them directly to system memory.
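A software model may help make the idea of dynamic store-stream detection concrete. The following is a minimal sketch under stated assumptions, not a description of any particular hardware design: the stream_detector structure, its field names, and the byte threshold are all hypothetical. The model tracks a single run of contiguous stores and reports a store as streaming once the run reaches the threshold.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a store-stream detector (assumption: one
 * tracked stream, byte-granular threshold). */
typedef struct {
    uint64_t next_addr; /* address expected if the current run continues */
    uint64_t run_bytes; /* bytes seen so far in the current contiguous run */
    uint64_t threshold; /* run length at which stores count as streaming */
} stream_detector;

void detector_init(stream_detector *d, uint64_t threshold)
{
    d->next_addr = 0;
    d->run_bytes = 0;
    d->threshold = threshold;
}

/* Observes one store of `size` bytes at `addr`.
 * Returns true when the current contiguous run has reached the
 * threshold, meaning the model would send this store directly to
 * system memory instead of caching it. */
bool detector_observe(stream_detector *d, uint64_t addr, uint64_t size)
{
    if (d->run_bytes > 0 && addr == d->next_addr) {
        d->run_bytes += size; /* store extends the current run */
    } else {
        d->run_bytes = size;  /* non-contiguous store starts a new run */
    }
    d->next_addr = addr + size;
    return d->run_bytes >= d->threshold;
}
```

With a threshold of 256 bytes and 64-byte stores to consecutive addresses, this model flags the fourth store as streaming. A store to an unrelated address resets the run.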
Hardware store streaming detection of this kind tends to avoid the general problem of cache pollution. However, some applications (e.g., compilers) tend to temporally re-access store streams of reasonably large sizes that would otherwise fit within the L2 or L3 caches. For such applications, caching would have been more beneficial. With the traditional hardware detection approach, though, those store streams are written to memory repeatedly, consuming system memory bandwidth and power and forgoing the benefits of cache storage.
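The bandwidth cost described above can be illustrated with a deliberately simplified model. The function memory_write_bytes and its parameters are hypothetical. The model assumes a write-back cache, a stream that is rewritten in full on every pass, one eventual write-back of a cached stream, and no other traffic competing for the cache. It is a back-of-the-envelope sketch, not a performance model of real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical estimate of bytes written to system memory when a store
 * stream of `stream_bytes` is rewritten `passes` times.
 *  - bypass == true:  every pass is streamed directly to memory.
 *  - bypass == false: the stream is cached; if it fits in `cache_bytes`,
 *    it is assumed to be written back to memory only once. If it does
 *    not fit, each pass is assumed to evict the previous one. */
uint64_t memory_write_bytes(uint64_t stream_bytes, uint64_t passes,
                            uint64_t cache_bytes, bool bypass)
{
    if (passes == 0) {
        return 0; /* nothing was written */
    }
    if (bypass || stream_bytes > cache_bytes) {
        return stream_bytes * passes; /* every pass reaches memory */
    }
    return stream_bytes; /* single write-back of the cached stream */
}
```

Under these assumptions, a 512 KB stream rewritten 10 times with a 2 MB cache produces 5,242,880 bytes of memory writes when bypassed, versus 524,288 bytes when cached.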