Many modern microprocessors offer a method for preloading data into the cache before it is required, so that memory access latencies can be hidden. However, if cache preloading is not performed efficiently, it can actually slow down rather than accelerate overall processing. This holds for both trivial cases (such as a word-by-word memory copy) and more complex ones (such as bilinear scaling of graphics). A further factor that can make preloading less efficient is that, in some processors, such as Advanced RISC Machine (ARM) processors, preload instructions cannot be made conditional. This can make preload behavior rather unpredictable and inefficient.
Even where cache preloading is used, it is typically applied in a simplistic and inefficient way. For example, in one approach, a predetermined number of pixels is always preloaded ahead of the pixels currently being processed: after loading the data for pixel “i,” the data for pixel “i+n+1” is preloaded, where n is the number of pixels that can be processed in the time it takes to preload one cache line. Such an approach may not provide much, if any, advantage, because data beyond the end of a line of pixels is preloaded but never used, while data at the beginning of a line is never preloaded at all. For these and other reasons, cache preload instructions are not widely used in actual practice.
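The fixed-distance approach described above can be sketched in C using the GCC/Clang `__builtin_prefetch` intrinsic (which compiles to an instruction such as ARM's PLD where available). The prefetch distance `PRELOAD_AHEAD` and the `brighten` per-pixel kernel are illustrative assumptions, not taken from the source; the point of the sketch is only to make the two drawbacks visible in code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed prefetch distance in pixels: stands in for "n" (plus one),
 * the number of pixels processed while one cache line is filled. */
#define PRELOAD_AHEAD 16

/* Trivial per-pixel operation standing in for real processing. */
static uint8_t brighten(uint8_t p)
{
    return (uint8_t)(p < 205 ? p + 50 : 255);
}

/* Fixed-distance preloading: while processing pixel i, unconditionally
 * issue a prefetch for pixel i + PRELOAD_AHEAD. */
void process_row(const uint8_t *src, uint8_t *dst, size_t width)
{
    for (size_t i = 0; i < width; i++) {
        /* Note the two drawbacks described in the text: near the end
         * of the row this touches addresses past the last pixel that
         * will ever be used, and the first PRELOAD_AHEAD pixels of the
         * row are processed without ever having been preloaded. */
        __builtin_prefetch(&src[i + PRELOAD_AHEAD], /*rw=*/0, /*locality=*/0);
        dst[i] = brighten(src[i]);
    }
}
```

Because the prefetch cannot be made conditional (or is deliberately left unconditional to avoid branch overhead), the loop cannot simply skip the wasted prefetches at the end of each row without restructuring the code, which is precisely the inefficiency the passage criticizes.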