Cache memory is memory configured to provide data to one or more processors very quickly. Data stored in cache memory can be accessed more quickly than data stored in slower memory, but cache memory is more expensive. Accordingly, cache memory is generally limited in size and is typically reserved for storing data that is most likely to be needed by the one or more processors during execution of a program. The selection of data for the cache generally relies on the data exhibiting some form of locality. For example, access patterns may exhibit spatial locality. In other words, data likely to be needed next in execution of a program is often stored near data that was just used. Populating the cache with data stored near recently used data therefore increases the likelihood that the cache holds the data needed next. Further, access patterns may exhibit temporal locality. In other words, data that was just used may soon be needed again.
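By way of illustration only, the two forms of locality described above may be sketched in C as follows. The function names sum_row_major and sum_col_major and the flat, row-major array layout are assumptions made for this example and do not come from any particular implementation. Both functions compute the same result; they differ only in the order in which memory is visited.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major walk over a rows x cols grid stored contiguously.
   Consecutive iterations touch adjacent addresses, so the accesses
   exhibit spatial locality. The accumulator `total` is reused on
   every iteration, which is a simple case of temporal locality. */
long sum_row_major(const int *grid, size_t rows, size_t cols) {
    assert(grid != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            total += grid[r * cols + c];
        }
    }
    return total;
}

/* Column-major walk over the same grid. Consecutive iterations are
   cols * sizeof(int) bytes apart, so for a wide grid each access may
   land in a different cacheline and spatial locality is weaker. */
long sum_col_major(const int *grid, size_t rows, size_t cols) {
    assert(grid != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            total += grid[r * cols + c];
        }
    }
    return total;
}
```

For a grid much larger than the cache, the row-major walk would generally be expected to incur fewer cache misses than the column-major walk, although the actual difference depends on the cacheline size, the cache capacity, and any hardware prefetching present.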
Traditional hardware prefetchers suffer from prefetch redundancy and prefetch insertion timing problems. Typical software-based preload instructions can preload only one cacheline per instruction. In order to calculate the address of each cacheline, such software requires additional load/store and ALU instructions, which severely diminish the benefit of the preloads. Some implementations do have a preload engine, but such an engine targets a specific address pattern and may therefore provide little benefit in real-world cases.
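The overhead of software-based preloading described above may be sketched, by way of example only, with the __builtin_prefetch builtin provided by the GCC and Clang compilers. The function name gather_sum, the constant PREFETCH_DISTANCE, and its value of 8 are hypothetical and chosen purely for illustration. On typical targets each call to the builtin compiles to at most one preload instruction covering a single cacheline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical look-ahead distance, in loop iterations. A real value
   would need to be tuned for the target memory latency. */
#define PREFETCH_DISTANCE 8

/* Sums data[idx[0]] .. data[idx[n-1]], issuing one software preload
   per iteration for an element needed PREFETCH_DISTANCE iterations
   later. */
long gather_sum(const int *data, const size_t *idx, size_t n) {
    assert(data != NULL && idx != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Forming the preload address costs an extra load of
               idx[i + PREFETCH_DISTANCE] plus address arithmetic,
               and the preload itself covers only one cacheline.
               Arguments: address, 0 = read access, 1 = low temporal
               locality hint. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        total += data[idx[i]];
    }
    return total;
}
```

In this sketch, every iteration that issues a preload also executes a bounds comparison, an additional index load, and an address computation, which illustrates how the supporting instructions can offset the latency saved by the preload.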