The present invention generally relates to a computing system. More particularly, the present invention relates to prefetching data to improve the performance of the computing system.
Prefetching refers to a technique used in a processor to improve processor speed. Traditionally, prefetching places data in a cache memory before the data is needed. Thus, when the data is needed, the data can be provided to the processor more quickly because the data already resides in the cache memory before being requested.
Traditionally, in a parallel computing system (e.g., IBM® Blue Gene®/L or Blue Gene®/P, etc.), a prefetch engine (i.e., a hardware module performing the prefetching) prefetches a fixed number of data streams with a fixed depth (i.e., a certain number of instructions, or a certain amount of data, to be fetched ahead) per processor core or per thread. However, this traditional prefetch engine fails to adapt to the data rate or speed (e.g., 100 megabytes per second) of each data stream. Nor does this traditional stream prefetch engine prefetch the proper data (i.e., data to be consumed by a processor core) ahead of time when memory accesses follow a complex pattern of non-consecutive memory addresses, or when a processor core runs code in a repetitive manner (e.g., a “for” or “while” loop).
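The limitation described above can be illustrated with a minimal software sketch. It is a toy model written for illustration, not the hardware of any actual system: the function name `simulate_stream_prefetch`, the unbounded cache, and the one-address-per-line granularity are all assumptions made here. On every access, the model prefetches the next `depth` consecutive addresses and counts how many accesses find their data already present.

```python
def simulate_stream_prefetch(addresses, depth=4):
    """Count cache hits for a toy fixed-depth stream prefetcher.

    Assumptions (illustrative only): the cache is an unbounded set,
    each address is one cache line, and prefetches complete instantly.
    """
    cache = set()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
        # Bring in the demanded line itself.
        cache.add(addr)
        # Fixed-depth stream prefetch: the next `depth` consecutive lines.
        for k in range(1, depth + 1):
            cache.add(addr + k)
    return hits


# Consecutive addresses: every access after the first is a hit.
consecutive_hits = simulate_stream_prefetch(list(range(16)), depth=4)

# Non-consecutive addresses: the prefetched lines are never the ones used.
scattered_hits = simulate_stream_prefetch([0, 100, 37, 512, 9, 260], depth=4)
```

Under these assumptions, the consecutive stream hits on all but its first access, while the scattered sequence never hits, because none of its addresses lies within `depth` lines after an earlier access.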
Therefore, it is desirable to improve the performance of a parallel computing system by operating at least two different prefetch engines, each of which prefetches a different set of data stored in a memory device according to one of two different types of access pattern: consecutive addresses, or a random block of addresses accessed in a pattern in which the same memory block is repeatedly accessed.
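As a rough illustration of why two kinds of engine might help, the following sketch compares a consecutive-address (stream) policy with a record-and-replay (list) policy on a loop that repeatedly touches the same scattered addresses. It is a hypothetical software model, not the claimed hardware: the names `TinyCache` and `run_loop`, the LRU replacement policy, the small cache capacity, and the record-on-first-iteration scheme are all assumptions introduced for this example.

```python
from collections import OrderedDict


class TinyCache:
    """A small LRU cache of addresses (hypothetical, for illustration)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def contains(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        return False

    def fill(self, addr):
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        while len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used


def run_loop(pattern, iterations, capacity, depth, use_list):
    """Replay `pattern` for several iterations and count cache hits.

    use_list=False: prefetch the next `depth` consecutive addresses.
    use_list=True:  record the pattern during the first iteration, then
                    prefetch the next `depth` recorded addresses.
    """
    cache = TinyCache(capacity)
    recorded = []
    hits = 0
    for it in range(iterations):
        for i, addr in enumerate(pattern):
            if cache.contains(addr):
                hits += 1
            else:
                cache.fill(addr)
            if use_list:
                if it == 0:
                    recorded.append(addr)  # learning pass, no prefetch yet
                else:
                    for k in range(1, depth + 1):
                        cache.fill(recorded[(i + k) % len(recorded)])
            else:
                for k in range(1, depth + 1):
                    cache.fill(addr + k)
    return hits


# A scattered block of addresses accessed repeatedly, as in a loop body.
pattern = [0, 100, 37, 512, 9, 260, 75, 900]
stream_hits = run_loop(pattern, iterations=3, capacity=4, depth=2, use_list=False)
list_hits = run_loop(pattern, iterations=3, capacity=4, depth=2, use_list=True)
```

In this model the stream policy never hits, because the cache is too small to retain a line until the loop comes back to it and the prefetched neighbors are never used. The list policy hits on most accesses once the first iteration has been recorded. A consecutive-address workload would favor the stream policy instead, since it needs no learning pass.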