1. Field of the Invention
The present disclosure generally relates to techniques and systems for enhancing operation of electronic devices comprising central processing units performing frequent data array accesses.
2. Description of the Related Art
The continuous advance in the field of semiconductor manufacturing has resulted in the fabrication of fast and powerful integrated circuits, which may include millions of individual transistor elements. Consequently, highly complex digital circuitry has been developed and used for designing and producing sophisticated central processing units (CPUs), wherein increased packing density in combination with reduced power consumption and high internal storage capacity has fueled a plurality of developments in integrating sophisticated CPUs into a wide variety of electronic devices.
Typically, a CPU may be operated on the basis of a dedicated byte code or machine code, which may result in a corresponding change of state of hardware components, such as registers, I/O (input/output) terminals and the like, in accordance with the sequence of machine code instructions. Thus, at the deepest level of a communication between an application and the hardware system, including the CPU or any other data and instruction processing unit, the corresponding sequence of byte code instructions has to be executed by the CPU, thereby providing the desired result in the form of register contents and the like. Due to continuous advances in the process technology typically used for forming complex integrated circuits such as CPUs, performance of digital circuitry has been significantly improved, thereby providing enormous computational resources for performing various tasks. For instance, very complex image processing applications, scientific calculations, including the modeling of complex situations, and the like may be performed on the basis of well-established computer systems including advanced microprocessors. One important aspect in enhancing performance of microprocessors and the like has been seen in continuously increasing the clock signal frequency, i.e., the speed of processing a sequence of machine code instructions one by one. This may typically be accomplished by reducing the feature sizes of individual transistor elements, thereby also reducing the resistance of critical signal paths and the like. Moreover, due to the reduced dimension, interrelated hardware modules may be positioned on a single semiconductor chip at high package density, thereby also contributing to superior operating speed. 
Moreover, a significant degree of parallelism may be implemented into microprocessor devices by providing a plurality of sub-modules, for instance for enabling the execution of a plurality of instructions in a more or less parallel manner and/or by accessing data arrays in a parallel way.
However, it turns out that the technological advances in microprocessor architectures and manufacturing techniques may not translate into corresponding advances in the performance of memory devices in the same way. For example, extremely high bit densities may be attained on the basis of certain memory architectures, such as dynamic RAM devices (random access memory), which, for instance, may comprise a single storage transistor in combination with a charge storing element, such as a capacitor. On the other hand, typically, the high bit density may be associated with a moderately long time interval for accessing the individual memory locations, for instance due to signal propagation delay caused by charging and discharging the storage elements, the corresponding conductive lines connecting the individual memory locations with each other and the like. Hence, despite sophisticated hardware environments, the advantages obtained by a very high operating speed of the central processing unit may be offset by the increased latency induced by the complex memory device. On the other hand, fast memory devices, such as registers and the like, provide very short access time, but may have a higher degree of complexity, for instance in terms of number of transistors, thereby requiring significant floor space on the semiconductor die, if moderately large memory areas are to be integrated into the semiconductor chip. For this reason, typically, appropriate memory space may be provided in close proximity to the processing modules of a processor, however, with a very restricted memory capacity, while other memory devices, such as dynamic RAM devices and the like, may be provided in peripheral areas of the semiconductor chip or may typically be provided as external devices, which may be connected to the central processing unit via an appropriate bus system.
Consequently, by implementing appropriate hardware and software components, the latency induced by a high density storage may be significantly reduced by using fast buffer memories, which may also be referred to as cache memories, into which frequently accessed memory locations of the main memory system may be copied and may thus be made available for the central processing unit without significant latencies. For instance, in very fast cache memories, the operating speed may be determined by the same clock frequency as is used in the CPU core. In order to use a cache memory in an efficient manner, advantage may be taken of the fact that, in a complex sequence of instructions representing any type of application, certain instructions may be frequently executed two or more times with only several other instructions being executed in between, so that a corresponding block of instructions may be maintained within a cache memory accessed by the CPU and may be dynamically adapted according to the advance of the program. Similarly, in many types of program sequences, the same memory location may be accessed several times within a very restricted sequence of program code so that the corresponding contents may be stored in a cache memory and may be efficiently accessed by the central processing unit at high speed. However, due to the very limited storage capacity of the cache memory, only a small part of the main memory may be maintained within the cache memory at a time.
Consequently, appropriate hardware and software strategies have been developed in order to obtain a high rate of “cache hits,” which may be considered as memory operations performed on memory locations, a copy of which is still maintained in the fast cache memory so that memory operations can be executed by using the cache. In other cases, large data arrays may have to be maintained in the main memory, for instance when storing digital images and the like, wherein usually the data may occupy a contiguous sub-array of the memory. Furthermore, in many types of programs, exhaustive data accesses may be required to operate on data arrays, wherein accessing one array item may be associated with accessing another array item that is positioned in the “neighborhood” of the previously-accessed memory location. Consequently, by copying a portion of the neighborhood of the memory location currently being accessed by the central processing unit into the cache memory, there is a high probability that one or more subsequent memory accesses may result in a cache hit. In this manner, the existing gap between microprocessor performance and performance of main memory systems, such as DRAM devices, may be reduced by using appropriate techniques designed to reduce or hide the latency of the main memory accesses on the basis of strategies as described above.
Although these strategies in combination with appropriately designed cache memory hierarchies, i.e., cache memories of different levels of performance, have been very effective in reducing latency for the most frequently accessed data, in many applications, the entire runtime may nevertheless be substantially determined by wait cycles of the central processing unit due to frequent memory accesses to the main memory system. For example, a plurality of scientific calculations, image processing applications and the like may include large data arrays, which may have to be frequently accessed. In this situation, performance of a computer system may be enhanced by additional strategies, such as optimizing the source code of the application under consideration and the like, wherein processor specific characteristics may be taken into consideration in order to optimize the available resources of the platform of interest. For example, one very efficient tool for optimizing an application is the so-called prefetching technique, in which instructions and/or data may be fetched from the main memory system ahead of the actual execution or processing of the instructions and data in the central processing unit. That is, in case of data prefetching, the main memory system may be accessed in order to copy into the cache memory a portion of a data array which is expected to be accessed later on in the program. Data prefetching techniques may be divided into two categories, that is, software initiated prefetching and hardware initiated prefetching.
Software initiated data prefetching may be considered as a technique in which additional instructions may be inserted into the initial program code, which may typically be accomplished on the basis of compiler modules, which convert an initial instruction set, typically provided as a source code written in a high level language, such as C++, Java, Fortran and the like, into a machine code instruction set that is executable by a specific microprocessor platform. For this purpose, typically, the platform may support a type of prefetch instruction which may result in a memory access in order to copy a memory location, typically in combination with the corresponding neighborhood, into the cache memory, while the central processing unit may still execute instructions which currently do not require the contents of the memory location that is presently prefetched. In order to obtain high efficiency of the data prefetching technique, two criteria are to be taken into consideration. First, the data to be prefetched should preferably represent data that would result in a “cache miss” at the time when the corresponding instruction referring to the memory location under consideration is executed. For example, any prefetch operations issued for data that are already in the cache memory would result in additional overhead and would contribute to enhanced complexity and thus increased run time. Second, the issuance of the prefetch operation during run time has to be appropriately scheduled so that the data of interest are in the cache memory when a corresponding memory access for this data is executed by the central processing unit.
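The two criteria above may be illustrated by simple arithmetic. The following C sketch is not taken from the present disclosure; the helper names prefetch_distance and starts_new_line, as well as the latency, cycle and cache line figures mentioned below, are hypothetical and serve for illustration only. The first helper estimates how many loop iterations ahead a prefetch may have to be issued so that the data arrive in time, and the second helper indicates when an array element begins a new cache line, so that prefetches for data already covered by an earlier prefetch may be skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (scheduling criterion): number of loop iterations
 * a prefetch should run ahead, computed as
 * ceil(miss_latency_cycles / cycles_per_iteration). */
size_t prefetch_distance(size_t miss_latency_cycles, size_t cycles_per_iteration)
{
    assert(cycles_per_iteration > 0);
    return (miss_latency_cycles + cycles_per_iteration - 1) / cycles_per_iteration;
}

/* Hypothetical helper (redundancy criterion): returns 1 if element i is
 * the first element of a cache line, assuming the array base address is
 * aligned to the cache line size. Only such elements need a prefetch,
 * since one prefetch brings in the whole line. */
int starts_new_line(size_t i, size_t elem_size, size_t line_size)
{
    assert(elem_size > 0 && line_size >= elem_size);
    return (i * elem_size) % line_size == 0;
}
```

For instance, assuming a miss latency of 200 cycles and a loop body of 25 cycles, prefetch_distance(200, 25) yields 8 iterations; with 8-byte elements and an assumed 64-byte cache line, only every eighth element would require its own prefetch.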
Consequently, an appropriate insertion of prefetch instructions into an existing program code may require a corresponding analysis of the program sequence, wherein any benefits and possible disadvantages caused by the additional instructions may also have to be balanced with respect to each other in order to obtain a significant performance gain during run time of the program.
Promising candidates for enhancing performance by insertion of additional prefetch instructions during compile time are program loops, in which a sequence of instructions may be frequently repeated. For example, when operating on a data array on the basis of one or more loops, which may represent nested loops, depending on the dimensionality of the data array under consideration, the memory accesses may depend on the loop variable, i.e., the loop counter, in a very predictable manner, so that corresponding memory addresses can be identified and data prefetch operations can be issued at an appropriate time, that is, an appropriate number of iterations of the loop of interest ahead, so that corresponding data may be available when accessed during a later iteration of the loop. Efficient data prefetching strategies during compilation of the source code have been developed in the context of optimizing loop processing by using a certain degree of parallelism in the program. For example, certain types of loops, or at least portions thereof, may allow parallel processing, for instance by operating in a parallel manner on data arrays which may initially be accessed in the source code by a single instruction.
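A minimal C sketch of such a loop is given below, assuming a compiler that provides the __builtin_prefetch built-in function (as GCC and Clang do). The function name sum_with_prefetch and the distance PREFETCH_AHEAD of 16 iterations are hypothetical and not tuned for any particular platform; the sketch merely illustrates issuing a prefetch for the array element that will be accessed a fixed number of iterations later.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance in loop iterations (illustrative, not tuned). */
#define PREFETCH_AHEAD 16

/* Sums n array elements while prefetching the element that will be
 * needed PREFETCH_AHEAD iterations later. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Stay inside the array; arguments: address, 0 = read access,
         * 3 = high temporal locality. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

The prefetch does not alter the result of the loop; it is only a hint to the memory system, so that the sum is the same with or without the additional instruction.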
FIG. 1 schematically illustrates a table containing a loop defined by a loop counter or loop variable i, which determines the number of iterations of the loop. In the example shown, the loop counter i varies between 1 and 1000 with a step width or stride 1. Furthermore, within the loop, memory accesses have to be performed on the basis of data arrays a, b and c. Consequently, in the instruction contained in the loop, the data arrays a, b, c have to be accessed in each iteration. Hence, by providing appropriate resources in a microprocessor, the loop of FIG. 1 may be vectorized by performing a plurality of memory accesses in parallel. For instance, if four data items may be processed in parallel, the loop may require only one-fourth of the initial iterations, thereby significantly enhancing overall performance. For this reason, in many processor architectures, appropriate resources are implemented, such as SIMD (single instruction, multiple data) instructions, which are highly efficient in increasing overall processing speed, which may even further be enhanced by data prefetching techniques in order to ensure that data required by the parallel processing are available in the cache memory at the appropriate point in time. Consequently, in sophisticated compiler systems, a mechanism for identifying appropriate candidates for data prefetching and for inserting the corresponding prefetch instructions into the sequence of instructions is usually tied to a vectorization phase during the compilation.
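The reduction of the iteration count may be sketched in plain C as follows. The loop body of FIG. 1 is not reproduced in the text above, so the statement a[i] = b[i] + c[i] used here is an assumption, as are the names add_arrays_by_four, LOOP_N and WIDTH. The sketch uses zero-based indices 0 to 999 instead of the loop counter range 1 to 1000, and processes four elements per iteration with ordinary scalar statements, which a vectorizing compiler would typically replace by a single SIMD instruction.

```c
#include <assert.h>
#include <stddef.h>

#define LOOP_N 1000  /* number of array elements, as in the example loop */
#define WIDTH  4     /* assumed number of data items processed in parallel */

/* Assumed loop body a[i] = b[i] + c[i], processed WIDTH elements at a
 * time. Returns the number of loop iterations actually executed. */
size_t add_arrays_by_four(float *a, const float *b, const float *c)
{
    assert(LOOP_N % WIDTH == 0);  /* no remainder loop needed */
    size_t iterations = 0;
    for (size_t i = 0; i < LOOP_N; i += WIDTH) {
        /* These four statements stand in for one 4-wide SIMD operation. */
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
        iterations++;
    }
    return iterations;
}
```

With 1000 elements and a width of four, the function executes 250 iterations, i.e., one-fourth of the initial iteration count.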
However, other important prefetching opportunities are nevertheless present in loops which are not selected for vectorization, when corresponding prerequisites for the vectorization mechanism are not fulfilled. Such loops may hereinafter be referred to as scalar loops. Scalar loops often involve loop bodies with multiple basic blocks or loops that are unrolled during separate phases of the compiling process. For this reason, the prefetching implementation integrated with a vectorization phase for vectorizable loops cannot be readily extended to deal with scalar loops.
For this reason, in many research activities, mathematical models have been developed in order to track and summarize how array memory locations are accessed by the various loop nests and program constructs of sophisticated applications. The implementation of corresponding modules may require significant effort and may also contribute to extended compilation times, wherein, however, the effectiveness is difficult to predict given the complexity in a memory sub-system and the complex interactions between software and the hardware components.
For example, V. Santhanam, E. Gornish, W. Hsu in “Data Prefetching on the HP PA-8000,” Proceedings of International Symposium on Computer Architecture (ISCA), pages 264-273, 1997 disclose a compiler data prefetching framework that targets array element accesses, requiring a complex process strategy. Jeanne Ferrante, Vivek Sarkar, W. Thrash in “On Estimating and Enhancing Cache Effectiveness,” Proceedings of Languages and Compilers for Parallel Computing, 4th International Workshop, pages 328-343, August 1991, and S. Ghosh, M. Martonosi, S. Malik in “Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior,” ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 21, Issue 4, pages 703-746, 1999 proposed two respective models to track and represent memory access patterns in programs to guide different memory optimizations, which include prefetching. Moreover, C. Luk, T. Mowry, “Compiler Based Prefetching for Recursive Data Structures,” ACM SIGOPS Operating Systems Review, Vol. 30, Issue 5, pages 222-233, 1996 disclose a research work for generating prefetches for recursive data structures accessed through pointer dereferences, however, without targeting array accesses or indirect accesses through indexed arrays.
In view of the situation described above, the present disclosure relates to efficient prefetching techniques on the basis of prefetch instructions while avoiding or at least reducing one or more of the problems identified above.