1. Field of the Invention
The present invention relates to techniques for improving computer system performance. More specifically, the present invention relates to a method and an apparatus for generating code for a helper-thread that prefetches data values for a main thread.
2. Related Art
As the gap between processor performance and memory performance continues to grow, prefetching is becoming an increasingly important technique for improving application performance. Currently, prefetching is most effective for memory streams where future memory addresses can be easily predicted. For such memory streams, software prefetching instructions are inserted into the machine code to prefetch data values into cache before the data values are needed. Such a prefetching scheme is referred to as “interleaved prefetching.”
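For purposes of illustration only, interleaved prefetching over a simple, predictable memory stream might look like the following C sketch. The function name `sum_interleaved` and the constant `PREFETCH_DISTANCE` are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin, used here as a stand-in for whatever prefetch instruction a given compiler would emit.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in elements; a real compiler would tune it. */
#define PREFETCH_DISTANCE 16

/* Sums an array while prefetching a fixed distance ahead in the same loop. */
double sum_interleaved(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* The future address is a trivial function of i, so it is cheap to form. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

The prefetch is interleaved with the computation itself, which is why its cost is paid by the same thread that performs the real work.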
Although successful for certain cases, interleaved prefetching tends to be less effective for two types of codes. The first type consists of codes with array subscripts that are complex but follow predictable patterns. Such codes often require more computation to determine the addresses of future loads and stores, and hence incur more overhead for prefetching. This overhead becomes even larger if such complex subscripts contain one or more other memory accesses. In this case, prefetches and speculative loads for the memory accesses are both required to form the base address of the prefetch candidate. If the data items targeted for prefetching are already in the cache, this overhead may actually increase execution time significantly instead of improving performance. In order to avoid such a penalty, modern production compilers often ignore prefetch candidates with complex subscripts or only prefetch data speculatively one or two cache lines ahead.
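As a rough sketch of the first type, consider a subscript that itself contains a memory access, such as `a[b[i]]`. The names `sum_indirect` and `PREFETCH_DISTANCE` are hypothetical, and the two-level prefetch distance is an assumption chosen only to make the extra work visible.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in elements. */
#define PREFETCH_DISTANCE 16

/* Sums a[b[i]]; each prefetch of a[] first needs an extra load from b[]. */
double sum_indirect(const double *a, const int *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Prefetch the index array further ahead so the load below hits. */
        if (i + 2 * PREFETCH_DISTANCE < n)
            __builtin_prefetch(&b[i + 2 * PREFETCH_DISTANCE], 0, 3);
        /* Extra load of b[i + D] performed only to form the prefetch address. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[b[i + PREFETCH_DISTANCE]], 0, 3);
        sum += a[b[i]];
    }
    return sum;
}
```

Every iteration now carries two bounds checks, one additional load, and two prefetches; if the data are already cached, all of that work is pure overhead.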
The second type consists of codes that involve pointer-chasing references. For such references, at least one memory address must be retrieved to get the memory address for the next loop iteration. This dependency eliminates the advantage of interleaved prefetching.
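The dependency can be seen in a minimal linked-list traversal, sketched below in C. The `struct node` layout and the function name `sum_list` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical singly linked list node. */
struct node {
    int value;
    struct node *next;
};

/* Sums a list; the address of each node is known only after loading its predecessor. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        /* Only one node ahead can be prefetched: reaching further ahead would
           require loading p->next->next, which is itself a load that may miss. */
        if (p->next != NULL)
            __builtin_prefetch(p->next, 0, 3);
        sum += p->value;
    }
    return sum;
}
```

Because the prefetch can run at most one node ahead of the use, it has little time to hide the memory latency.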
Various techniques have been proposed to handle the cases where interleaved prefetching is ineffective. For example, some researchers have proposed using a “jump-pointer” approach (see A. Roth and G. Sohi, “Jump-pointer prefetching for linked data structures,” Proceedings of the 26th International Symposium on Computer Architecture, May 1999.) Unfortunately, the jump-pointer approach requires analysis of the entire program, which may not be available at compile-time.
Other researchers have tried to detect the regularity of the memory stream at compile-time for Java applications (see Brendon Cahoon and Kathryn McKinley, “Data flow analysis for software prefetching linked data structures in Java,” Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, 2001.)
Yet other researchers have tried to detect the regularity of the memory stream with value profiling (see Youfeng Wu, “Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching,” Proceedings of the International Conference on Programming Language Design and Implementation, June 2002.) This technique requires significant additional steps related to compilation. Furthermore, the technique's accuracy depends on how closely the training and reference inputs match each other and on how many predictable memory streams exist in the program.
Recently developed chip multi-threading (CMT) architectures with shared caches present new opportunities for prefetching. In CMT architectures, another core (or logical processor) can be used to prefetch data into a shared cache for the main thread.
“Software scout threading” is a technique which performs such prefetching in software. During software scout threading, a scout thread executes in parallel with the main thread. The scout thread does not perform any real computation (except for necessary computations to form prefetchable addresses and to maintain approximately correct control flow), so the scout thread typically executes faster than the main thread. Consequently, the scout thread can prefetch data values into a shared cache for the main thread. (For more details on scout threading, please refer to U.S. Pat. No. 6,415,356, entitled “Method and Apparatus for Using an Assist Processor to Pre-Fetch Data Values for a Primary Processor,” by inventors Shailender Chaudhry and Marc Tremblay.)
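By way of illustration only, the relationship between a main-thread loop and the corresponding scout-thread loop might be sketched in C as follows. Thread creation and synchronization are deliberately omitted, and the names `main_loop` and `scout_loop`, as well as the returned prefetch count, are hypothetical and appear only to make the sketch checkable.

```c
#include <stddef.h>

/* Main-thread loop: performs the real computation on irregular addresses. */
double main_loop(const double *a, const int *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]] * a[idx[i]];
    return sum;
}

/* Scout loop: keeps the control flow and address computation of main_loop,
   drops the arithmetic, and prefetches each address the main loop will use.
   Returns the number of prefetches issued (for illustration only). */
size_t scout_loop(const double *a, const int *idx, size_t n)
{
    size_t issued = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[idx[i]], 0, 3);
        issued++;
    }
    return issued;
}
```

In an actual system, `scout_loop` would run on a separate core or logical processor that shares a cache with the thread running `main_loop`; because it does less work per iteration, it tends to run ahead.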
Software scout threading naturally handles the cases where interleaved prefetching is ineffective. For complex array subscripts, prefetching overhead is migrated to the scout thread. For pointer-chasing codes, software scout threading tries to speculatively load or prefetch values for instructions which actually cause a cache miss.
Unfortunately, software scout threading is not free. The process of launching the scout thread and operations involved in maintaining synchronization between the main thread and the scout thread can create overhead for the main thread. Such overhead must be considered by the compiler as well as the runtime system to determine whether scout threading is worthwhile. Furthermore, existing techniques for scout threading tend to generate redundant prefetches for cache lines that have already been prefetched. These redundant prefetches can degrade system performance during program execution.
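To make the redundancy concrete, the following C sketch issues at most one prefetch per cache line when walking a contiguous array, rather than one prefetch per element. It is a generic illustration and is not asserted to be the technique of any particular system; the function name `prefetch_once_per_line` and the 64-byte `CACHE_LINE_BYTES` value are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; the actual value is platform-specific. */
#define CACHE_LINE_BYTES 64

/* Walks an array and prefetches each cache line once.
   Returns the number of prefetches issued (for illustration only). */
size_t prefetch_once_per_line(const double *a, size_t n)
{
    size_t issued = 0;
    uintptr_t last_line = UINTPTR_MAX; /* sentinel: no line prefetched yet */
    for (size_t i = 0; i < n; i++) {
        uintptr_t line = (uintptr_t)&a[i] / CACHE_LINE_BYTES;
        if (line != last_line) {
            /* First touch of this line: issue the prefetch and remember it. */
            __builtin_prefetch(&a[i], 0, 3);
            last_line = line;
            issued++;
        }
    }
    return issued;
}
```

With 8-byte elements and 64-byte lines, a naive per-element scheme would issue eight prefetches per line, seven of which are redundant.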
Hence, what is needed is a method and an apparatus for reducing the impact of the above-described problems during software scout threading.