1. Field of the Invention
The present invention generally relates to data processing and, more specifically, to finding an optimal data layout. Even more specifically, the invention relates to a procedure for trying different data layouts and then using the resulting evidence to decide objectively which of these layouts is best.
2. Background Art
Cache and TLB misses often cause programs to run slowly. For example, it has been reported that the SPECjbb2000 benchmark spends 45% of its time stalled in misses on an Itanium processor (Ali-Reza Adl-Tabatabai, Richard L. Hudson, Mauricio J. Serrano, and Sreenivas Subramoney, “Prefetch Injection Based on Hardware Monitoring and Object Metadata”). Cache and TLB misses often stem from a mismatch between data layout and data access order. For example, FIG. 1 shows that the same data layout can degrade or improve runtime depending on how well it matches the program's data accesses, and on how expensive the layout is to apply. Results like those in FIG. 1 are typical: optimizations that improve performance for some programs often risk degrading performance for other programs. The results depend on tradeoffs between optimization costs and rewards, and on interactions between complex software and hardware systems.
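As a rough illustration of how a mismatch between data layout and data access order produces misses, the following Python sketch counts misses in a small simulated cache. Everything in it is an assumption made for illustration: the function name count_misses, the fully associative LRU cache model, the cache parameters, and the sixteen-object access trace. It does not model an Itanium processor and does not reproduce the results of FIG. 1.

```python
from collections import OrderedDict

def count_misses(layout, accesses, line_size=4, num_lines=2):
    """Count misses for an access trace under a given data layout.

    layout:   list of object ids; an object's index is its address.
    accesses: object ids in the order the program touches them.
    The cache is a hypothetical fully associative LRU cache with
    `num_lines` lines of `line_size` consecutive addresses each.
    """
    address = {obj: i for i, obj in enumerate(layout)}
    cache = OrderedDict()  # cache line number -> True, oldest entry first
    misses = 0
    for obj in accesses:
        line = address[obj] // line_size
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

# Illustrative trace: 16 objects touched with a stride of 4.
accesses = [r + 4 * c for r in range(4) for c in range(4)]

allocation_order = list(range(16))   # layout ignores the access order
access_order = list(accesses)        # layout follows the access order

misses_mismatched = count_misses(allocation_order, accesses)
misses_matched = count_misses(access_order, accesses)
print(misses_mismatched, misses_matched)
```

In this toy model the layout that follows the access order touches each cache line once, while the allocation-order layout misses on every access. The sketch ignores the cost of applying a layout, which the text above identifies as the other half of the tradeoff.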
Picking the best data layout a priori is difficult. It has been shown that even with perfect knowledge of the data access order, finding the optimal data placement, or approximating it within a constant factor, is NP-hard (Erez Petrank and Dror Rawitz, “The Hardness of Cache Conscious Data Placement”, In Principles of Programming Languages (POPL), 2002). Others have shown that finding a general affinity-hierarchy layout is also NP-hard (Chengliang Zhang, Chen Ding, Mitsunori Ogihara, Yutao Zhong, and Youfeng Wu, “A Hierarchical Model of Data Locality”, In Principles of Programming Languages (POPL), 2006). Practically, picking a data layout before the program starts would require training runs and command line arguments, both of which impede user acceptance.
Another option is to pick the best data layout automatically and online, while the program is executing. This also facilitates adapting to platform parameters and even to phases of the computation. The usual approach is to collect information about program behavior, then optimize the data layout, and possibly repeat these steps to adapt to phases. This approach requires tradeoffs: collecting useful information without slowing down the program too much, and transforming the information into the correct optimization decisions for the given platform. Getting these tradeoffs right requires careful tuning.
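The collect-then-optimize loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular published system: the helper names sample_profile, layout_from_profile, and adapt_layout are hypothetical, profiling is modeled as keeping every period-th access, and the "optimization" simply places the most frequently sampled objects first.

```python
from collections import Counter

def sample_profile(trace, period):
    """Keep every `period`-th access. A larger period makes profiling
    cheaper but the resulting profile coarser (hypothetical model)."""
    return trace[::period]

def layout_from_profile(objects, profile):
    """Place the most frequently sampled objects first. The sort is
    stable, so objects with equal counts keep their previous order."""
    counts = Counter(profile)
    return sorted(objects, key=lambda obj: -counts[obj])

def adapt_layout(objects, phases, period):
    """Repeat profile-then-optimize once per program phase and return
    the layout chosen after each phase."""
    layout = list(objects)
    history = []
    for trace in phases:
        profile = sample_profile(trace, period)
        layout = layout_from_profile(layout, profile)
        history.append(layout)
    return history

objects = ["a", "b", "c", "d"]
phases = [["c", "c", "d", "c"],   # phase 1 mostly touches c
          ["b", "b", "b", "a"]]   # phase 2 mostly touches b

full_profile = adapt_layout(objects, phases, period=1)
coarse_profile = adapt_layout(objects, phases, period=2)
print(full_profile)
print(coarse_profile)
```

With period=2 the sampled profile for the second phase never sees object "a", so the two runs end with different layouts. This is a small instance of the tradeoff noted above between profiling overhead and the quality of the information that drives the optimization.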
Driving a data layout optimization with profile information leads to a tightly woven profiler/optimizer co-design. For example, when a copying garbage collector performs the optimization, the collector design is geared towards using a profile. Published research prototypes usually compromise other design goals. For example, most locality-improving garbage collectors are sequential, compromising parallel scaling. In addition, such a design buries what is essentially a machine-learning problem in a complex system, out of reach for off-the-shelf machine learning solutions.