Many software compilers try to determine the best array layout in memory for optimal cache performance (see [C. Ancourt, D. Barthou, C. Guettier, F. Irigoin, B. Jeannet, J. Jourdan, J. Mattioli, Automatic data mapping of signal processing applications, Proc. Intnl. Conf. on Applic.-Spec. Array Processors, Zurich, Switzerland, pp.350-362, July 1997.], [M. Cierniak, W. Li, Unifying data and control transformations for distributed shared-memory machines, Proc. of the SIGPLAN'95 Conf. on Programming Language Design and Implementation, La Jolla, pp.205-217, February 1995.], [J. Z. Fang, M. Lu, An iteration partition approach for cache or local memory thrashing on parallel processing, IEEE Trans. on Computers, Vol.C-42, No.5, pp.529-546, May 1993.] for approaches based on compile-time analysis), but they do not try to directly reduce the storage requirements, as memory is allocated based on the available variable declarations. Said compilers choose, for each array as a whole, a column- or row-oriented storage order and potentially some offset in order to optimize the cache performance. Moreover, neither the number of transfers to large memories nor the number of cache misses is fully minimized this way. Consequently, there is a substantial penalty in power consumption and in overhead cycles (due to off-chip accesses).
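The per-array choice of storage order and offset described above can be illustrated with a minimal sketch in C. The function names and the element-based `offset` parameter are hypothetical and are not taken from any of the cited compilers; the sketch only shows how a single order chosen for the whole array maps an index pair to a linear position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, for illustration only.
 * Both map element (i, j) of a rows x cols array to a linear element
 * position, shifted by `offset` elements from the start of memory. */

/* Row-oriented order: elements of one row are adjacent in memory. */
size_t row_major_offset(size_t i, size_t j, size_t rows, size_t cols,
                        size_t offset)
{
    assert(i < rows && j < cols);
    return offset + i * cols + j;
}

/* Column-oriented order: elements of one column are adjacent in memory. */
size_t col_major_offset(size_t i, size_t j, size_t rows, size_t cols,
                        size_t offset)
{
    assert(i < rows && j < cols);
    return offset + j * rows + i;
}
```

In both cases the full rows x cols elements are reserved as declared; only the position of each element within that region changes, which is why such a choice by itself does not reduce the storage requirements.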
Most of the work related to efficient utilization of caches has been directed towards optimization of the throughput by means of (local) loop nest transformations to improve locality [J. Anderson, S. Amarasinghe, M. Lam, Data and computation transformations for multiprocessors, in 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.39-50, August 1995.], [N. Manjikian and T. Abdelrahman, Array data layout for reduction of cache conflicts, In Proc. of 8th Int'l conference on parallel and distributed computing systems, September, 1995.]. The most commonly known loop transformation for improving cache usage is loop blocking [M. Lam, E. Rothberg and M. Wolf, The cache performance and optimizations of blocked algorithms, In Proc. ASPLOS-IV, pp.63-74, Santa Clara, Calif., 1991.]. Such approaches focus on changing the execution order of the application in order to optimize cache usage; they leave the organization of the data in memory unchanged.
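As an illustration of loop blocking, the following is a minimal sketch in C of a blocked matrix transpose. The function name `transpose_blocked` and the `tile` parameter are assumptions introduced here and do not come from the cited work; a suitable tile size depends on the cache parameters and would have to be chosen for the target at hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: writes the transpose of the n x n row-major
 * matrix `src` into `dst`, visiting the iteration space in blocks of
 * tile x tile elements so that each block is finished before the next
 * one is started. */
void transpose_blocked(const double *src, double *dst, size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clamp the block bounds when n is not a multiple of tile. */
            size_t i_end = (ii + tile < n) ? ii + tile : n;
            size_t j_end = (jj + tile < n) ? jj + tile : n;
            for (size_t i = ii; i < i_end; ++i) {
                for (size_t j = jj; j < j_end; ++j) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```

Only the order in which the elements are visited differs from the unblocked loop nest; the addresses of `src` and `dst` elements are the same in both versions.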
Some work [P. R. Panda, N. D. Dutt and A. Nicolau, Memory data organization for improved cache performance in embedded processor applications, In Proc. ISSS-96, pp.90-95, La Jolla, Calif., November 1996.] has been reported on data organization for improved cache performance in embedded processors, but it does not take into account a power-oriented model. Instead, it tries to reduce cache conflicts in order to increase the throughput. Such an approach is based on merging or clustering arrays found in the application in order to optimize cache performance. [K. Pettis and R. C. Hansen, Profile guided code positioning, In ACM SIGPLAN'90 Conference on Programming Language Design and Implementation, pp.16-27, June 1990.] propose basic-block layout heuristics (they pack frequently used looping sequences) as well as procedure layout heuristics (they place procedures that call each other together to minimize the risk of cache interference). However, they do not explore this type of strategy for individual data items. Besides [D. C. Burger, J. R. Goodman and A. Kagi, The declining effectiveness of dynamic caching for general purpose multiprocessor, Technical Report, University of Wisconsin, Madison, 1261, 1995] and [D. N. Truong, F. Bodin and A. Seznec, Accurate data distribution into blocks may boost cache performance, Technical Report, IRISA, Rennes (France), 1996.], very little has been done to measure the impact of data organization (or layout) on the cache performance. Moreover, said approaches deal with dynamic caching and do not try to influence data organization statically. Hence, the actual cycle time (or the number of iterations) taken to obtain a good data layout is quite large.
One can conclude that, in the current state of the art, compile-time system-level optimization strategies that focus on cache utilization, aim at low power, are based on static data organization of the main memory, and fully exploit the knowledge of cache parameters and program characteristics are not found.