1. Field of the Invention
The present application relates in general to systems, methods and software for analyzing, parallelizing, debugging, optimizing and profiling computer systems and, more specifically, to capturing application characteristic data from the execution of a system or multiple systems, and modeling system behavior based on such data.
2. Description of the Related Art
Increasing demands to improve software efficiency in the face of ever-increasing system complexity have dictated the use of tools to evaluate target software operation, identify inefficiencies, suggest and/or implement improvements, and optimize software operation. Optimization and profiling tools often embed monitoring code into the target software under scrutiny and/or create a real or simulated (e.g., model) run-time environment that interacts with the target software to analyze its operation. See, for example, “StatCache: A Probabilistic Approach to Efficient and Accurate Data Locality Analysis”, E. Berg and E. Hagersten, Technical report 2003-58, Dept. of Information Technology, Uppsala University, Uppsala, Sweden, November 2003, and Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2004), Austin, Tex., USA, March 2004; “Low Overhead Spatial and Temporal Data Locality Analysis”, E. Berg and E. Hagersten, Technical report 2003-57, Dept. of Information Technology, Uppsala University, Uppsala, Sweden, November 2003; and “A Statistical Multiprocessor Cache Model”, E. Berg, H. Zeffer, and E. Hagersten, Proceedings of the 2006 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2006), Austin, Tex., USA, March 2006, each of which is incorporated herein by reference in its entirety.
Other publications applicable to the related technology include:
    “STATSHARE: A Statistical Model for Managing Cache Sharing via Decay”, Pavlos Petoumenos, Georgios Keramidas, Håkan Zeffer, Stefanos Kaxiras, and Erik Hagersten, in the 2006 Workshop on Modeling, Benchmarking and Simulation, held in conjunction with the 33rd Annual International Symposium on Computer Architecture, Boston, Mass., USA, June 2006.
    “Modeling Cache Sharing on Chip Multiprocessor Architectures”, Pavlos Petoumenos, Georgios Keramidas, Håkan Zeffer, Erik Hagersten, and Stefanos Kaxiras, in Proceedings of the 2006 IEEE International Symposium on Workload Characterization, San Jose, Calif., USA, 2006.
An ideal profiling tool should have low run-time overhead and high accuracy, should be easy and flexible to use, and should provide the user with intuitive and easily interpreted information. Low run-time overhead and high accuracy are both needed to locate bottlenecks efficiently with a short turn-around time, and the ease-of-use requirement excludes methods that need cumbersome experimental setups or special compilation procedures.
Unfortunately, it is hard to combine all of the above requirements in a single method. For example, methods based on hardware counters usually have a very low run-time overhead, but their flexibility is limited because hardware parameters such as cache and TLB sizes are fixed by the host computer system. Simulators, on the other hand, are very flexible but are usually slow. At worst, they may force the use of reduced data sets or otherwise unrepresentative experimental setups that give misleading results.
There are a variety of methods for performing cache behavior studies. These include simulation, hardware monitoring, statistical methods and compile-time analysis. Compile-time analysis tools [35][8] estimate cache miss ratios by statically analyzing the code and determining when cache misses occur. The major advantage of compile-time analysis is that it does not require the program to be executed, and it can potentially be parameterized in terms of workloads, etc. Its drawback is that it is limited to relatively well-structured codes where, for example, loop limits are known at compile time.
Cache simulators may be driven by instrumented code [13, 14, 20, 21, 23, 26, 27], instrumented at either the source code [17] or the machine code level, or the cache simulator may be incorporated in a full system simulator [24][22]. Their major limitation is their large slowdown. Simulation-based analysis can possibly be combined with sampling (see below) to reduce the runtime overhead.
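By way of illustration only, a trace-driven cache simulator of the general kind referred to above may be sketched as follows. The sketch assumes an LRU-replaced, set-associative cache and a trace consisting of a plain sequence of byte addresses; the names SetAssociativeCache and simulate, and all parameter values, are hypothetical and are not taken from any of the cited tools.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Illustrative trace-driven model of an LRU set-associative cache."""

    def __init__(self, num_sets, ways, line_size):
        self.num_sets = num_sets      # number of sets in the cache
        self.ways = ways              # associativity (lines per set)
        self.line_size = line_size    # cache line size in bytes
        # One ordered dict per set; insertion order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Model one memory access; return True on a hit, False on a miss."""
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        cache_set = self.sets[index]
        self.accesses += 1
        if tag in cache_set:
            cache_set.move_to_end(tag)        # mark as most recently used
            return True
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)     # evict the least recently used line
        cache_set[tag] = True
        return False

    def miss_ratio(self):
        return self.misses / self.accesses if self.accesses else 0.0


def simulate(trace, num_sets=64, ways=4, line_size=64):
    """Feed every address of the trace to the model and return the miss ratio."""
    cache = SetAssociativeCache(num_sets, ways, line_size)
    for address in trace:
        cache.access(address)
    return cache.miss_ratio()
```

Because every memory access of the target program passes through access(), the cost of such a model grows with the full length of the trace, which is consistent with the large slowdown noted above.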
Cache-sampling techniques include set sampling and time sampling. In time sampling, a cache model simulates contiguous sub-traces taken from the complete access trace. This is explored in papers [11, 15, 18, 19, 36]. It works well for smaller caches, but the need for long warm-up periods makes time sampling less suitable for large caches. The problem of selecting statistically representative samples is explored in Perelman et al. [32]. Set sampling is another approach, where only a fraction of the sets in a set-associative cache is simulated [11, 18]. It generally suffers from poor accuracy and can only be used as a rough estimate.
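The two sampling techniques may be illustrated with the following minimal sketch, which assumes an LRU set-associative cache model and a trace given as a list of byte addresses. The function names, the cold start of each time sample, and the fixed warm-up count are assumptions made for illustration and do not reproduce the methods of the cited papers.

```python
from collections import OrderedDict


def simulate(trace, num_sets, ways, line_size, sampled_sets=None, warmup=0):
    """Return (misses, accesses) for an LRU set-associative cache model.

    If sampled_sets is given, only accesses mapping to those set indices are
    modeled (set sampling). The first `warmup` modeled accesses update the
    cache state but are excluded from the returned counts.
    """
    sets = {}
    misses = accesses = seen = 0
    for address in trace:
        line = address // line_size
        index = line % num_sets
        if sampled_sets is not None and index not in sampled_sets:
            continue                          # set is not part of the sample
        lru = sets.setdefault(index, OrderedDict())
        hit = line in lru
        if hit:
            lru.move_to_end(line)             # mark as most recently used
        else:
            if len(lru) >= ways:
                lru.popitem(last=False)       # evict least recently used line
            lru[line] = True
        seen += 1
        if seen > warmup:                     # count only after warm-up
            accesses += 1
            misses += 0 if hit else 1
    return misses, accesses


def set_sampled_miss_ratio(trace, num_sets, ways, line_size, sampled_sets):
    """Estimate the miss ratio from a subset of the cache sets."""
    misses, accesses = simulate(trace, num_sets, ways, line_size,
                                sampled_sets=set(sampled_sets))
    return misses / accesses if accesses else 0.0


def time_sampled_miss_ratio(trace, num_sets, ways, line_size,
                            sample_len, period, warmup=0):
    """Estimate the miss ratio from contiguous sub-traces of sample_len
    accesses taken every `period` accesses, each simulated from a cold cache."""
    total_misses = total_accesses = 0
    for start in range(0, len(trace), period):
        sub_trace = trace[start:start + sample_len]
        misses, accesses = simulate(sub_trace, num_sets, ways, line_size,
                                    warmup=warmup)
        total_misses += misses
        total_accesses += accesses
    return total_misses / total_accesses if total_accesses else 0.0
```

In this sketch each time sample starts with an empty cache, so the accesses spent on warm-up are simulated but contribute nothing to the estimate. The larger the modeled cache, the more accesses are needed to warm it, which is the cost referred to above.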
More recently, sampling guided by phase detection has been proposed [31, 37]. The idea is based on the observation that most applications pass through different phases during their execution. Within each phase, the system behaves in a fairly invariant (often repetitive) way. Guided by phase detection algorithms, very sparse samples can still provide a representative picture of the behavior of the entire execution. While most work on phase-guided sampling has targeted detailed pipeline simulation, similar techniques could also be applied to memory system modeling. Cutting down the number of samples for time sampling of caches could turn out to be especially valuable, since warming large caches requires so many memory operations per sample. Phase detection could also work well together with our tool: it could guide us to sample more or less often during the execution, which could cut back on our runtime overhead further. The fact that we do not need to warm the caches before our model is valid further speaks in our favor.
Hardware counters are available on most modern computers. Events that can be counted include L1 and L2 cache misses, coherence misses and the number of stall cycles. Examples of their use include DCPI [1], which uses advanced hardware support to collect detailed information for the programmer, PAPI [6], which is a common programming interface for accessing hardware monitoring aids, and several commercial tools [12, 16]. Histogramming and tracing hardware may be used to detect, for example, cache conflicts [30] and to locate problem areas [7]. The main limitations of these approaches are that only architectural parameters realized in the hardware may be studied, and that it can be hard to capture entities not directly present in the hardware, such as spatial locality. Trap-driven trace generation has also been suggested [34]. It can trace unmodified code, but requires OS modification.
Other approaches to describing and quantifying memory behavior include the concept of data streams or strides. Information about data streams can be used to guide prefetching [9][10] and to help choose between optimizations such as tiling, prefetching and padding [29]. Abstract cross-platform models for analyzing and visualizing cache behavior exist [25, 5, 38], mostly based on a reuse distance definition similar to the stack distance [28].
    [1] J. Anderson, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S. Leung, D. Sites, M. Vandevoorde, C. Waldspurger, and W. Weihl. Continuous profiling: Where have all the cycles gone? ACM Transactions on Computer Systems, 1997.
    [2] E. Berg and E. Hagersten. SIP: Tuning through Source Code Interdependence. In Proceedings of the 8th International Euro-Par Conference (Euro-Par 2002), pages 177-186, Paderborn, Germany, August 2002.
    [3] E. Berg and E. Hagersten. StatCache: Low-Overhead Spatial and Temporal Data Locality Analysis. Technical report 2003-57, Department of Information Technology, Uppsala University, Sweden, 2003.
    [4] E. Berg and E. Hagersten. StatCache: A probabilistic approach to efficient and accurate data locality analysis. In Proceedings of International Symposium on Performance Analysis of Systems and Software, 2004.
    [5] K. Beyls, E. D'Hollander, and Y. Yu. Visualization enables the programmer to reduce cache misses. In Proceedings of Conference on Parallel and Distributed Computing and Systems, 2002.
    [6] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application tuning using hardware counters. In Proceedings of Supercomputing, 2000.
    [7] B. Buck and J. Hollingsworth. Using hardware monitors to isolate memory bottlenecks. In Proceedings of Supercomputing, 2000.
    [8] C. Cascaval and D. A. Padua. Estimating cache misses and locality using stack distances. In Proceedings of International Conference on Supercomputing, 2003.
    [9] T. M. Chilimbi. Efficient representations and abstractions for quantifying and exploiting data reference locality. In SIGPLAN Conference on Programming Language Design and Implementation, pages 191-202, 2001.
    [10] T. M. Chilimbi. Dynamic hot data stream prefetching for general-purpose programs. In PLDI, 2002.
    [11] T. M. Conte, M. A. Hirsch, and W. W. Hwu. Combining trace sampling with single pass methods for efficient cache simulation. IEEE Transactions on Computers, 47(6):714-720, 1998.
    [12] Intel Corporation. Intel VTune Analyzers. http://www.intel.com/software/products/vtune/.
    [13] L. DeRose, K. Ekanadham, and J. K. Hollingsworth. Sigma: A simulator infrastructure to guide memory analysis. In Proceedings of Supercomputing, 2002.
    [14] A. Eustace and A. Srivastava. ATOM: A flexible interface for building high performance program analysis tools. In USENIX Winter, pages 303-314, 1995.
    [15] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: a compiler framework for analyzing and tuning memory behavior. ACM Transactions on Programming Languages and Systems, 21(4):703-746, 1999.
    [16] M. Itzkowitz, B. J. N. Wylie, C. Aoki, and N. Kosche. Memory profiling using hardware counters. In Proceedings of Supercomputing, 2003.
    [17] J. Mellor-Crummey, R. Fowler, and D. Whalley. Tools for application-oriented tuning. In Proceedings of the 2001 ACM International Conference on Supercomputing, 2001.
    [18] R. E. Kessler, M. D. Hill, and D. A. Wood. A comparison of trace-sampling techniques for multi-megabyte caches. IEEE Transactions on Computers, 43(6):664-675, 1994.
    [19] S. Laha, J. A. Patel, and R. K. Iyer. Accurate low-cost methods for evaluation of cache memory systems. IEEE Transactions on Computers, 1988.
    [20] J. R. Larus and E. Schnarr. EEL: Machine-independent executable editing. In SIGPLAN Conference on Programming Language Design and Implementation, pages 291-300, 1995.
    [21] A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15-26, 1994.
    [22] M. Rosenblum, E. Bugnion, S. Devine, and S. Herrod. Using the SimOS machine simulator to study complex systems. ACM Transactions on Modeling and Computer Simulation, 7:78-103, 1997.
    [23] J. Maebe, M. Ronsse, and K. De Bosschere. DIOTA: Dynamic instrumentation, optimization and transformation of applications. In Compendium of Workshops and Tutorials, held in conjunction with International Conference on Parallel Architectures and Compilation Techniques, September 2002.
    [24] P. Magnusson, F. Larsson, A. Moestedt, B. Werner, F. Dahlgren, M. Karlsson, F. Lundholm, J. Nilsson, P. Stenström, and H. Grahn. SimICS/sun4m: A virtual workstation. In Proceedings of the Usenix Annual Technical Conference, pages 119-130, 1998.
    [25] G. Marin and J. Mellor-Crummey. Cross-architecture predictions for scientific applications using parameterized models. In Proceedings of Joint International Conference on Measurement and Modeling of Computer Systems, pages 2-13, New York, N.Y., June 2004.
    [26] M. Martonosi, A. Gupta, and T. Anderson. Memspy: Analyzing memory system bottlenecks in programs. In Proceedings of International Conference on Modeling of Computer Systems, pages 1-12, 1992.
    [27] M. Martonosi, A. Gupta, and T. E. Anderson. Tuning memory performance of sequential and parallel programs. IEEE Computer, 28(4):32-40, 1995.
    [28] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78-117, 1970.
    [29] T. Mohan, B. R. de Supinski, S. A. McKee, F. Mueller, A. Yoo, and M. Schultz. Identifying and exploiting spatial regularity in data memory access. In Proceedings of Supercomputing, 2003.
    [30] L. Noordergraaf and R. Zak. SMP system interconnect instrumentation for performance analysis. In Proceedings of Supercomputing, 2002.
    [31] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder. Using SimPoint for accurate and efficient simulation. In Proceedings of SIGMETRICS, 2003.
    [32] E. Perelman, G. Hamerly, and B. Calder. Picking statistically valid and early simulation points. In Proceedings of Parallel Architectures and Compilation Techniques, 2003.
    [33] SPEC. Standard Performance Evaluation Corporation. http://www.spec.org/.
    [34] R. Uhlig, D. Nagle, T. N. Mudge, and S. Sechrest. Trap-driven simulation with Tapeworm II. In Proceedings of Architectural Support for Programming Languages and Operating Systems, pages 132-144, 1994.
    [35] X. Vera and J. Xue. Let's study whole-program cache behaviour analytically. In Proceedings of 8th International Symposium on High-Performance Computer Architecture, 2002.
    [36] D. A. Wood, M. D. Hill, and R. E. Kessler. A model for estimating trace-sample miss ratios. ACM SIGMETRICS Performance Evaluation Review, 19(1), May 21-24, 1991.
    [37] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of International Symposium on Computer Architecture, 2003.
    [38] Y. Zhong, S. G. Dropsho, and C. Ding. Miss rate prediction across all program inputs. In Proceedings of Parallel Architectures and Compilation Techniques, 2003.