1. Field of the Invention
The present invention relates in general to an improved data processing system and method and in particular to an improved data processing system and method for monitoring the performance of a data processing system. Still more particularly, the present invention relates to an improved method and system for measuring the latency of data requests within a data processing system.
2. Description of the Related Art
It is well known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multi-processor (MP) computer systems can be designed with a number of different topologies, of which various ones may be better suited for particular applications depending upon the performance requirements and software environment of each application. One of the most common MP computer topologies is a symmetric multi-processor (SMP) configuration in which multiple processors share common resources, such as a system memory and input/output (I/O) subsystem, which are typically coupled to a shared system interconnect. Such computer systems are said to be symmetric because all processors in an SMP computer system ideally have the same access latency with respect to data stored in the shared system memory.
Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability. As a result, an MP computer system topology known as non-uniform memory access (NUMA) has emerged as an alternative design that addresses many of the limitations of SMP computer systems at the expense of some additional complexity. A typical NUMA computer system includes a number of interconnected nodes that each include one or more processors and a local "system" memory.
Such computer systems are said to have a non-uniform memory access because each processor has lower access latency with respect to data stored in the system memory at its local node than with respect to data stored in the system memory at a remote node. NUMA systems can be further classified as either non-coherent or cache coherent, depending upon whether or not data coherency is maintained between caches in different nodes.
The complexity of cache coherent NUMA (CC-NUMA) systems is attributable in large measure to the additional communication required for hardware to maintain data coherency not only between the various levels of cache memory and system memory within each node but also between cache and system memories in different nodes. NUMA computer systems do, however, address the scalability limitation of conventional SMP computer systems, since each node within a NUMA computer system can be implemented as a smaller SMP system. Thus, the shared components within each node can be optimized for use by only a few processors, while the overall system benefits from the availability of larger scale parallelism while maintaining relatively low latency.
Measuring the latency associated with data requests within a data processing system, such as an MP computer system, is important for monitoring the performance of that system. However, the ability to gather performance data for MP computer systems, such as NUMA computer systems, is typically limited because a data request may be satisfied by one of several levels of cache or else by memory. In addition, data requests frequently pass through multiple chips within a NUMA computer system before data is returned. Moreover, logical partitioning of the processors and data storage areas of a data processing system complicates measuring performance. Even within a single processor system, a processor can access data from multiple levels of cache and memory, and measuring the latency of each data request to each level is difficult. Furthermore, two data requests to the same level of cache are typically not guaranteed to encounter the same set of conditions, and the conditions along the data path used to retrieve the requested data typically affect the latency of that path.
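By way of illustration only, the variability described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular measurement mechanism: the level names, the cycle counts, and the function name `summarize` are all hypothetical and are chosen only to show that requests served by the same level of the memory hierarchy need not exhibit the same latency.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (level that satisfied the request, latency in cycles).
# Level names and cycle counts are illustrative assumptions, not measurements.
SAMPLES = [
    ("L1", 3), ("L1", 4),
    ("L2", 12), ("L2", 20),
    ("local_memory", 150), ("local_memory", 210),
    ("remote_memory", 400), ("remote_memory", 650),
]


def summarize(samples):
    """Group latency samples by serving level and report min, mean, max, count."""
    by_level = defaultdict(list)
    for level, cycles in samples:
        by_level[level].append(cycles)
    return {
        level: {
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
            "count": len(values),
        }
        for level, values in by_level.items()
    }


if __name__ == "__main__":
    # Print one summary line per level of the (assumed) memory hierarchy.
    for level, stats in summarize(SAMPLES).items():
        print(level, stats)
```

In this sketch the spread between the minimum and maximum latency at a given level stands in for the differing conditions that two requests to the same level may encounter along the data path.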
There is a need for a system-wide performance monitoring method that provides an overview of data processing system performance for both single processor and multiprocessor data processing systems that access data from a memory hierarchy. Further, there is a need to measure the performance of particular data request events within a data processing system.