In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises one or more central processing units (CPUs) and supporting hardware necessary to store, retrieve and transfer information, such as communication buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU or CPUs are the heart of the system. They execute the instructions which comprise a computer program and direct the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Sophisticated software at multiple levels directs a computer to perform massive numbers of these simple operations, enabling the computer to perform complex tasks. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster, and thereby enabling the use of software having enhanced function. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the throughput) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor(s). For example, if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Enormous improvements in clock speed have been made possible by reduction in component size and integrated circuitry, to the point where an entire processor, and in some cases multiple processors along with auxiliary structures such as cache memories, can be implemented on a single integrated circuit chip. Despite these improvements in speed, the demand for ever faster computer systems has continued, a demand which cannot be met solely by further reduction in component size and consequent increases in clock speed. Attention has therefore been directed to other approaches for further improvements in throughput of the computer system.
Without changing the clock speed, it is possible to improve system throughput by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this approach practical. However, one does not simply double a system's throughput by going from one processor to two. The introduction of multiple processors to a system creates numerous architectural problems. For example, the multiple processors will typically share the same main memory (although each processor may have its own cache). It is therefore necessary to devise mechanisms that avoid memory access conflicts, and assure that extra copies of data in caches are tracked in a coherent fashion. Furthermore, each processor puts additional demands on the other components of the system such as storage, I/O, memory, and particularly, the communications buses that connect various components. As more processors are introduced, these architectural issues become increasingly complex, scalability becomes more difficult, and there is greater likelihood that processors will spend significant time waiting for some resource being used by another processor.
All of these issues and more are known by system designers, and have been addressed in one form or another. While perfect solutions are not available, improvements in this field continue to be made.
One architectural approach that has gained some favor in recent years is the design of computer systems having discrete nodes of processors and associated local portions of addressable main memory, also known as distributed shared memory computer systems or non-uniform memory access (NUMA) computer systems. In a conventional symmetrical multi-processor (SMP) system, addressable main memory is designed as a single large data storage entity, which is equally accessible to all CPUs in the system. As the number of CPUs increases, there are greater bottlenecks in the buses and accessing mechanisms to such main memory. A NUMA system confronts this problem by dividing addressable main memory into subsets of discrete address ranges, each of which is physically associated with a respective CPU, or more typically, a respective group of CPUs. A subset of memory and associated CPUs and associated local hardware is sometimes called a “node”. A node typically has an internal memory bus providing direct access from a CPU to the memory subset (“local memory”) within the node. Indirect mechanisms, which are slower, exist to access portions of main memory located across node boundaries. Thus, while any CPU can still access any arbitrary addressable memory location, a CPU can access addresses in its own node faster than it can access addresses outside its node (hence, the term “non-uniform memory access”). By limiting the number of devices on the internal memory bus of a node, bus arbitration mechanisms and bus traffic can be held to manageable levels even in a system having a large number of CPUs, since most of these CPUs will be in different nodes. From a hardware standpoint, this means that a NUMA system architecture has the potential advantage of increased scalability.
A NUMA system provides inter-node access so that it has a single logical main memory, each location having a unique address. But inter-node access is relatively slow and burdensome of certain system resources. In order for a NUMA system to work efficiently, the data required by a task executing on a CPU should generally be stored in the local memory of the same node as the CPU.
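The cost of inter-node access can be illustrated with a simple weighted average. The following is a minimal sketch only: the function name and the latency figures are hypothetical, chosen for illustration, and do not describe any particular system.

```python
def average_access_time(local_fraction, local_ns, remote_ns):
    """Expected latency of one memory access, given the fraction served locally."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies in nanoseconds (assumptions, not measurements).
LOCAL_NS = 100.0
REMOTE_NS = 300.0

for fraction in (1.0, 0.9, 0.5, 0.0):
    latency = average_access_time(fraction, LOCAL_NS, REMOTE_NS)
    print(f"local fraction {fraction:.1f}: average {latency:.1f} ns")
```

Under these assumed figures, the average latency rises steadily as the fraction of accesses served from local memory falls, which is why keeping a task's data in the local memory of its own node matters.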
It is impractical to guarantee that this will always be the case, but NUMA systems generally have mechanisms which make it likely that data required by a CPU will be stored in the local memory. For example, dispatching mechanisms which always or preferentially dispatch tasks to consistent nodes for execution, and paging mechanisms which always or preferentially load memory pages from storage to the local node which caused the page fault, will tend to achieve this result. It can therefore be assumed that in a typical NUMA system environment, most of the data required by a thread executing on a processor will be available in the local node of the processor.
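The combined effect of consistent dispatching and local page placement can be sketched with a toy model. The class and method names below are hypothetical, and the round-robin choice of a task's first node is an assumption made only to keep the example small. It is not a description of any actual dispatcher or pager.

```python
class ToyNumaSystem:
    """Toy model: tasks stay on one node, and pages load into the faulting node."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.home_node = {}  # task id -> node the task is consistently dispatched to
        self.page_node = {}  # page id -> node whose local memory holds the page

    def dispatch(self, task_id):
        # First dispatch assigns a node round-robin (an assumption);
        # every later dispatch reuses that same node.
        if task_id not in self.home_node:
            self.home_node[task_id] = len(self.home_node) % self.num_nodes
        return self.home_node[task_id]

    def access(self, task_id, page_id):
        """Return True if the access is served from the task's local node."""
        node = self.dispatch(task_id)
        if page_id not in self.page_node:
            # Page fault: load the page into the node that caused the fault.
            self.page_node[page_id] = node
        return self.page_node[page_id] == node


system = ToyNumaSystem(num_nodes=2)
print(system.access("task_a", "page_1"))  # first touch by task_a
print(system.access("task_a", "page_1"))  # repeated access by task_a
print(system.access("task_b", "page_1"))  # access by a task on another node
```

In this model a task's accesses to pages it first touched are local, while a task on another node that reads the same page must cross a node boundary.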
Many complex computer systems use some form of monitoring software for monitoring the performance of the system. Monitoring may take the form of executing one or more monitoring threads which collect data, while the computer system executes other threads on behalf of user applications to perform useful work; this collected data is sometimes analyzed in real time, i.e., during the collection, but is often analyzed subsequently. Some monitoring threads may be used to monitor individual threads executing on behalf of user applications (“monitored threads”). Such monitoring threads may need to access state data of the monitored threads to collect the necessary data.
If a monitoring thread needs to collect state data from multiple monitored threads in a NUMA system, the monitored threads may be executing on different processors in different nodes (or be in wait states, having recently executed on different processors in different nodes). In this case, it is likely that the state data needed for data collection by the monitoring thread will reside in different local memories of different nodes. It is therefore likely that the monitoring thread will perform a substantial number of inter-node data accesses to collect the required data. A substantially even or random distribution of monitored threads, which may be common in the case of some monitoring threads, makes monitoring data collection very inefficient in a typical NUMA system.
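The inefficiency described above can be made concrete by counting how many state-data reads cross a node boundary. This is a minimal sketch under stated assumptions: the function names are hypothetical, each monitored thread's state data is assumed to reside in the node where the thread last executed, and monitored threads are assumed to be spread evenly across nodes.

```python
def even_placement(num_nodes, num_threads):
    """Assign monitored threads to nodes in round-robin order (assumed even spread)."""
    return [thread_index % num_nodes for thread_index in range(num_threads)]


def remote_fraction(monitor_node, monitored_thread_nodes):
    """Fraction of state-data reads that must cross a node boundary."""
    if not monitored_thread_nodes:
        return 0.0
    remote = sum(1 for node in monitored_thread_nodes if node != monitor_node)
    return remote / len(monitored_thread_nodes)


# A single monitoring thread running in node 0 reads state from every monitored thread.
for num_nodes in (1, 2, 4, 8):
    placement = even_placement(num_nodes, num_threads=64)
    fraction = remote_fraction(0, placement)
    print(f"{num_nodes} node(s): {fraction:.3f} of reads are inter-node")
```

With an even spread over N nodes, roughly (N - 1)/N of the reads are inter-node in this model, so the share of slow accesses grows as the system gains nodes.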
As programs grow in size and complexity, and NUMA systems also become larger, the task of collecting and analyzing performance data increases in difficulty. A need exists for improved techniques for collecting data from monitored programs, and in particular from programs being executed on computer systems employing a NUMA design.