1. Field of the Present Invention
The present invention generally relates to the field of computer systems and more particularly to the monitoring of memory performance in a non-uniform memory architecture system.
2. History of Related Art
The use of multiple processors to improve the performance of a computer system is well known. In a typical arrangement, a plurality of processors are coupled to a system memory via a common bus referred to herein as the system or local bus. The use of a single bus ultimately limits the ability to improve performance by adding processors because, after a certain point, the limiting factor in the performance of a multiprocessor system is the bandwidth of the system bus. The system bus bandwidth is typically saturated after a relatively small number of processors have been attached to the bus. Incorporating additional processors beyond this number generally results in little if any performance improvement.
Distributed memory systems have been proposed and implemented to combat the bandwidth limitations of single bus systems. In a distributed memory system, two or more single bus systems referred to as nodes are connected to form a larger system. Each node typically includes its own local memory. One example of a distributed memory system is referred to as a non-uniform memory architecture (NUMA) system. A NUMA system is comprised of multiple nodes, each of which may include its own processors, local memory, and corresponding system bus. The memory of each node is accessible to each other node via a high speed interconnect network that links the various nodes. The use of multiple system busses (one for each node) enables NUMA systems to employ additional processors without incurring the system bus bandwidth limitation experienced by single bus systems. Thus, NUMA systems are more suitably adapted for scaling than conventional systems.
In a NUMA system, the time required to access system memory is a function of the memory address because accessing memory local to a node is faster than accessing memory residing on a remote node. In contrast, access time is essentially independent of the memory address in conventional SMP designs. Software optimized for use on conventional machines may perform inefficiently on a NUMA system if the software generates a large percentage of remote memory accesses when executed on the NUMA system. The potential for performance improvement offered by scaleable NUMA systems may be partially offset or entirely negated if, for example, the paging scheme employed by the NUMA system allocates a code segment of the software to the physical memory of one node and a data segment that is frequently accessed by the processors of another node. Due to variations in memory architecture implementation, paging mechanisms, caching policies, program behavior, etc., tuning or optimizing of any given NUMA system is most efficiently achieved with empirically gathered memory transaction data. Accordingly, mechanisms designed to monitor memory transactions in NUMA systems are of considerable interest to the designers and manufacturers of such systems.
Accordingly, it is an objective of the present invention to provide a performance monitor configured to count and categorize memory transactions in a computer system. In one embodiment, the monitor is connected directly to the computer system's interconnect network. In an alternative embodiment, the monitor may be connected to the system bus of a node on the computer system. The monitor may be suitably implemented with commercially available programmable gate arrays and packaged as a circuit board that includes connector sockets suitable for permitting the monitor to tap into the interconnect network. In an embodiment in which the monitor is coupled to the interconnect network, the monitor may include an I/O interface for communicating with the computer system via a standard I/O bus such as a PCI bus. In an embodiment in which the monitor resides on a system bus, direct communication with the computer may be achieved via the system bus thereby eliminating the need for an I/O bus interface.
Broadly speaking, a first application of the invention emphasizing the ability to separately monitor concurrently executing programs contemplates a computer system comprised of a local node including at least one processor coupled to its local memory via a local bus of the local node. A remote node of the system includes at least one processor coupled to a memory local to the remote node via a local bus of the remote node. An interconnect network couples the remote node to the local node such that the processor of the local node can access memory local to the remote node and the processor of the remote node can access memory local to the local node. The system further includes a performance monitor including an interface coupled to the interconnect network and configured to extract, at a minimum, physical address information from a transaction traversing the interconnect network, a filter module adapted for associating the physical address with one of multiple memory blocks, and an address mapping module configured to associate the appropriate memory block with one or more access counters. The performance monitor is preferably configured such that each access counter is associated with a memory region owned by a program thereby providing means for counting memory transactions associated with the program.
The first application of the invention further contemplates a performance monitor that includes an interface, a filter module, and an address mapping module. The interface is suitable for coupling to an interconnect network of a computer system or to a system bus of a node within the computer system depending upon the location of the monitor. The interconnect network links a local node of the system with at least one remote node. The interface is configured to extract, at a minimum, physical address information from a transaction traversing the network or bus to which the monitor is coupled. In addition to physical address information, other pertinent information such as transaction type information and node identification information may be contained in and extracted from the transaction. The filter module associates the physical address with one of several memory blocks, where each memory block corresponds to a contiguous portion of the system's physical address space. The address mapping module associates the identified memory block with one or more access counters and increments each of the associated access counters where each access counter corresponds to one of multiple concurrently executing programs. The association between the selected memory block and the access counters is facilitated by a pointer field corresponding to each memory block.
In one embodiment of the performance monitor, the interface unit may be configured, such as by the appropriate setting of a direction selection bit in a performance monitor status register, to selectively monitor either incoming or outgoing transactions. In another suitable arrangement, the monitor is configured to monitor both incoming and outgoing transactions simultaneously. In one embodiment, the filter module includes a stage comprised of multiple region filters that are adapted to receive pertinent transaction information including the transaction's physical address information. Typically, each of the region filters is associated with a contiguous region of the system's physical address space. In response to receiving the pertinent information, each of the region filters outputs a signal that indicates whether the transaction fulfills a set of criteria corresponding to the filter. The pertinent information may include, for example, transaction type information and node identification information in addition to the transaction's physical address information. Correspondingly, the criteria for each filter may include transaction type criteria and node identification criteria as well as physical address criteria. In one embodiment, each region filter includes a match register and a mask register that cooperatively define the criteria corresponding to the filter. The programming of the region filter registers is preferably achieved via a programming interface that couples the registers of the performance monitor to a communication bus. In embodiments in which the monitor resides on the system's interconnect network, a standard I/O bus such as a PCI bus may be employed as the communication bus while, in embodiments in which the monitor resides on the system bus, the system bus itself may suitably provide the means for communication with the monitor.
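By way of illustration only, the cooperation of a match register and a mask register may be sketched in software as follows. The sketch assumes that the pertinent transaction information is packed into a single word and that a set bit in the mask register marks a bit position that is compared. The field widths, the packing order, and the names pack, RegionFilter, and NODE2_REGION are hypothetical and are not taken from any particular embodiment.

```python
from dataclasses import dataclass

# Hypothetical packing of the pertinent transaction information into one word:
# [node id: 4 bits][transaction type: 4 bits][physical address: 32 bits]
ADDR_BITS = 32
TYPE_BITS = 4


def pack(node_id: int, txn_type: int, address: int) -> int:
    """Combine node id, transaction type, and physical address into one key."""
    return (node_id << (ADDR_BITS + TYPE_BITS)) | (txn_type << ADDR_BITS) | address


@dataclass(frozen=True)
class RegionFilter:
    match: int  # bit values a transaction must carry in the compared positions
    mask: int   # assumed convention: a 1 bit marks a position that is compared

    def hits(self, key: int) -> bool:
        """Return True if the transaction fulfills the filter criteria."""
        return (key & self.mask) == (self.match & self.mask)


# Example filter: transactions from node 2, of any type, whose physical
# address lies in the contiguous region 0x4000_0000 through 0x4FFF_FFFF.
NODE2_REGION = RegionFilter(
    match=pack(2, 0x0, 0x4000_0000),
    mask=pack(0xF, 0x0, 0xF000_0000),  # compare node id and top address nibble
)
```

Because the transaction type bits of the mask are cleared in this example, the filter passes transactions of any type. Setting those bits would restrict the filter to a single transaction type.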
In the preferred embodiment, each memory region is further divided into one or more of the memory blocks. In this embodiment, region descriptors including a block number field indicating the number of the memory blocks in the region and a block size field indicating the size of each memory block are utilized. Each memory block is associated with a corresponding block counter adapted to increment if the transaction attributes (i.e., address, type, node id) match the corresponding region filter criteria and the transaction's address lies within the memory block corresponding to the block counter. In the preferred embodiment, each memory block is associated with a pointer field. The contents of the pointer field identify one or more access counters that are associated with the memory block. When a memory block counter is incremented, the address mapping module utilizes the pointer field to increment the access counter(s) associated with the memory block. The pointer fields and access counters provide a mechanism for accumulating transaction information from discontiguous physical memory regions into a single counter thereby providing means for counting transactions associated with a particular virtual memory space. In an embodiment suitable for signaling the system upon the occurrence of certain specified conditions, the performance monitor may suitably include an interrupt unit configured, in conjunction with an interrupt mask of the region descriptor, to issue a hardware interrupt if any of the memory block counters in the region reaches a threshold value.
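The relationship among region descriptors, block counters, pointer fields, and access counters may be illustrated with the following software sketch. It assumes that the region filter criteria have already been met, that blocks are of equal size within a region, and that each pointer field is encoded as a bitmask in which bit i selects access counter i. The bitmask encoding and the names Region and record are hypothetical; the description above does not specify how the pointer field is encoded.

```python
class Region:
    """Software stand-in for a region descriptor and its per-block state."""

    def __init__(self, base: int, block_size: int, block_count: int):
        self.base = base                # first physical address of the region
        self.block_size = block_size    # block size field
        self.block_count = block_count  # block number field
        self.block_counters = [0] * block_count
        # Assumed encoding: bit i of a pointer field selects access counter i.
        self.pointers = [0] * block_count

    def block_index(self, address: int):
        """Return the index of the block containing address, or None."""
        offset = address - self.base
        if 0 <= offset < self.block_size * self.block_count:
            return offset // self.block_size
        return None


def record(region: Region, access_counters: list, address: int) -> None:
    """Count one transaction against its block and the associated counters."""
    index = region.block_index(address)
    if index is None:
        return  # the address lies outside the region
    region.block_counters[index] += 1
    pointer = region.pointers[index]
    for i in range(len(access_counters)):
        if (pointer >> i) & 1:
            access_counters[i] += 1


# Blocks 0 and 3 are discontiguous but both feed access counter 0, while
# block 1 feeds access counter 1. Block 2 has no associated access counter.
region = Region(base=0x1000, block_size=0x100, block_count=4)
region.pointers[0] = 0b01
region.pointers[3] = 0b01
region.pointers[1] = 0b10
access_counters = [0, 0]
for addr in (0x1000, 0x1310, 0x1150, 0x1250, 0x2000):
    record(region, access_counters, addr)
```

After the five example transactions, access counter 0 has accumulated the two transactions directed to the discontiguous blocks 0 and 3, which is the accumulation behavior described above for a virtual memory space scattered across physical memory.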
The first application of the present invention still further contemplates a method of monitoring performance of a computer system. One or more programs are executed on a computer system that includes two or more nodes (including at least a local node and a remote node) coupled together via an interconnect network. Physical address information is then extracted from transactions traversing the interconnect network and associated with one of the memory blocks based upon predefined memory block boundaries. The identified memory block is then associated with at least one of a plurality of access counters. The appropriate access counters are then incremented. Preferably, the step of defining the memory blocks includes defining boundaries for a plurality of physical address regions and further defining the number and size of multiple memory blocks within each of the regions. In one embodiment, the method includes a step in which incoming or outgoing transactions are selected for monitoring, preferably by setting an appropriate bit in a status register of the performance monitor. The associating of the selected memory block with the access counters preferably includes interpreting a pointer field corresponding to the memory block, where the pointer field indicates which of the access counters are associated with the memory block. In one embodiment, the method further includes issuing an interrupt if any of the access counters exceeds a specified threshold value.
A second application of the present invention contemplates a performance monitor configured to count memory transactions and to issue an interrupt to the computer system if the monitor detects a specified number of transactions associated with a particular segment of the physical address space of the system. This embodiment of the invention includes an interface suitable for coupling to an interconnect network of a computer system and configured to extract, at a minimum, physical address information from a transaction traversing the interconnect network. The monitor further includes a filter module adapted for associating the extracted physical address with one of a plurality of memory blocks and, in response thereto, incrementing a memory block counter corresponding to the memory block. An interrupt unit of the monitor is configured to assert an interrupt if the block counter exceeds a predetermined value. In the same manner as the application of the invention discussed above, one embodiment of the interface unit is configurable to selectively monitor either incoming or outgoing transactions and the filter module preferably includes a plurality of region filters each comprising one or more of the memory blocks. In the preferred embodiment, the plurality of block counters are implemented with an array of random access memory devices such as an array of static RAMs. Each of the block counters is associated with a programmable interrupt disable bit operable to prevent the interrupt unit from asserting an interrupt corresponding to the associated block counter.
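The interaction between a block counter, the predetermined value, and the interrupt disable bit may be sketched as follows. The function name count_and_check and the use of Python lists in place of a static RAM array are hypothetical simplifications, and the sketch models only the decision to assert the interrupt, not its delivery.

```python
def count_and_check(counters: list, disable_bits: list, block: int,
                    threshold: int) -> bool:
    """Increment one block counter and report whether to assert an interrupt.

    The interrupt is asserted only if the counter exceeds the threshold and
    the interrupt disable bit for that block is clear.
    """
    counters[block] += 1
    return counters[block] > threshold and not disable_bits[block]


counters = [0, 0]
disable_bits = [False, True]  # interrupts from block 1 are suppressed

# Three transactions to each block with a threshold of two.
block0_results = [count_and_check(counters, disable_bits, 0, 2) for _ in range(3)]
block1_results = [count_and_check(counters, disable_bits, 1, 2) for _ in range(3)]
```

In this example only the third transaction to block 0 asserts the interrupt. Block 1 continues to count but never asserts one because its disable bit is set.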
In one embodiment useful for simulating operation of the system and for checking the design of the performance monitor, the monitor further includes a transaction generator coupled to the interconnect network and designed to issue specified remote memory transactions at specified intervals if the transaction generator is enabled. In one embodiment, the enabling of the transaction generator and the performance monitor are controlled by a common bit such that the transaction generator is enabled whenever the performance monitor is disabled. In one embodiment, the transaction generator is configurable to issue either incoming or outgoing transactions.
The second application of the present invention still further contemplates a computer system that includes a local node, at least one remote node, an interconnect network coupling the remote node to the local node, and a performance monitor. The performance monitor includes an interface unit configured to extract, at a minimum, physical address information from transactions on the interconnect network and a filter module that is designed to associate the transaction's physical address with one of a plurality of memory blocks and increment a block counter corresponding to the memory block. The monitor further includes an interrupt unit configured to assert an interrupt if the block counter exceeds a predetermined value. The filter module preferably includes a plurality of region filters that are adapted to receive a transaction's physical address information. Each of the region filters is associated with a memory region and each memory region is comprised of one or more of the memory blocks. The size and number of memory blocks within a given region are programmably alterable in the preferred embodiment. In one embodiment, the computer system is configured to respond to the interrupt by subdividing the memory blocks of the region associated with the interrupt into smaller memory blocks prior to obtaining additional performance monitor data thereby providing means for gathering increasingly detailed information about increasingly smaller portions of the physical address space. In another embodiment emphasizing dynamic performance improvement, the computer system operating software is configured to respond to the interrupt by migrating the contents of the memory block responsible for triggering the interrupt to physical address space located on a different node in an effort to find a physical home for the memory block contents that produces a minimum number of remote accesses.
The system may further include a transaction generator coupled to the interconnect network and operable to issue specified interconnect transactions at specified intervals if the performance monitor is enabled.
The second application of the present invention still further contemplates a method of monitoring performance of a computer system in which, initially, physical address boundaries are defined for a plurality of memory blocks. Physical address information is then extracted from transactions traversing an interconnect network of the computer system. The physical address is then associated with one of the memory blocks and a memory block counter corresponding to the memory block is then incremented. An interrupt is then asserted if the block counter exceeds a specified value. In a presently preferred embodiment, the step of defining the memory blocks includes defining one or more memory regions by programming one or more base address fields of corresponding region descriptors and dividing the memory region into the memory blocks by programming block sizes and block counts for each of the region descriptors. In one embodiment, the contents of the memory block responsible for the interrupt are migrated to a different node in response to the interrupt. In another embodiment, the memory block responsible for the interrupt is subdivided into smaller memory blocks in response to the interrupt and prior to performing additional monitoring.
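The subdivision of a memory block in response to an interrupt may be illustrated by the following sketch, which computes the base address, block size, and block count that software might program into a region descriptor so that the descriptor covers only the block responsible for the interrupt. The function name subdivide and the choice of an integral subdivision factor are hypothetical, and the sketch assumes that the block size is evenly divisible by that factor.

```python
def subdivide(base: int, block_size: int, hot_block: int, factor: int):
    """Return (base, block size, block count) covering only the hot block.

    base and block_size describe the current region descriptor, hot_block is
    the index of the block whose counter triggered the interrupt, and factor
    is the number of smaller blocks into which that block is divided.
    """
    if factor <= 0 or block_size % factor != 0:
        raise ValueError("block size must be evenly divisible by factor")
    new_base = base + hot_block * block_size
    return new_base, block_size // factor, factor


# A region of 0x1000-byte blocks starting at address 0: block 2 triggered
# the interrupt and is divided into four 0x400-byte blocks.
new_descriptor = subdivide(base=0x0, block_size=0x1000, hot_block=2, factor=4)
```

Repeating this step on each subsequent interrupt narrows the monitored address range, so that successive monitoring passes gather progressively finer-grained counts.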