1) Field of the Invention
The present invention relates to a technique for measuring the number of events having occurred in a processor or a computer system, including various components such as a processor and a memory controller, in order to measure performance of the processor or the computer system, or tune the performance of the same.
2) Description of the Related Art
In order to increase the processing speed of a computer, it is essential to grasp the behavior of the computer at the time of program execution.
A widely used technique for examining the behavior of a computer in detail is to measure the number of various events having occurred inside the computer. For this purpose, modern computers are often equipped with an internal counter for measuring the performance.
By using the counter for measuring the performance, it is possible to obtain, for example, the number of clocks taken for program execution, the number of accesses to the memory during execution of the program, and the like. These kinds of behavior are all handled as “events”, and the number of events having occurred is recorded by incrementing a counter (a count value) by one each time an event occurs. More concretely, when the number of execution clocks is measured as in the former example, “elapse of one clock” is handled as one event, so that an event occurs each time one clock elapses and the counter is incremented by one each time. Events are of various kinds and cover all kinds of behavior necessary to measure the performance, such as execution of a branch instruction, a cache miss, a memory read, a memory write and the like.
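The counting scheme described above can be modeled in software as follows. This is only a minimal sketch for illustration: the class name `EventCounter`, the event names and the trace are hypothetical, and an actual performance measuring counter is a hardware register rather than a Python object.

```python
from collections import Counter


class EventCounter:
    """Hypothetical software model of a performance measuring counter."""

    def __init__(self):
        # One count value per kind of event; every count starts at zero.
        self.counts = Counter()

    def record(self, event):
        # Increment the count value by one each time an event occurs.
        self.counts[event] += 1


# Hypothetical trace: "clock" stands for the elapse of one clock.
trace = ["clock", "memory_read", "clock", "cache_miss", "clock", "memory_write"]

counter = EventCounter()
for event in trace:
    counter.record(event)

clock_count = counter.counts["clock"]
cache_miss_count = counter.counts["cache_miss"]
```

In this model the number of execution clocks is simply the count value recorded for the "clock" event, and an event that never occurred reads as zero.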
A recent computer system is often equipped with a counter in each part so that the performance can be measured. For example, the Pentium Pro, which is a processor of Intel, is equipped with two counters for measuring the performance, and (1) selection of events, (2) reading of a value, (3) setting of a value, (4) clearing of a value, (5) interruption at the time of overflow, and the like can be set separately for each of the counters. Further, a similar counter is provided in the memory controller so that events in the vicinity of the memory bus can be measured.
With such a counter, it is possible not only to see the total number of all events, but also to see fluctuation with time in the number of events having occurred by sampling the measured value of the counter at each predetermined time. It is thereby possible to obtain various information, such as whether the number of events having occurred is stable or not, the degree of fluctuation in the number of events having occurred, whether the occurrence of events concentrates in processes in a specific time zone or not, and the like. These kinds of information can be used not only for measurement of the performance but also for tuning of the system.
The above performance measuring counter can be used not only to measure a result of execution of a program but also to dynamically optimize a program. Namely, a method is applied in which the OS (Operating System) or the program refers to a value of the counter during execution of the program and varies the operation of the program accordingly, thereby dynamically optimizing the program.
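Sampling of the counter at each predetermined time can be illustrated as follows. This is a minimal sketch under the assumption that the counter is cumulative and is never cleared between samples; the function names and the sample values are hypothetical.

```python
def per_interval_counts(samples):
    """Return the number of events in each sampling interval.

    `samples` holds cumulative counter readings taken at each
    predetermined time, so each interval count is the difference
    between two successive readings.
    """
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def fluctuation(deltas):
    """Spread between the busiest and the quietest interval."""
    return max(deltas) - min(deltas)


# Hypothetical cumulative readings of one counter.
samples = [0, 10, 20, 95, 100]

deltas = per_interval_counts(samples)
busiest_interval = deltas.index(max(deltas))
spread = fluctuation(deltas)
```

Here the total value alone (100 events) does not reveal that most of the events concentrate in the third interval, whereas the per-interval counts do.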
Consider here a case in which memory accesses are dynamically optimized by an OS in a parallel computer of a distributed shared storage type as shown in, for example, FIG. 14. A parallel computer of a distributed shared storage type is configured by communicably connecting a plurality of nodes 100 with one another over an interconnection network 200. Each of the nodes 100 comprises a CPU (Central Processing Unit) 101, a memory 102, a memory controller 103 and a network interface 104.
The CPU 101 of each of the nodes 100 can freely access not only the memory 102 in the node 100 to which the CPU 101 itself belongs but also the memory 102 in another node 100. When seen from a certain CPU 101, there are a memory 102 that the CPU 101 can access at a high speed and a memory 102 that the CPU 101 takes a longer time to access. Assume here that, in the above parallel computer, the memory controller 103 in each of the nodes 100 functions as the above performance measuring counter (not shown).
The OS assigns the memory 102 to a program when the program issues a memory request. When assigning the memory 102, the OS cannot know how the memory 102 will be used. For this reason, the OS cannot determine whether the memory 102 belonging to the same node 100 as the CPU 101 or the memory 102 belonging to another node 100 should be assigned. It is generally desirable to assign the memory 102 in the same node 100, but this is not so in some cases.
The two performance measuring counters 1 and 2 count the following events, respectively:
Counter 1: incrementing the count value by one when the CPU 101 in the same node 100 accesses the memory 102;
Counter 2: incrementing the count value by one when the CPU 101 in another node 100 accesses the memory 102.
A program is executed in this state, and the values of the counters are checked at each predetermined time. When the count value of the counter 1 is close to zero and the count value of the counter 2 is sufficiently large, it is known that the CPU 101 requiring the memory 102 belongs to another node 100. In such a case, the OS changes the assignment of the memory 102 and moves the data close to the CPU 101 that actually requires the memory 102. When seen from the CPU 101, most of the data required by the CPU 101 is then on the memory 102 that the CPU 101 can access at a high speed, so that the speed of execution of the program is increased.
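The determination made by the OS from the counters 1 and 2 can be sketched as follows. The thresholds `local_max` and `remote_min` are assumptions introduced only to make “close to zero” and “sufficiently large” concrete; the text above does not specify their values, and the function name is hypothetical.

```python
def should_move_data(counter1, counter2, local_max=2, remote_min=100):
    """Decide whether data should be moved to another node.

    counter1: accesses by the CPU in the same node (counter 1).
    counter2: accesses by the CPU in another node (counter 2).
    local_max and remote_min are assumed thresholds, not values
    taken from the description.
    """
    return counter1 <= local_max and counter2 >= remote_min


# Checked at each predetermined time with hypothetical count values.
mostly_remote = should_move_data(counter1=0, counter2=5000)
mostly_local = should_move_data(counter1=4000, counter2=3)
mixed = should_move_data(counter1=2500, counter2=2600)
```

Only the first case, in which almost all accesses come from another node, leads to a change of the assignment in this sketch; when accesses are mostly local or evenly mixed, the data is left where it is.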
In a computer such as Origin 2000 of SGI (Silicon Graphics, Inc.), a counter for totaling memory accesses is provided beside the memory for this purpose. This counter is provided for each page, and the OS refers to it at each predetermined time to measure the number of accesses to each page.
In the above known measuring method, the quantity of count values obtained by the counter becomes enormous when the sampling is performed for a long time or when the number of events having occurred in each memory block is measured over the entire memory space. For this reason, the known measuring method can be adapted to a large-sized computer for which a sufficient cost is allowed, but cannot be adapted to a small-sized computer whose cost must be reduced as much as possible and which therefore has to collect data while shifting a small number of counters.
Consider the above Origin 2000 of SGI, for example. Origin 2000 has a reference counter for 64 nodes per page. Assuming that the page size is 4 KB and that one retaining area (this area being sometimes referred to as a counter) for count values obtained by the counter is 32 bits, the counter requires a memory size of one-one hundred twenty eighth of the entire memory. This cannot be accommodated inside the LSI and thus has to be realized as an external memory. Securing an area for the counter in this way causes not only an increase in the physical quantity of hardware but also a problem that the design of the hardware becomes complex. In particular, this method cannot be realized in a relatively small-sized computer in which importance is given to the cost.
In the case of measurement using a counter of a small capacity, one of the following methods has heretofore been employed when the count values that are an object of measurement exceed the capacity of the counter: the count values are temporarily written onto the memory from the retaining area, or the count values obtained by the counter are overwritten in order in the retaining area.
In the former method, writing count values onto the memory incurs overhead, and control using software or complex hardware is necessary. In performance measurement, it is desirable to avoid unnecessary overhead in order to decrease measurement errors as much as possible. The former method is disadvantageous in this respect.
In the latter method, count values are simply overwritten in order, so that only the count values measured at times very close to when the counter is observed are left in the counter (retaining area). For this reason, information from which the performance as a whole can be determined (a significant count value necessary to determine the performance) cannot be obtained.
Japanese Laid-Open Publication No. 10-260869 discloses a method for efficiently managing a performance counter which samples performance of a processor using a hash table. However, although this method makes management of the performance counter efficient, it cannot decrease the physical quantity of the counter and, moreover, requires hardware for the management. This method therefore cannot be applied to a relatively small-sized computer in which importance is given to the cost.
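The drawback of the latter method can be illustrated with the following minimal sketch, in which a retaining area of a fixed number of slots is overwritten in order. The class name `OverwritingRetainingArea`, the capacity of three slots and the sample values are hypothetical and serve only as an illustration.

```python
class OverwritingRetainingArea:
    """Hypothetical retaining area whose slots are overwritten in order."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_slot = 0

    def store(self, count_value):
        # Overwrite the oldest slot once the area is full.
        self.slots[self.next_slot] = count_value
        self.next_slot = (self.next_slot + 1) % len(self.slots)

    def contents_oldest_first(self):
        # Rotate so that the slot written longest ago comes first.
        ordered = self.slots[self.next_slot:] + self.slots[:self.next_slot]
        return [value for value in ordered if value is not None]


# Hypothetical count values sampled at each predetermined time;
# the third sample is a burst of events.
samples = [5, 7, 90, 6, 4, 5]

area = OverwritingRetainingArea(capacity=3)
for value in samples:
    area.store(value)

retained = area.contents_oldest_first()
```

When the area is observed after the last sample, only the three most recent count values remain, and the burst that would have been significant for determining the performance has already been overwritten.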