The present invention relates to a load testing apparatus used for the load test or fault diagnosis of a parallel processor system, a computer readable recording medium for recording a load test program, a fault diagnosis apparatus, and a computer readable recording medium for recording a fault diagnosis program. More particularly, this invention relates to a load testing apparatus, a computer readable recording medium for recording a load test program, a fault diagnosis apparatus, and a computer readable recording medium for recording a fault diagnosis program, which can produce a highly reliable test result and identify a defective point with rapidity.
In fields of science and technology such as atomic power, meteorology and aeronautics, a parallel processor system is required for arithmetically processing a vast quantity of data far exceeding the data processing capacity of a general-purpose mainframe computer. The parallel processor system is generally called a supercomputer, in which ultrahigh-speed arithmetic operation is realized by parallel processing of a plurality of processor elements interconnected through an inter-processor network (such as a crossbar network unit). The parallel processor system requires a specification capable of exhibiting at least a predetermined level of performance even in a state of high CPU (Central Processing Unit) utilization, i.e. under a heavy load. Therefore, a load testing apparatus for checking the performance under heavy load is indispensable for the design, development and performance evaluation of the parallel processor system. Also, the parallel processor system is required to have means and a method of identifying a defective point rapidly in case of a fault.
FIG. 32A is a block diagram showing a configuration of the conventional parallel processor system described above. A crossbar network unit 1 and five processor elements PE0 to PE4 making up the parallel processor system are shown in FIG. 32A. The processor elements PE0 to PE4 are arithmetic elements for executing the parallel computation in accordance with a parallel algorithm, and each include a transmission unit and a receiving unit (not shown) for transmitting and receiving packets (data), respectively. The crossbar network unit 1 is for interconnecting the processor elements PE0 to PE4 and includes a group of N×N (5×5 in the shown case) crossbar switches (not shown). The incoming line side of the crossbar network unit 1 is connected to the transmission unit (not shown) of the processor elements PE0 to PE4, respectively, and the outgoing line side thereof is connected to the receiving unit (not shown) of the processor elements PE0 to PE4, respectively.
For the parallel processor system described above, a load test is conducted for checking the performance under load. In the load test, packets are transmitted from a predetermined processor element of a source to a processor element of a destination and thereby a pseudo-load is generated, and the performance is evaluated based on the comparison between the packet transmission time (measurement) and an expected value theoretically determined.
Specifically, first, a plurality of sets (pairs) of the processor elements PE0 to PE4 are determined by being extracted at random as shown in FIG. 32A. In the example shown in FIG. 32A and FIG. 32B, the following sets 1A to 5A are determined.
(1A) Processor element PE0 and processor element PE1 
(2A) Processor element PE1 and processor element PE0 
(3A) Processor element PE2 and processor element PE3 
(4A) Processor element PE3 and processor element PE2 
(5A) Processor element PE4 and processor element PE4
The next step in the load test is to transmit packets at the same time from the processor elements PE0 to PE4 of the source in 1A to 5A above to the corresponding processor elements PE0 to PE4, respectively, of the destination. As a result, the packets are exchanged by the crossbar network unit 1, and received by the processor elements PE0 to PE4 of the destination. In the process, the packet transmission time between each set of the processor elements is measured. In the case under consideration, a total of five measurements (transmission times) corresponding to 1A to 5A are obtained. These transmission times are compared with an expected value theoretically determined, and the performance of the parallel processor system is evaluated based on whether the difference between the transmission time and the expected value is in a tolerable range.
The expected value is a theoretical value of the time that the packets are expected to take to be transmitted between the processor elements in actual arithmetic operation. This expected value is a constant value equal to the theoretical transmission time plus a margin. The theoretical transmission time is the transmission time between the processor elements which enables the parallel processor system to exhibit the maximum performance, and is calculated by a technique such as simulation. The margin, on the other hand, is a value for absorbing the difference in transmission time caused by the difference in physical distance between different sets of the processor elements described above.
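The conventional evaluation described above can be sketched as follows. This is only an illustration: the function names, the symmetric tolerance check and all numeric figures are assumptions introduced here and do not come from the description.

```python
def expected_value(theoretical_time: float, margin: float) -> float:
    """Uniform expected value: theoretical transmission time plus a margin."""
    return theoretical_time + margin


def within_tolerance(measured: float, expected: float, tolerance: float) -> bool:
    """True when the measured time differs from the expected value by at most `tolerance`."""
    return abs(measured - expected) <= tolerance


# Hypothetical measurements (arbitrary time units), one per set 1A to 5A.
measured_times = {"1A": 10.4, "2A": 10.6, "3A": 11.9, "4A": 12.1, "5A": 9.8}

# The same expected value is applied to every set in the conventional test.
expected = expected_value(theoretical_time=10.0, margin=1.0)
results = {name: within_tolerance(t, expected, tolerance=1.0)
           for name, t in measured_times.items()}
```

Because a single expected value is shared by all sets, a set whose transmission path is physically longer or shorter can fail the check even when it behaves normally.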
The load test of the parallel processor system is desirably conducted under as heavy a load as possible in order to assure proper evaluation of the performance under severe operating conditions. In the prior art, however, the processor elements PE0 to PE4 of the sources and destinations are combined at random as shown in FIG. 32A, and therefore it is sometimes impossible to conduct the load test under a heavy load, as in the case shown in FIG. 32B, thereby leading to the disadvantage that the reliability of the test result is low.
Specifically, in the case shown in FIG. 32A, the processor elements of the source and the processor elements of the destination are combined in one-to-one relation, and packets are sent at the same time from all the source processor elements. Thus, the load test under heavy load can be conducted.
In the sets shown in FIG. 32B, on the other hand, a receiving interference is caused in the processor element PE3, and therefore the load is reduced. Specifically, FIG. 32B illustrates a combination for packet transmission in which two processor elements PE2 and PE4 of the source send packets to one processor element PE3 of the destination. In this combination, the two packets, which are sent from the processor elements PE2 and PE4 of the source, arrive at the single processor element PE3 through the crossbar network unit 1. In the process, the processor element PE3 of the destination which can receive only one packet at a time develops a receiving interference in which the two packets compete with each other.
Actually, however, the chance of the two packets arriving at the processor element PE3 at the same time point is very slim due to the difference in transmission time. As a result, while the first arriving one of the two packets is received by the processor element PE3, the other packet stands by. The combination causing this receiving interference, as compared with the sets shown in FIG. 32A, reduces the load and therefore a reliable test result cannot be obtained.
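Whether a given combination causes receiving interference can be checked by looking for a destination shared by two or more sources. The sketch below is illustrative; the function name is hypothetical, and only the two sets explicitly stated for FIG. 32B are reproduced.

```python
from collections import defaultdict


def receiving_interference(sets):
    """Map each destination that has more than one source to its list of sources."""
    sources_by_destination = defaultdict(list)
    for source, destination in sets:
        sources_by_destination[destination].append(source)
    return {destination: sources
            for destination, sources in sources_by_destination.items()
            if len(sources) > 1}


# Sets 1A to 5A of FIG. 32A as (source, destination) pairs.
fig_32a = [("PE0", "PE1"), ("PE1", "PE0"), ("PE2", "PE3"),
           ("PE3", "PE2"), ("PE4", "PE4")]

# Only the two sets stated for FIG. 32B; the remaining sets are not given.
fig_32b_partial = [("PE2", "PE3"), ("PE4", "PE3")]

conflicts_a = receiving_interference(fig_32a)
conflicts_b = receiving_interference(fig_32b_partial)
```

An empty result corresponds to the one-to-one case of FIG. 32A, while a non-empty result identifies the destination at which packets compete.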
Also, in the conventional load test, an expected value (theoretical value) including a margin is applied uniformly to all the transmission times (measurements) between a plurality of sets of the processor elements, as described above. Actually, however, due to the difference in physical distance described above, the transmission time (measurement) varies from one processor element set to another. In view of the fact that a single predetermined expected value is used for varying transmission times, the conventional load test may produce a test result different from the reality, and therefore has the disadvantage of low reliability.
On the other hand, the conventional parallel processor system requires identification of a defective point based on the phenomenon presented at the time of a fault in which a packet is not sent from a processor element of the source or a packet sent from a processor element of the source fails to be received by a corresponding processor element of the destination. In the conventional parallel processor system, the configuration is complicated with the increase in the number of the processor elements involved, and the number of points to be checked increases to such an extent that a vast amount of labor and time are required before successfully identifying a defective point. Especially in the case of a fault of the crossbar network unit 1, a vast number of crossbar switches are required to be checked one by one and the workload required makes the identification of a defective point very difficult.
Further, in the case where a fault occurs in a processor element of the source, the address of a packet may change and therefore the particular packet may be sent erroneously to an entirely different destination. In such a case, the destination processor element which should otherwise receive the particular packet cannot receive it, and therefore detects a fault as a time out for receiving. On the other hand, the destination processor element that has received the packet erroneously sent thereto also detects a fault. In contrast, the processor element of the source that has actually developed a fault is regarded to be in normal operation since it has sent out the packet anyway. In case of the secondary fault described above, it is more difficult to identify a defective point.
It is an object of the invention to provide a load testing apparatus, a computer readable recording medium for recording a load test program, a fault diagnosis apparatus, and a computer readable recording medium for recording a fault diagnosis program, which can produce a highly reliable test result and can identify a defective point with rapidity.
In order to achieve the object described above, according to one aspect of the present invention, the load testing apparatus comprises a transmission time measuring unit for measuring the transmission time between each set of arithmetic units as an expected value based on the result of combining a plurality of arithmetic units accurately into a plurality of sets each including an arithmetic unit of the source and an arithmetic unit of the destination; a load testing unit for sending packets from a plurality of arithmetic units of the source to the corresponding arithmetic units of the destination each constituting a set with the corresponding arithmetic unit of the source and measuring the transmission time between each set of arithmetic units based on the result of accurate combination of the arithmetic units on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination; and a performance evaluation unit for evaluating the performance based on the result of comparing the transmission time of each set measured by the load testing unit with the corresponding expected value of each set.
According to the above invention, the transmission time between each set of arithmetic units is (actually) measured as an expected value by the transmission time measuring unit before the load test. In the load test, upon transmission of packets at the same time from a plurality of sets of the arithmetic units of the source to the corresponding arithmetic units of the destination included in the sets, respectively, a plurality of packets are received by the arithmetic units of the destination, respectively, through a network. In the process, the packets are sent at the same time on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination, and therefore a heavy load is imposed on the parallel processor system. Also, the load testing unit measures the transmission time between each set of the arithmetic units. Thus, the performance is evaluated by comparing the transmission time in each set measured by the load testing unit with the corresponding expected value for the particular set.
As described above, a load test can be conducted always under a heavy load in view of the fact that a plurality of packets are sent at the same time on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination. Further, the performance is evaluated with the actual measurement of the transmission time of each set as an expected value, and therefore a highly reliable test result is obtained.
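A minimal sketch of this aspect follows, under stated assumptions. Pairing by a random permutation is one possible way of satisfying the one-to-one condition and is not prescribed by the description; the function names, the tolerance check and all figures are hypothetical.

```python
import random


def make_sets(elements, rng):
    """Pair every source with a distinct destination by shuffling the destinations."""
    destinations = list(elements)
    rng.shuffle(destinations)
    return list(zip(elements, destinations))


def is_one_to_one(sets):
    """True when no destination is shared by two or more sources."""
    destinations = [destination for _, destination in sets]
    return len(destinations) == len(set(destinations))


def evaluate(load_times, expected_times, tolerance):
    """Compare each set's load-test time with that set's own expected value."""
    return {pair: abs(load_times[pair] - expected_times[pair]) <= tolerance
            for pair in expected_times}


elements = ["PE0", "PE1", "PE2", "PE3", "PE4"]
sets = make_sets(elements, random.Random(0))

# Hypothetical per-set figures (arbitrary time units), keyed by (source, destination).
# expected_times stands in for the values measured before the load test.
expected_times = {pair: 10 + index for index, pair in enumerate(sets)}
load_times = {pair: t + 1 for pair, t in expected_times.items()}
load_times[sets[0]] += 5  # pretend the first set degrades under load

verdicts = evaluate(load_times, expected_times, tolerance=2)
```

Each set is judged against its own previously measured value, so differences in physical distance between sets no longer distort the verdict.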
According to another aspect of the present invention, the load testing apparatus comprises a transmission time measuring unit for measuring the transmission time between each set of arithmetic units as an expected value based on the result of combining a plurality of arithmetic units accurately into a plurality of sets each including an arithmetic unit of the source and an arithmetic unit of the destination; a load testing unit for sending packets from a plurality of arithmetic units of the source to the corresponding arithmetic units of the destination each constituting a set with the corresponding arithmetic unit of the source in such a transmission timing that the packets arrive at the network at the same time and measuring the transmission time between each set of arithmetic units based on the result of accurate combination of the arithmetic units on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination; and a performance evaluation unit for evaluating the performance based on the result of comparing the transmission time of each set measured by the load testing unit with the corresponding expected value of each set.
According to the above invention, the transmission time between each set of arithmetic units is (actually) measured as an expected value by the transmission time measuring unit before the load test. In the load test, upon transmission of packets from a plurality of sets of the arithmetic units of the source to the corresponding arithmetic units of the destination included in the sets in such a transmission timing that the packets arrive at the network at the same time, a plurality of packets arrive at the network at the same time. In the process, the packets are sent on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination and that the packets arrive at the network at the same time, and therefore a maximum load is imposed on the parallel processor system. Also, the load testing unit measures the transmission time between each set of the arithmetic units. Thus, the performance is evaluated by comparing the transmission time of each set measured by the load testing unit with the corresponding expected value for the particular set.
As described above, a load test can be conducted always under a maximum load in view of the fact that a plurality of packets are sent on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination and that the packets arrive at the network at the same time. Further, the performance is evaluated based on the transmission time under maximum load with the actual measurement of the transmission time of each set as an expected value, and therefore a more highly reliable test result is obtained.
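One way to obtain such a transmission timing is to delay each source by the difference between the longest source-to-network time and its own. The summary does not state how the timing is derived, so the rule below, the function name and the figures are assumptions made purely for illustration.

```python
def send_delays(time_to_network):
    """Delay each source so that all packets reach the network at the same instant."""
    latest = max(time_to_network.values())
    return {source: latest - t for source, t in time_to_network.items()}


# Hypothetical source-to-network times in integer time units.
time_to_network = {"PE0": 3, "PE1": 5, "PE2": 4}

delays = send_delays(time_to_network)

# Arrival instant at the network for each source: its delay plus its own travel time.
arrivals = {source: delays[source] + time_to_network[source]
            for source in time_to_network}
```

With these figures the slowest source starts immediately and the others wait, so every packet reaches the network at the same instant.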
According to still another aspect of the present invention, the load testing apparatus comprises a transmission time measuring unit for measuring the transmission time between each set of arithmetic units as an expected value based on the result of combining a plurality of arithmetic units accurately into a plurality of sets each including an arithmetic unit of the source and an arithmetic unit of the destination; a load testing unit for sending packets from the source arithmetic unit of a specified set, which is longer in transmission time than the other sets, to the corresponding arithmetic unit of the destination while at the same time transmitting packets simultaneously from the source arithmetic units of the other sets to the corresponding arithmetic units of the destination of the respective sets, and measuring the transmission time between each set of arithmetic units, including the specified set, based on the result of accurate combination of the arithmetic units on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination; and a performance evaluation unit for evaluating the performance based on the result of comparing the transmission time of the specified set of arithmetic units and the transmission time of each other set measured by the load testing unit with the corresponding expected value of each set.
According to the above invention, the transmission time between each set of arithmetic units is (actually) measured as an expected value by the transmission time measuring unit before the load test. In the load test, a packet is sent from the source arithmetic unit of a specified set to the corresponding arithmetic unit of the destination while at the same time packets are transmitted from the source arithmetic units of a plurality of other sets to the corresponding arithmetic units of the destination, and then a plurality of packets are received by the corresponding arithmetic units, respectively, of the destination through a network. Also, the load testing unit measures the transmission time between each set of the arithmetic units including the specified set of arithmetic units. Thus, the performance is evaluated by comparing the transmission time of the specified set and each other set measured by the load testing unit with the corresponding expected value for each set.
As described above, while a packet is sent by a specified set of arithmetic units, packets are also transmitted by the sets other than the specified set and the performance is evaluated, and therefore it is possible to determine the effect that the transmission of a packet by the specified set of arithmetic units has on the transmission of packets by the other sets of arithmetic units.
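The effect on the other sets can be expressed, for example, as the increase of each set's load-test time over its own expected value. The sketch below assumes this simple difference as the measure; the function name and the figures are hypothetical.

```python
def effect_on_other_sets(expected_times, load_times, specified_set):
    """Increase in transmission time for every set other than the specified one."""
    return {pair: load_times[pair] - expected_times[pair]
            for pair in expected_times
            if pair != specified_set}


# Hypothetical figures in integer time units, keyed by (source, destination).
# The specified set is the one with the longest transmission time.
expected_times = {("PE0", "PE1"): 40, ("PE2", "PE3"): 10, ("PE3", "PE2"): 10}
load_times = {("PE0", "PE1"): 41, ("PE2", "PE3"): 13, ("PE3", "PE2"): 10}

effect = effect_on_other_sets(expected_times, load_times,
                              specified_set=("PE0", "PE1"))
```

A non-zero entry indicates a set whose packet transmission was slowed while the specified set was transmitting.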
According to still another aspect of the present invention, the load testing method comprises a transmission time measuring step of measuring the transmission time between each set of arithmetic units as an expected value based on the result of combining a plurality of arithmetic units accurately into a plurality of sets each including an arithmetic unit of the source and an arithmetic unit of the destination; a load test step of sending packets from a plurality of arithmetic units of the source to the corresponding arithmetic unit of the destination each constituting a set with the corresponding arithmetic unit of the source and measuring the transmission time between each set of arithmetic units based on the result of accurate combination of the arithmetic units on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination; and a performance evaluation step of evaluating the performance based on the result of comparing the transmission time of each set measured at the load test step with the corresponding expected value of each set.
According to the above invention, the transmission time between each set of arithmetic units is (actually) measured as an expected value in the transmission time measuring step before the load test. In the load test, upon transmission of packets at the same time from a plurality of sets of the arithmetic units of the source to the corresponding arithmetic units of the destination included in the sets, respectively, a plurality of packets are received by the corresponding arithmetic units of the destination through a network, respectively. In the process, the packets are sent at the same time on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination, and therefore a heavy load is imposed on the parallel processor system. Further, the transmission time between each set of the arithmetic units is measured in the load test step. Thus, the performance is evaluated by comparing the transmission time of each set measured in the load test step with the corresponding expected value for the particular set.
As described above, a load test can be conducted always under a heavy load in view of the fact that a plurality of packets are sent at the same time on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination. Further, the performance is evaluated with the actual measurement of the transmission time of each set as an expected value, and therefore a highly reliable test result is obtained.
According to still another aspect of the present invention, there is provided a computer readable recording medium for recording a load test program, the load test program being adapted to enable the computer to execute the operation comprising a transmission time measuring step of measuring the transmission time between each set of arithmetic units as an expected value based on the result of combining a plurality of arithmetic units accurately into a plurality of sets each including an arithmetic unit of the source and an arithmetic unit of the destination; a load test step of sending packets from a plurality of arithmetic units of the source to the corresponding arithmetic units of the destination each constituting a set with the corresponding arithmetic unit of the source in such a timing that the packets arrive at the network at the same time and measuring the transmission time between each set of arithmetic units based on the result of accurate combination of the arithmetic units on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination; and a performance evaluation step of evaluating the performance by comparing the transmission time of each set measured in the load test step with the corresponding expected value of the particular set.
According to the above invention, the transmission time between each set of arithmetic units is (actually) measured as an expected value in the transmission time measuring step before the load test. In the load test, packets are sent from a plurality of sets of the arithmetic units of the source to the corresponding arithmetic units of the destination included in the sets, respectively, in such a timing that the packets arrive at the network at the same time, and therefore a plurality of packets arrive at the network at the same time. In the process, a maximum load is imposed on the parallel processor system, in view of the fact that the packets are sent and arrive at the network at the same time on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination. Further, the transmission time between each set of the arithmetic units is measured in the load test step. Thus, the performance is evaluated in the performance evaluation step by comparing the transmission time of each set measured in the load test step with the corresponding expected value for the particular set.
As described above, a load test can be conducted always under a maximum load in view of the fact that a plurality of packets are sent in such a timing as to arrive at the network at the same time on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination. Further, the performance is evaluated with the actual measurement of the transmission time of each set as an expected value based on the transmission time under a maximum load, and therefore a more highly reliable test result is obtained.
According to still another aspect of the present invention, there is provided a computer readable recording medium for recording a load test program, the load test program being adapted to enable the computer to execute the operation comprising a transmission time measuring step of measuring the transmission time between each set of arithmetic units as an expected value based on the result of combining a plurality of arithmetic units accurately into a plurality of sets each including an arithmetic unit of the source and an arithmetic unit of the destination; a load test step of sending packets from the source arithmetic unit of a specified set, which is longer in transmission time than the other sets, to the corresponding arithmetic unit of the destination of the same set while at the same time sending packets simultaneously from the source arithmetic units of the other sets to the corresponding arithmetic units of the destination of the respective sets and measuring the transmission time between each set of arithmetic units, including the transmission time between the specified set of arithmetic units, based on the result of accurate combination of the arithmetic units on condition that no packet is sent from a plurality of arithmetic units of the source to a single arithmetic unit of the destination; and a performance evaluation step of evaluating the performance based on the result of comparing the transmission time of the specified set of arithmetic units and the transmission time of each other set measured in the load test step with the corresponding expected value of the specified set and each other set.
According to the above invention, the transmission time between each set of arithmetic units is (actually) measured as an expected value in the transmission time measuring step before the load test. In the load test, packets are sent from the source arithmetic unit of a specified set while the source arithmetic units of a plurality of other sets send packets at the same time to the corresponding arithmetic units of the destination, respectively, and a plurality of packets are received by the corresponding arithmetic units of the destination of the other sets through the network. Further, the transmission time between each set of the arithmetic units including the specified set of arithmetic units is measured in the load test step. Thus, the performance is evaluated in the performance evaluation step by comparing the transmission time of each set including the specified set measured in the load test step with the corresponding expected value for the particular set.
As described above, the performance is evaluated by sending a packet from a specified set of arithmetic units while at the same time sending packets simultaneously in the other sets of arithmetic units, and therefore it is possible to determine the effect that the transmission of a packet from the specific set of arithmetic units has on the packet transmission by the other sets of arithmetic units.
According to still another aspect of the present invention, the fault diagnosis apparatus comprises a set determining unit for determining a plurality of sets of an arithmetic unit of the source and an arithmetic unit of the destination accurately; a packet production unit for producing a packet corresponding to each set with an identifier attached thereto for identifying the particular set; a storage unit for storing the test information including an identifier, the information on the arithmetic unit of the source of the packet with the identifier attached thereto and the information on the arithmetic unit of the destination of the packet with the identifier attached thereto; a transmission control unit for transmitting the packet with the identifier attached thereto from a plurality of arithmetic units of the source to the corresponding arithmetic unit of the destination; an information collecting unit for collecting the information on the receiving of the packet with the identifier attached thereto by the corresponding arithmetic unit of the destination; and a fault diagnosis unit for diagnosing a fault by referring to the test information using, as a key, the identifier in the information collected by the information collecting unit.
According to the above invention, in the absence of a defective point, packets with an identifier attached thereto are transmitted from a plurality of arithmetic units of the source to the corresponding arithmetic units of the destination in the same set. In this case, the packets with an identifier attached thereto are received by the corresponding arithmetic units of the destination through a network, and therefore the information collecting unit acquires the collection result to the effect that all the packets with an identifier attached thereto have been normally received. As a result, the fault diagnosis unit can determine that there is no defective point. In the presence of a defective point, on the other hand, the arithmetic units of the destination include those which have normally received the packets with an identifier attached thereto and those which have not received such packets. In this case, the information collecting unit collects the information on the receiving condition (presence or absence of receipt) of each arithmetic unit.
The fault diagnosis unit refers to the test information using, as a key, the identifier of the normally received packets and the identifier of the unreceived packets, grasps the relation between the arithmetic units of the source and the arithmetic units of the destination taking the aforementioned receiving condition into account and makes a fault diagnosis by specifying a defective point. In the case where the result of referring to the test information shows that the packets with an identifier attached thereto which should be transmitted from a given arithmetic unit of the source are not received by any of the arithmetic units of the destination, for example, the particular single arithmetic unit of the source is identified as a defective point. Also, in the case where the result of referring to the test information shows that the packets with an identifier attached thereto sent from all the arithmetic units of the source are not received by a given arithmetic unit of the destination, the particular arithmetic unit of the destination is identified as a defective point.
As described above, packets with an identifier attached thereto for specifying a set of a plurality of arithmetic units accurately are sent, and the relation between the arithmetic units of the source and the arithmetic units of the destination is grasped taking the receiving condition of the arithmetic unit of the destination into account based on the test information, and therefore a defective point can be identified with rapidity.
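The two diagnosis rules given as examples above can be sketched as a lookup keyed by the identifier. The data layout, the function name and the use of two transmission rounds (so that every source sends to more than one destination) are assumptions made for illustration and are not specified in the description.

```python
from collections import defaultdict


def diagnose(test_info, received_ids):
    """Return suspected defective points as ("source" | "destination", name) tuples.

    test_info maps identifier -> (source, destination);
    received_ids is the set of identifiers reported as normally received.
    """
    received_by_source = defaultdict(list)
    received_by_destination = defaultdict(list)
    for identifier, (source, destination) in test_info.items():
        ok = identifier in received_ids
        received_by_source[source].append(ok)
        received_by_destination[destination].append(ok)

    suspects = []
    # Rule 1: none of a source's packets were received by any destination.
    suspects += [("source", s) for s, oks in received_by_source.items() if not any(oks)]
    # Rule 2: a destination received none of the packets addressed to it.
    suspects += [("destination", d) for d, oks in received_by_destination.items() if not any(oks)]
    return suspects


# Hypothetical test information for three processor elements over two rounds.
test_info = {
    0: ("PE0", "PE1"), 1: ("PE1", "PE2"), 2: ("PE2", "PE0"),
    3: ("PE0", "PE2"), 4: ("PE1", "PE0"), 5: ("PE2", "PE1"),
}

no_fault = diagnose(test_info, received_ids={0, 1, 2, 3, 4, 5})
source_fault = diagnose(test_info, received_ids={0, 2, 3, 5})       # packets 1 and 4 lost
destination_fault = diagnose(test_info, received_ids={0, 2, 4, 5})  # packets 1 and 3 lost
```

Because each packet's identifier resolves to exactly one source and one destination, the pattern of unreceived identifiers narrows the defective point without checking every component individually.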
According to still another aspect of the present invention, there is provided a computer readable recording medium for recording a fault diagnosis program, the fault diagnosis program being adapted to enable the computer to execute the operation comprising a set determining step of determining a plurality of sets of an arithmetic unit of the source and an arithmetic unit of the destination accurately; a packet production step of producing a packet corresponding to each set with an identifier attached thereto for identifying the particular set; a storage step of storing the test information including the identifier, the information on the arithmetic unit of the source of the packet with the identifier attached thereto and the information on the arithmetic unit of the destination of the packet with the identifier attached thereto; a transmission control step of transmitting the packet with the identifier attached thereto from a plurality of arithmetic units of the source at the same time to the corresponding arithmetic unit of the destination; an information collecting step of collecting the information on the receiving of a packet with an identifier attached thereto by the corresponding arithmetic unit of the destination; and a fault diagnosis step of diagnosing a fault by referring to the test information using, as a key, the identifier in the information collected in the information collecting step.
According to the above invention, in the absence of a defective point, packets with an identifier attached thereto are transmitted from a plurality of arithmetic units of the source to the corresponding arithmetic units of the destination in the same set. In this case, the packets with an identifier attached thereto are received by the corresponding arithmetic units of the destination through a network, and therefore the information collecting step acquires the collection result to the effect that all the packets with an identifier attached thereto have been normally received. As a result, the fault diagnosis step can determine that there is no defective point. In the presence of a defective point, on the other hand, the arithmetic units of the destination include those which have normally received the packets with an identifier attached thereto and those which have not received such packets. In this case, the information collecting step collects the information on the receiving condition (presence or absence of receipt) of the corresponding arithmetic unit.
The fault diagnosis step grasps the relation between the arithmetic units of the source and the arithmetic units of the destination taking the aforementioned receiving condition into account by referring to the test information using, as a key, the identifier attached to the normally received packets and the identifier attached to the unreceived packets, and makes a fault diagnosis by specifying a defective point. In the case where it is found, by referring to the test information, that the packets with an identifier attached thereto which should be transmitted from a given arithmetic unit of the source are not received by any of the arithmetic units of the destination, for example, the particular arithmetic unit of the source is identified as a defective point. Also, in the case where the result of referring to the test information shows that the packets with an identifier attached thereto sent from all the arithmetic units of the source are not received by a given arithmetic unit of the destination, the particular arithmetic unit of the destination is identified as a defective point.
As described above, packets are sent with an identifier attached thereto for specifying a set of a plurality of arithmetic units accurately, and the relation between the arithmetic units of the source and the arithmetic units of the destination is grasped taking the receiving condition of the arithmetic units of the destination into account based on the test information, and therefore a defective point can be identified with rapidity.
Other objects and features of this invention will become apparent from the following description with reference to the accompanying drawings.