In recent years, technological development has advanced for distributed processing apparatuses, such as those used in grid computing, that execute a large number of jobs in a distributed manner using a large number of computers connected via a network. In such a distributed processing apparatus, optical transmission is widely used for communication between central processing units (CPUs).
The distributed processing apparatus, upon detecting a transmission failure in communication between CPUs, acquires register information stored in registers included in a transmission control unit and an optical transmission/reception module, and stores the register information as log information in a storage area. As the optical transmission/reception module, for example, an active optical cable (AOC) that includes a connector having a built-in optical-to-electrical conversion unit may be used. The register information includes a voltage, a temperature, a vendor code, a serial number, or the like. The register information also includes failure information, such as information indicating interruption of optical signal communication, a failure in clock synchronization using a data signal, or the like.
The log information obtained when a failure occurs is important for identifying the component that requires maintenance, and it is preferable that the failure information included in the register information be acquired exhaustively, with nothing missing. A time lag usually occurs between when failure information is reflected in a register of the AOC and when a CPU acquires that failure information from the register. To cope with this, the AOC holds the register information stored in the register until the register information is read by the CPU. After the CPU reads the information, the AOC clears the register.
Among the registers included in the AOC, a plurality of alarm registers for storing failure information are provided depending on the types of the failure information, and one byte is assigned to each of the alarm registers. Further, the unit of readout from a register is one byte. A register of this kind is referred to as a register of a one-byte-based clearing-triggered-by-readout system, and is commonly adopted as a standard for alarm registers of AOCs and the like.
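The clearing-triggered-by-readout behavior described above can be illustrated with a minimal sketch. The class name `ClearOnReadRegister` and its methods are hypothetical, and the assumption that newly detected failure bits are OR-ed into the held byte is made for illustration only; it is not taken from any AOC specification.

```python
class ClearOnReadRegister:
    """Hypothetical model of a one-byte alarm register cleared by readout."""

    def __init__(self):
        self._value = 0  # one byte, held until the next readout

    def latch(self, bits):
        # Device side (assumed): OR new failure bits into the held byte.
        self._value = (self._value | bits) & 0xFF

    def read(self):
        # Readout returns the whole byte, then the register is cleared.
        value = self._value
        self._value = 0
        return value


reg = ClearOnReadRegister()
reg.latch(0b00000100)   # a failure is recorded and held
first = reg.read()      # the first readout sees the failure
second = reg.read()     # a later readout sees an empty register
```

In this sketch, the held byte survives the time lag between the failure and the first readout, but nothing remains for any subsequent readout.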
In some cases, the AOC may include a plurality of channels, and each of the channels may be assigned to one of a plurality of CPUs. Meanwhile, the alarm register of the AOC holds a bitmap in which one bit is assigned to each of the channels, so that the pieces of information for the plurality of channels are stored together in the one-byte alarm register. A CPU performs interrupt processing upon detecting a signal transmission failure, and acquires failure information from the alarm register of the AOC. If each of the CPUs that share a single AOC detects a signal transmission failure, the failure information readout operation is performed a plurality of times on the single AOC.
As a technology for acquiring the failure information as described above, there is a conventional technology in which a plurality of targets are monitored, and when new event information is collected, the new event information is stored in combination with existing event information that has not yet been read by a CPU.
Patent Document 1: Japanese Laid-open Patent Publication No. 2008-90505
However, once a CPU reads the failure information stored in the alarm register of the AOC, the failure information corresponding to a plurality of CPUs is read out, and thereafter the failure information corresponding to the plurality of CPUs is collectively cleared. That is, when a single CPU reads the failure information, the alarm register is cleared after the readout, and the failure information corresponding to the other CPUs may therefore be lost. As a result, it is difficult for a maintenance administrator to recognize the occurrence of failures in the other CPUs and to perform appropriate maintenance.
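The loss described above can be reproduced with a small sketch in which two CPUs share one alarm register. The names `AlarmRegister` and `handle_interrupt`, and the assignment of channel 0 and channel 1 to the two CPUs, are assumptions made for illustration.

```python
class AlarmRegister:
    """Hypothetical one-byte alarm bitmap, one bit per channel, cleared on read."""

    def __init__(self):
        self.value = 0

    def set_channel_failure(self, channel):
        # Device side: mark a failure on the given channel (0 to 7).
        self.value |= (1 << channel) & 0xFF

    def read(self):
        # The whole byte is returned and then cleared, for all channels at once.
        value = self.value
        self.value = 0
        return value


def handle_interrupt(register, own_channel):
    # Each CPU reads the byte and checks only the bit of its own channel.
    byte = register.read()
    return bool(byte & (1 << own_channel))


reg = AlarmRegister()
reg.set_channel_failure(0)  # failure on the channel assumed to belong to CPU 0
reg.set_channel_failure(1)  # failure on the channel assumed to belong to CPU 1

cpu0_sees_failure = handle_interrupt(reg, own_channel=0)  # reads first
cpu1_sees_failure = handle_interrupt(reg, own_channel=1)  # reads after the clear
```

Here the first reader finds its failure bit, while the second reader finds an already cleared register, even though a failure did occur on its channel.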
Furthermore, even when a conventional technology for storing the collected new event information together with the existing event information is used, if a plurality of CPUs perform readout, failure information is cleared after a certain CPU reads the failure information. Therefore, it is difficult for the other CPUs to acquire the failure information.
Moreover, as a method to cope with the situation described above, it may be possible to cause the CPU that has read the failure information to hold all of the read failure information, and to cause the other CPUs to use the information held by that CPU when they analyze failures. However, in this method, the amount of information to be analyzed by the other CPUs increases, and the cost of the failure analysis may increase.
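The added analysis cost of this workaround can be sketched as follows. The shared list `held_log` and the helper functions are hypothetical; the sketch only assumes that the reading CPU keeps every byte it has read and that another CPU must scan all of them to find entries for its own channel.

```python
held_log = []  # every byte read by the reading CPU, covering all channels


def record_readout(byte):
    # The reading CPU holds all read failure information, unfiltered.
    held_log.append(byte & 0xFF)


def failures_for_channel(channel):
    # Another CPU must examine every held entry to find its own failures.
    examined = 0
    hits = 0
    for byte in held_log:
        examined += 1
        if byte & (1 << channel):
            hits += 1
    return hits, examined


record_readout(0b0001)  # failure on channel 0 only
record_readout(0b0010)  # failure on channel 1 only
record_readout(0b0101)  # failures on channels 0 and 2

hits, examined = failures_for_channel(1)
```

In this sketch, the CPU interested in channel 1 examines all three held entries to find the single entry that concerns it, and the number of entries to examine grows with every readout made on behalf of any channel.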
As another method to cope with the above-described situation, it may be possible to change the unit of readout from the register to one bit. However, in this case, an AOC having a special specification different from the common specification has to be developed. Therefore, the number of development processes and the development costs may increase.