1. Field of the Invention
An embodiment of the present invention relates to a failure analysis apparatus, which may include a failure analysis apparatus that is implemented in an information processing apparatus having a plurality of boards mounted with a plurality of logic circuits and that analyzes what kind of failure has occurred in the logic circuits, so as to realize a reduction in memory resources, faster processing, a reduction in labor for development, a thorough analysis of critical failures, and a reduction in the unanalyzable range.
2. Description of the Related Art
Today, an information processing apparatus is mounted with high-density, highly integrated, and complicated LSIs such as ASICs (Application Specific Integrated Circuits). In order to reduce the downtime or the recovery time of such an apparatus, there is a strong demand for a failure analysis function that, when a failure occurs in the LSIs, autonomously and quickly determines the accurate location of the failure and the affected range.
The progress in the integration of LSIs has led to a continuous increase in analysis information required for the failure analysis of LSIs. This requires an input operation of a large amount of analysis information. Further, communication is inevitable between a designer of the LSIs, a designer of the system mounted with the LSIs, and a designer of the firmware for analyzing failure of the LSIs. Therefore, an enormous amount of labor for the development is required to realize such a failure analysis function.
Thus, it is strongly desired to establish a new technique to efficiently realize such a failure analysis function.
An information processing apparatus mounted with ASICs usually includes a plurality of system boards, each mounted with a plurality of ASICs of a plurality of types.
For this reason, conventionally, when a failure occurs in the ASICs, the failure is analyzed for each system board using one or a plurality of prepared analysis tables. The analysis results for the individual system boards are then collected to deliver the analysis result of the entire system.
FIG. 15 illustrates a configuration of the conventional art.
In FIG. 15, reference numeral 100 denotes a plurality of system boards to be analyzed that are implemented in the information processing apparatus. Reference numeral 110 denotes a board analysis information table. Reference numeral 120 denotes a system analysis information table. Reference numeral 130 denotes an analysis processing unit.
The system boards 100 are usually mounted with a plurality of ASICs of a plurality of types. The board analysis information table 110 is defined for each system board 100, and stores information necessary for analyzing failures that have occurred in the ASICs mounted on the system boards 100. The system analysis information table 120 stores information necessary for analyzing failures between the system boards 100. The analysis processing unit 130 is provided with an analysis process function for analyzing failures of each system board 100 and an analysis process function for analyzing failures of the entire system.
Specifically, the analysis processing unit 130 is realized by firmware (hereinafter, may be referred to as monitoring firmware) implemented in the information processing apparatus. The board analysis information table 110 and the system analysis information table 120 are deployed on memories provided with the firmware.
In the conventional art configured this way, log information of the ASICs (hardware failure flags described below) is collected for each system board 100. The board analysis information table 110, which is defined for each system board 100, is used to analyze failures related to that system board 100, thereby specifying the failure that has occurred in the system board 100.
After the failure analysis related to the system boards 100 is finished, the system analysis information table 120 is used. For example, when a failure detected at a receiver end has arisen from a failure that occurred at a transmitting end, the failure detected at the receiver end is excluded from the failure analysis. The failure analysis of the entire system is then performed, thereby ultimately specifying what kind of failure has occurred.
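The receiver-end exclusion described above can be sketched as follows. This is a minimal illustration only, not the actual firmware logic. The names `Failure`, `LINKS`, and `exclude_propagated`, as well as the board and port identifiers, are hypothetical and merely stand in for whatever the system analysis information table 120 encodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    board: str  # hypothetical board identifier
    port: str   # hypothetical port name on that board
    role: str   # "tx" (transmitting end) or "rx" (receiver end)


# Hypothetical inter-board link table: receiver endpoint -> transmitter endpoint.
LINKS = {("SB1", "in0"): ("SB0", "out0")}


def exclude_propagated(failures):
    """Drop receiver-end failures whose linked transmitting end also failed."""
    tx_failed = {(f.board, f.port) for f in failures if f.role == "tx"}
    kept = []
    for f in failures:
        source = LINKS.get((f.board, f.port))
        if f.role == "rx" and source in tx_failed:
            continue  # treated as a consequence of the transmitter failure
        kept.append(f)
    return kept


detected = [Failure("SB0", "out0", "tx"), Failure("SB1", "in0", "rx")]
remaining = exclude_propagated(detected)
```

In this sketch only the transmitting-end failure on the hypothetical board SB0 remains for the system-level analysis. A receiver-end failure with no failed transmitter behind it is kept.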
In this way, in the conventional art, when a failure occurs in the ASICs, the failure is first analyzed for each system board 100, and then the analysis results for the individual system boards 100 are collected to deliver the analysis result of the entire system.
The designer of the ASICs or the designer of the system boards 100 creates the board analysis information table 110 required for performing the above failure analysis. The designer of the system or the designer of the system boards 100 creates the system analysis information table 120.
More specifically, in the conventional art, as shown in FIG. 16, the designer of the ASICs, independently or in collaboration with the designer of the system boards 100, creates a board analysis definition, which is the data of the board analysis information table 110 before compilation, for each type of ASIC. The system designer, who manages the system, independently or in collaboration with the designer of the system boards 100, edits the board analysis definition to create a system analysis definition, which is the data of the system analysis information table 120 before compilation. The board analysis definition and the system analysis definition thus created are compiled into forms that can be imported into the monitoring firmware, thereby creating the board analysis information table 110 and the system analysis information table 120.
The analysis processing unit 130 uses the board analysis information table 110 thus created to analyze failures related to the system boards 100. In this case, as shown in FIG. 17, the analysis processing unit 130 stores hardware failure flags (flag group in hardware for showing the cause of failure in case of hardware failure) collected from the ASICs in a failure flag buffer reserved for failure analysis, and then executes a process of specifying what kind of failure has occurred.
When executing the process, the conventional analysis processing unit 130 stores in the failure flag buffer the hardware failure flags detected before the buffer becomes full, and abandons the hardware failure flags detected after the buffer has become full. The analysis processing unit 130 then examines which hardware failure flags are stored in the failure flag buffer, thereby specifying what kind of failure has occurred.
Thus, when a large number of hardware failure flags are set, the conventional analysis processing unit 130 discontinues the failure analysis after a certain number of detections, and reports the failure analysis result up to that point.
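The buffer behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions: the capacity of 4, the flag names, and the function `collect_flags` are hypothetical, since no buffer size or flag naming is given here.

```python
BUFFER_CAPACITY = 4  # hypothetical size; no figure is given in the text


def collect_flags(detected_flags, capacity=BUFFER_CAPACITY):
    """Store flags in detection order; abandon every flag seen after the buffer fills."""
    buffer, dropped = [], []
    for flag in detected_flags:
        if len(buffer) < capacity:
            buffer.append(flag)
        else:
            dropped.append(flag)  # never reaches the analysis
    return buffer, dropped


# Hypothetical flag names; the last one stands for a more critical failure.
flags = ["warn-1", "warn-2", "warn-3", "warn-4", "fatal-5"]
stored, lost = collect_flags(flags)
```

Because flags are kept strictly in detection order, the hypothetical `fatal-5` flag is abandoned and only the four earlier flags are analyzed.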
The analysis processing unit 130 analyzes failures using the board analysis information table 110 and the system analysis information table 120 created with a method as shown in FIG. 16. However, in the conventional analysis processing unit 130, as shown in FIG. 18, the board analysis information table 110 and the system analysis information table 120 that are information used in the failure analysis are permanently stationed in a memory of the monitoring firmware immediately after the startup of the system, although the failure analysis is a temporary process executed when an abnormality occurs in the system.
The memory space in FIG. 18 represents the system memory space of the monitoring firmware. The analysis information in FIG. 18 represents the board analysis information table 110 and the system analysis information table 120, both of which are information used in the failure analysis. The analysis work in FIG. 18 represents the work memory area used by the monitoring firmware in the failure analysis.
As described above, when a failure occurs in an ASIC, in the conventional art, the failure is first analyzed for each system board 100, and then the analysis results for the individual system boards 100 are collected, thereby delivering the analysis result of the entire system.
In this way, in the conventional art, the failure analysis is performed for each system board 100. Therefore, as shown in FIG. 19, for example, when the hardware failure flags of even one ASIC (for example, ASIC-D in FIG. 19) mounted on the system boards 100 cannot be collected, the entire failure analysis of the system boards 100 becomes impossible.
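The all-or-nothing nature of the per-board analysis can be sketched as follows. This is an illustrative sketch only. The function `analyze_board`, the use of `None` to mark flags that could not be collected, the flag names, and the ASIC names other than ASIC-D are assumptions made for the example.

```python
def analyze_board(flags_by_asic):
    """Per-board analysis that needs the flags of every ASIC on the board.

    A value of None marks an ASIC whose flags could not be collected
    (an assumption of this sketch).
    """
    if any(flags is None for flags in flags_by_asic.values()):
        return None  # the whole board becomes unanalyzable
    # Report only the ASICs that actually raised flags.
    return {asic: sorted(flags) for asic, flags in flags_by_asic.items() if flags}


board = {
    "ASIC-A": {"parity"},
    "ASIC-B": set(),
    "ASIC-C": {"timeout"},
    "ASIC-D": None,  # flags could not be collected
}
result = analyze_board(board)
```

Here the flags already collected from the other ASICs are of no use: a single uncollectable ASIC makes the result for the whole board unavailable.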
The conventional art described above has the following problems.
(1) Problems in Relation to Memory Resources and Processing Time
According to the conventional failure analysis method performed for each system board 100, when failures are analyzed, all hardware failure flags of the system boards 100 must be written into a work memory area (the analysis work shown in FIG. 18) used for the failure analysis.
However, since several to several tens of ASICs are mounted on the system boards 100, the total number of hardware failure flags across the system boards 100 is very large.
Therefore, the conventional failure analysis method performed for each system board 100 has a problem in that a large amount of memory is required for the failure analysis.
Furthermore, ASICs of the same type are mounted on the system boards 100. According to the conventional failure analysis method, in which the analysis is performed for each system board 100, the board analysis information tables 110 are generated for each system board 100. Thus, the board analysis information tables 110 of the same ASICs are generated in duplicate. This also leads to a demand for a large amount of memory resources.
More specifically, even for the same ASICs, the board analysis information tables 110 differ according to the place where each ASIC is mounted. However, the conventional failure analysis method performed for each system board 100 does not employ a structure in which the analysis definitions that depend on the mounted place of each ASIC are described in the board analysis information tables 110. Thus, the board analysis information tables 110 cannot be shared. Therefore, a large amount of memory resources has been demanded, since the board analysis information tables 110 of the same ASICs are included in duplicate.
Moreover, the failure analysis is a temporary process executed when a failure occurs in the system. However, according to the conventional failure analysis method performed for each system board 100, the board analysis information tables 110 and the system analysis information tables 120, which are information used in the failure analysis, are permanently stationed in a memory of the monitoring firmware immediately after the startup of the system, as shown in FIG. 18.
When the type or the version number of the ASICs mounted on the information processing apparatus is known in advance, only the corresponding board analysis information tables 110 and system analysis information tables 120 need to be permanently stationed. However, when the type or the version number of the ASICs is not known in advance, the tables for all ASICs that may be mounted on the information processing apparatus need to be permanently stationed, and a large amount of memory is required to keep them permanently stationed.
In this regard too, the conventional failure analysis method performed for each system board 100 has a problem in that a large amount of memory resources is required.
Second, several thousands to several tens of thousands of hardware failure flags are needed for each ASIC, so that several hundreds of thousands of hardware failure flags are analyzed for the system boards 100 as a whole. Furthermore, the board analysis information tables 110 are prepared for each system board 100. Thus, a vast amount of calculation is required to search the board analysis information tables 110.
For this reason, the conventional failure analysis method performed for each system board 100 has a problem in that an enormous amount of processing time is required for the failure analysis.
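The flag count behind this processing-time problem can be checked with simple arithmetic. The two input values below are illustrative picks from within the ranges stated in this section (flags per ASIC, and the several to several tens of ASICs mentioned earlier), not figures given for any particular system.

```python
# Illustrative values only, chosen from within the ranges stated in the text.
flags_per_asic = 10_000  # "several thousands to several tens of thousands"
asic_count = 30          # "several to several tens" of ASICs

# Total hardware failure flags that the analysis has to examine.
total_flags = flags_per_asic * asic_count
print(total_flags)  # 300000
```

With these assumed values the total is 300,000 flags, which matches the stated order of several hundreds of thousands.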
(2) About Labor for Development
In the conventional failure analysis method performed for each system board 100, two kinds of tables, the board analysis information table 110 and the system analysis information table 120, are used for analyzing failures. As shown in FIG. 16, the designer of the ASICs or the designer of the system boards 100 creates the board analysis information table 110, and the designer of the system or the designer of the system boards 100 creates the system analysis information table 120.
Therefore, according to the conventional failure analysis method performed for each system board 100, development labor is incurred during the initial design and each design modification of the tables 110 and 120, and there is a problem that burdens are imposed on the designers.
Moreover, it is inevitable that different designers interpret the description definition of the analysis information in different ways. Therefore, the conventional failure analysis method performed for each system board 100 has a problem in that errors occur due to such differences in interpretation.
(3) About Missed Analysis
As shown in FIG. 17, the conventional failure analysis method discontinues the failure analysis after a certain number of detections, since the failure flag buffer cannot store all of the hardware failure flags when a large number of hardware failure flags are set.
Therefore, the conventional failure analysis method has a problem in that more critical failures detected after the failure flag buffer has become full are missed.
(4) About Unanalyzable Range
As shown in FIG. 19, the conventional failure analysis method performed for each system board 100 has a problem in that the entire failure analysis of the system boards 100 becomes impossible when, due to some kind of secondary problem, the hardware failure flags cannot be collected from even one ASIC mounted on the system boards 100.