The field of the present invention is the troubleshooting of hardware failures and the maintenance of computers.
More particularly, it relates to a process for analyzing information that is recorded at the moment a malfunction is detected in the computer, in order to locate the component or components that caused the failure and to replace only the malfunctioning components.
It also relates to a tool for analyzing and locating failures and a computer that incorporates the tool.
The constant decrease in the price of computing machines sometimes leads manufacturers to lower the quality of certain hardware components.
A component can be, for example, an ASIC (“Application-Specific Integrated Circuit”) or a processor.
The user is therefore more and more frequently confronted with problems linked to hardware-related errors. All of the current machines are more or less capable of finding these errors, which can sometimes lead to failures in certain parts of the machine, or to a complete shutdown of the machine.
Each sensitive component of a machine has status registers indicating the operating condition of the component in question.
A given status of the machine is characterized by a “signature” of its status registers, i.e., a characteristic value of each register for this given status.
It is these values that constitute the information that will subsequently be analyzed by the machine.
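The notion of a signature can be illustrated with a minimal sketch, assuming the status registers are read as a mapping from register name to integer value. The register names, the values, and the helper functions below are hypothetical and serve only as an illustration; they are not taken from any particular machine.

```python
from typing import Dict, Optional, Tuple

# A signature is modeled here as a mapping: register name -> register value.
# All register names and values are hypothetical, for illustration only.
Signature = Dict[str, int]


def capture_signature(registers: Dict[str, int]) -> Signature:
    """Return a snapshot (a copy) of the current status-register values."""
    return dict(registers)


def changed_registers(
    reference: Signature, observed: Signature
) -> Dict[str, Tuple[Optional[int], int]]:
    """Return the registers whose observed value differs from the reference.

    Each entry maps a register name to (reference value, observed value).
    """
    return {
        name: (reference.get(name), value)
        for name, value in observed.items()
        if reference.get(name) != value
    }


# Assumed signature of a machine operating normally.
NOMINAL: Signature = {
    "CPU0_STATUS": 0x0000,
    "MEMCTL_STATUS": 0x0000,
    "IOBRIDGE_STATUS": 0x0000,
}

# Assumed register values recorded when a malfunction is detected.
observed = capture_signature(
    {"CPU0_STATUS": 0x0000, "MEMCTL_STATUS": 0x0004, "IOBRIDGE_STATUS": 0x0000}
)
diff = changed_registers(NOMINAL, observed)
```

In this sketch, `diff` isolates the registers that deviate from the nominal signature, which is the raw information a subsequent analysis would have to interpret.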
It is possible to distinguish several types of failures in a computing machine.
In a first type, the failure causes a minor error that remains localized at the component level and is immediately corrected by the software that controls this component, and therefore the user does not experience any disturbance of his work.
In a second type, the failure can cause an error whose seriousness makes it no longer possible to guarantee the integrity of the data processed and may make it necessary to restart the machine.
The present invention relates more specifically, though not exclusively, to this second type of failure, which can cause interruptions in the operation of the machine known, respectively, as “machine check” and “checkstop.”
In the case of an interruption of the “machine check” type, the information collected is targeted to the component that detected the error, while in the case of an interruption of the “checkstop” type, all the “signatures” of the status registers of the machine are collected.
In both cases, it is then necessary to interpret the values of the status registers in order to determine the error and possibly deduce its cause.
Each component of the machine is more or less directly linked to one or more other components of this machine, which will be called “neighbor components.” If a component has a defect, it is revealed by the neighbor components in their status registers. The user is then warned that there has been a failure in the machine, but in certain cases, there is nothing that allows him to know exactly which component is the defective one that caused the error.
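The difficulty described above can be made concrete with a short sketch. It assumes a hypothetical machine topology and a deliberately simple heuristic, which is not the method of the invention: a component is suspected if it is a neighbor of every component whose status registers report an error. All component names are illustrative.

```python
from typing import Dict, Set


def suspect_components(
    neighbors: Dict[str, Set[str]], reporters: Set[str]
) -> Set[str]:
    """Return the components adjacent to every reporting component.

    Simplifying assumption: the defective component is a neighbor of all
    components that recorded an error, and is not itself a reporter.
    """
    if not reporters:
        return set()
    candidate_sets = [neighbors[name] for name in reporters]
    return set.intersection(*candidate_sets) - reporters


# Hypothetical topology: each component mapped to its neighbor components.
TOPOLOGY: Dict[str, Set[str]] = {
    "cpu0": {"memctl", "iobridge"},
    "cpu1": {"memctl", "iobridge"},
    "memctl": {"cpu0", "cpu1", "iobridge"},
    "iobridge": {"cpu0", "cpu1", "memctl"},
}

# One reporter leaves two candidates: the defect cannot be pinned down.
ambiguous = suspect_components(TOPOLOGY, {"cpu0"})

# Two reporters narrow the candidates down to a single component.
located = suspect_components(TOPOLOGY, {"cpu0", "iobridge"})
```

With a single reporting component the heuristic leaves several candidates, which mirrors the situation in which the user is warned of a failure without being able to tell which component caused it.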
In the event of an error, the signature of the status registers of the machine is still available, but it does not provide an overall view of the status of the machine. There is a gap in the information: what is known is either precise but partial (the status registers) or global but imprecise (an operational error has occurred). When the error results in a hard stop of the machine, it is necessary to pore through a thick manual to find the meaning of the status registers, and the help of an expert is required to perform a global analysis of these registers a posteriori.
The existing error analysis tools can provide all of the values of the registers in text form and can even perform the analysis of these values. However, the description of the status registers and the rules for interpreting their contents are buried in the machine code of these tools.
Since a tool is generally dedicated to one hardware version, it is not possible to add new descriptions of registers or new rules of interpretation without creating a new version of the tool.
The object of the invention is specifically to eliminate these drawbacks.
To this end, the subject of the invention is a process for analyzing and locating hardware failures in a computing machine storing information on operational errors generated by the various sensitive hardware components of the machine.
It is characterized in that it consists of creating a man/machine interface through which the components and the rules for interpreting errors are described in a structured language and used by the machine as external parameters in correlation with the error information to detect the malfunctioning component or components.
Another subject of the invention is a tool for analyzing and locating hardware failures in a computing machine comprising means for storing error information generated by the sensitive components of the machine.
It is characterized in that it includes an error analysis engine receiving through a first series of inputs the error information, and receiving through a second series of inputs the parameters required for the description of the sensitive components of the machine and for the description of the rules for interpreting errors, and in that it includes a man/machine interface between the tool and the component expert to allow him to formulate the parameters in a structured language.
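An engine with two series of inputs can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the tool: JSON stands in for the structured language, and the register names, bit positions, rule format (`if_set`, `suspect`), and the functions `bit_is_set` and `analyze` are all hypothetical.

```python
import json
from typing import Dict, List

# Second series of inputs: component descriptions and interpretation rules,
# written by the component expert. JSON is an assumed stand-in for the
# structured language; every name and bit position here is hypothetical.
PARAMETERS = json.loads("""
{
  "registers": {
    "MEMCTL_STATUS":   {"component": "memory_controller"},
    "IOBRIDGE_STATUS": {"component": "io_bridge"}
  },
  "rules": [
    {"if_set": ["MEMCTL_STATUS.2"], "suspect": "memory_controller"},
    {"if_set": ["IOBRIDGE_STATUS.1", "MEMCTL_STATUS.0"], "suspect": "io_bridge"}
  ]
}
""")


def bit_is_set(error_info: Dict[str, int], ref: str) -> bool:
    """Test a 'REGISTER.bit' reference against the recorded register values."""
    name, bit = ref.rsplit(".", 1)
    return bool((error_info.get(name, 0) >> int(bit)) & 1)


def analyze(error_info: Dict[str, int], parameters: dict) -> List[str]:
    """Correlate the error information with the rules; return the suspects.

    A rule fires when all of the register bits it lists are set.
    """
    suspects: List[str] = []
    for rule in parameters["rules"]:
        if all(bit_is_set(error_info, ref) for ref in rule["if_set"]):
            if rule["suspect"] not in suspects:
                suspects.append(rule["suspect"])
    return suspects


# First series of inputs: the error information (recorded register values).
error_info = {"MEMCTL_STATUS": 0x4, "IOBRIDGE_STATUS": 0x0}
suspects = analyze(error_info, PARAMETERS)
```

The engine itself contains no knowledge of any register: everything specific to the hardware arrives through the `parameters` argument.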
Finally, another subject of the invention is a computer that incorporates the tool defined above.
The formulation of the parameters for describing the registers and the rules for interpreting errors according to the invention makes it possible to add new descriptions or to enrich the interpretation simply by editing source files written in a given format, without having to create a new version of a tool with each hardware upgrade.
Moreover, the architecture of the tool according to the invention is scalable and its maintenance is facilitated by separating the analysis tool itself (the engine), which processes the information in “machine” code, from the descriptions of the status registers and the interpretation rules written in “source” code.
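The benefit of this separation can be illustrated with a short sketch, assuming a hypothetical line-based source format in which each line reads `REGISTER bit meaning`. The format, the register names, and the functions `load_descriptions` and `decode` are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical description source, one "REGISTER bit meaning" entry per line.
DESCRIPTION_SOURCE = """
# register      bit  meaning
MEMCTL_STATUS   0    corrected ECC error
MEMCTL_STATUS   2    uncorrectable ECC error
"""


def load_descriptions(source: str) -> Dict[str, Dict[int, str]]:
    """Parse the description source into {register: {bit: meaning}}."""
    table: Dict[str, Dict[int, str]] = {}
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        register, bit, meaning = line.split(maxsplit=2)
        table.setdefault(register, {})[int(bit)] = meaning
    return table


def decode(register: str, value: int, table: Dict[str, Dict[int, str]]) -> List[str]:
    """Return the meanings of the bits set in 'value', in ascending bit order."""
    bits = table.get(register, {})
    return [meaning for bit, meaning in sorted(bits.items()) if (value >> bit) & 1]


table = load_descriptions(DESCRIPTION_SOURCE)

# A hardware upgrade adds a register: only the source text is extended,
# while load_descriptions and decode (the "engine") remain unchanged.
extended = load_descriptions(
    DESCRIPTION_SOURCE + "IOBRIDGE_STATUS 1 bus parity error\n"
)
```

Decoding `IOBRIDGE_STATUS` yields nothing with the original table and a meaning with the extended one, although no line of the engine code has changed between the two cases.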