Measuring software reliability is essential for assessing and improving the availability of network elements such as routers and switches, and for troubleshooting them efficiently and effectively. For example, kernel software is the core operating system software that controls the execution of a Route Processor (RP). A failure of the kernel software halts the RP, causing a complete RP outage. Measuring kernel software outages and classifying the reasons for them are therefore especially important for high availability network operations.
Prediction and measurement are the two major approaches to estimating software Mean Time Between Failure (MTBF). Unlike hardware MTBF estimation, the prediction of software MTBF is very difficult because systematic techniques are lacking. There have been several attempts to predict software MTBF based upon lines of code, bug counts, the number of “if” statements, etc. However, these techniques are still experimental and have not been proven effective in commercial applications.
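As an illustration of the measurement approach, the following minimal sketch computes a measured MTBF from recorded outage intervals. It assumes the common convention that MTBF equals accumulated operating time divided by the number of failures. The function name, the record layout, and the sample timestamps are hypothetical and are not taken from any existing system.

```python
from datetime import datetime, timedelta

def measured_mtbf(observation_start, observation_end, outages):
    """Return the measured MTBF as a timedelta, or None if no failure occurred.

    outages is a list of (down_at, up_at) datetime pairs, one per failure,
    all lying inside the observation window (an assumption of this sketch).
    """
    if not outages:
        return None  # MTBF is undefined without at least one failure
    window = observation_end - observation_start
    # Sum the downtime of every outage, starting from a zero timedelta.
    downtime = sum((up - down for down, up in outages), timedelta())
    operating_time = window - downtime
    return operating_time / len(outages)

# Hypothetical data: a 30-day window with two one-hour outages.
start = datetime(2024, 1, 1)
end = datetime(2024, 1, 31)
outages = [
    (datetime(2024, 1, 10, 3, 0), datetime(2024, 1, 10, 4, 0)),
    (datetime(2024, 1, 20, 12, 0), datetime(2024, 1, 20, 13, 0)),
]
mtbf_hours = measured_mtbf(start, end, outages).total_seconds() / 3600
print(mtbf_hours)  # 359.0, i.e. (720 h - 2 h) / 2 failures
```

The result is only as accurate as the outage records supplied to it, so a missed or misclassified failure event skews the measured MTBF directly.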
Certain crash information currently exists for measuring kernel software MTBF in the field. In one technique, a special Network Management Server (NMS) monitors a network device's reboot reason via Simple Network Management Protocol (SNMP) notifications. The software MTBF is calculated and stored in the remote NMS device. In another process, all error and register related information is dumped into a persistent memory file at the time of a router crash. This file is then reviewed manually off-line for software MTBF calculation and analysis. However, the reboot information and manual techniques mentioned above do not address the need for accuracy, scalability, cost effectiveness, and manageability.
For example, the SNMP based measurements can be lost because the SNMP traps relied upon by the remote measuring device are unreliable. The SNMP based measurements also cannot capture certain software failure events, such as a standby RP failure or a forced switch-over event in dual-RP systems. There are also no specific rules for distinguishing software-caused crashes from other types of router crashes.
Current outage measurement schemes are also unable to automatically distinguish operations related to software outage events. Thus, a system administrator has to manually search all dumped crash information for specific types of software related outage information. Outage reasons and MTBF information then have to be generated manually by the system administrator.
The present invention addresses these and other problems associated with the prior art.