The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
A key problem in software systems is how to ensure that the software systems are continuously available. A software system must mask and quickly recover from various failures, such as hardware failures and failures due to errors in the software code itself.
Traditionally, software errors (also known as bugs) in the code of a software system that affect the performance of the system are classified into “Bohrbugs” and “Heisenbugs”. Bohrbugs are 100 percent reproducible; that is, if the software system executes the same sequence of operations, the error causes a failure again. Usually, software systems have only a few Bohrbugs because most such errors are detected during the testing phase of the software system life-cycle. Heisenbugs, on the other hand, are highly dependent on the timing of various events. They are difficult to reproduce because even if the software system executes the same sequence of operations, the timing of events during the underlying execution of these operations may vary in such a way that the error does not affect the performance of the system. Complex multi-threaded software systems, such as operating systems or database systems, usually have many Heisenbugs.
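The distinction can be illustrated with a small sketch. The Python code below is not taken from any particular software system, and the names `average` and `run_schedule` are hypothetical. `average` contains a Bohrbug: it fails on an empty list every time. `run_schedule` simulates two threads that each perform a non-atomic increment of a shared counter; the interleaving is passed in explicitly so that the timing dependence typical of a Heisenbug can be shown deterministically.

```python
def average(values):
    # Bohrbug: an empty list raises ZeroDivisionError on every single run.
    return sum(values) / len(values)


def run_schedule(schedule):
    """Simulate two threads that each increment a shared counter once.

    Each increment is split into two steps, a read and then a write.
    `schedule` is a sequence of thread ids (0 or 1) giving the order in
    which the four steps execute.
    """
    counter = 0
    last_read = {}             # value each thread read from the counter
    next_step = {0: 0, 1: 0}   # 0 = thread reads next, 1 = thread writes next
    for thread_id in schedule:
        if next_step[thread_id] == 0:
            last_read[thread_id] = counter
        else:
            counter = last_read[thread_id] + 1
        next_step[thread_id] += 1
    return counter


# Same operations, different timing:
serial = run_schedule([0, 0, 1, 1])       # thread 0 finishes before thread 1 starts
interleaved = run_schedule([0, 1, 0, 1])  # both threads read before either writes
```

With the serial schedule both increments take effect and the counter ends at 2. With the interleaved schedule both threads read the old value, so one update is lost and the counter ends at 1. In a real multi-threaded system the schedule is chosen by the scheduler rather than by the caller, which is why such an error may not reappear when the same operations are repeated.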
There are some errors, however, which are not 100 percent reproducible but are not very rare either. If the software system executes the same sequence of operations a sufficient number of times, these software errors are bound to cause a failure or affect the performance of the operations. These errors can be classified as an intermediate type of error, and they are the most common type of error affecting the performance of software systems. The intermediate type errors are the most common for two reasons. First, Bohrbugs are found during the beta-testing of the software system. Second, Heisenbugs are annoying but usually do not cause significant downtime, because they are usually corrected by restarting the software system. Moreover, some software systems, such as database systems, have the capability to automatically restore the software system to a consistent state after encountering an error, which makes Heisenbugs even less likely to cause significant downtime. In fact, if the software system includes several instances running in a cluster, it is possible that a Heisenbug encountered in one instance does not cause any downtime because other instances in the cluster continue to be available.
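The statement that an intermediate type error is bound to surface after enough repetitions can be made concrete under a simplifying assumption that does not appear in the description above: that each execution of the sequence of operations triggers the error independently with a fixed probability p. Under that assumption, the probability of at least one failure in n executions is 1 - (1 - p)^n. The function name `failure_probability` below is hypothetical.

```python
def failure_probability(p, n):
    """Probability of at least one failure in n independent executions.

    Assumes (for illustration only) that every execution fails
    independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# An error that appears in 1 percent of executions becomes more likely
# than not to have appeared at least once after 100 executions.
likely = failure_probability(0.01, 100) > 0.5
```

In this model a Bohrbug corresponds to p equal to 1, a Heisenbug to a very small p, and an intermediate type error to a value of p between the two.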
The nature of the intermediate type of software errors suggests that the majority of these errors come from code that is newly introduced in the software system. These errors are sufficiently reproducible that, by adding one or more diagnostic events and collecting sufficient information, a programmer can determine the cause of the error and fix it. Typically, errors of the intermediate type are easy to fix but hard to find. For this reason, many of these intermediate type errors are fixed in software patches or service packs that are released after the software system has been introduced in the market. In contrast, Heisenbugs are hard to find and hard to fix because there may not be a chance to collect the necessary diagnostic information. Heisenbugs may exist in very old code, and some of these Heisenbugs may go unreported by a user because they have caused a software system failure, or have affected the performance of the software system, just one time.
One currently available approach to correcting software errors of the intermediate type involves the participation of one or more software engineers or customer support personnel. When a user of a software system encounters such an error, the user files a Technical Assistance Request (TAR). A system support engineer, usually employed by the vendor of the software system, processes the TAR and determines that either (1) the error is known and has been fixed in a patch or service pack release for the software system, or (2) the error is not known.
The first case, where the error is known and fixed in a patch release, is obviously the simpler case, but even this case is complex and difficult to resolve. Typically, the user has encountered other problems and has filed multiple TARs. Thus, multiple software engineers or support personnel may become involved before it is determined which of the reported problems is the most serious and is causing the software system to fail. After the support personnel determine which error is affecting the software system or is causing the software system to fail, the user has the choice of either upgrading to the existing patch release, or waiting for a one-off patch that fixes only the reported error. This presents the user with a Hobson's choice because even an upgrade to a patch release is a complex task that may require days of planning. Hence, most of the time, the user has only one realistic option, which is to apply a one-off patch that is already available or is to be made available for the particular software system. Thus, resolving this case easily takes at least a day, during which time the user may continue to experience software system failures.
The second case, where the user reports a previously unknown error, is even more difficult and time-consuming to resolve. Typically, there are several rounds of interaction between the user who encountered the software error and the software support personnel. In each round, diagnostic event settings are suggested to the user, and the user collects the resulting information and sends it to the software support personnel. In some cases, it takes several months for the correct diagnostic event settings to be suggested and the information to be collected. There are many errors and omissions in this process, both by the software support personnel and by the administrator or administrators at the user site. Educated guesswork is heavily used during this trial-and-error process. During this time, the user may suffer significant downtime.
Another currently available approach to resolving intermediate type software errors is to use a separate standby software system that is a mirror of the primary software system. The standby software system could be running an older stable release of the system. The users and applications are failed over to the standby software system when the primary system is unavailable. Since such fail-over represents a significant change for users and applications alike, software systems are switched over in this manner only in the case of major disasters at the primary site or during long periods of planned unavailability (such as during a major upgrade).
Another general approach to resolving intermediate type software errors is to use N versions of the same software system, where each version is written by a different set of software developers. This technique, called N-version programming, is rarely feasible except for the highest-end systems, because of the cost of developing and maintaining N separate versions of the software system.
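A minimal sketch of the idea behind N-version programming follows, assuming the common arrangement in which the independently written versions are run on the same input and the majority result is accepted. The description above does not specify how the versions are combined, so the voting scheme, the function `n_version_vote`, and the three toy versions are all hypothetical.

```python
from collections import Counter


def n_version_vote(versions, *args):
    """Run every version on the same arguments and return the majority result.

    A version that raises an exception casts no vote. Results must be
    hashable so that they can be counted.
    """
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            continue  # a crashed version simply does not vote
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


# Three independently written versions of an absolute-value routine.
def abs_v1(x):
    return abs(x)


def abs_v2(x):
    return x if x >= 0 else -x


def abs_v3(x):
    return x  # faulty version: wrong for negative input


result = n_version_vote([abs_v1, abs_v2, abs_v3], -5)
```

Here the faulty third version is outvoted by the other two, so the combined system returns the correct value. Even this toy case requires three separate implementations of the same function, which is where the cost noted above comes from.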
The currently available approaches for tracking and recording errors affecting the performance of a software system usually involve recording the error information and the system or process state information in operating system trace files. These trace files are not managed and maintained by a centralized system, such as a database system, and thus in order to track an error through the trace files, the files must be correlated manually. Such manual correlation usually requires great skill and expertise on the part of the support engineer investigating the problem.
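To give a sense of what correlating trace files involves, the following sketch merges records from several trace files into one time-ordered list and keeps only those that mention a given error. The record layout (a timestamp, a process id, and a free-text message separated by spaces), the file names, the error tag `ERR-42`, and the functions `parse_record` and `correlate` are all assumptions made for illustration; real trace files vary by system and are rarely this uniform.

```python
import heapq


def parse_record(line, source):
    # Hypothetical layout: "<timestamp> <pid> <message>"
    timestamp, pid, message = line.split(" ", 2)
    return (float(timestamp), source, int(pid), message)


def correlate(traces, keyword):
    """Merge per-file trace records by timestamp and filter by keyword.

    `traces` maps a file name to its lines; each file is assumed to be
    already ordered by timestamp, as an append-only trace file would be.
    """
    streams = [
        [parse_record(line, name) for line in lines]
        for name, lines in traces.items()
    ]
    merged = heapq.merge(*streams)  # lazy k-way merge of sorted streams
    return [record for record in merged if keyword in record[3]]


traces = {
    "proc_100.trc": ["1.0 100 start", "3.0 100 ERR-42 lock timeout"],
    "proc_200.trc": ["2.0 200 ERR-42 waiting on lock", "4.0 200 done"],
}
timeline = correlate(traces, "ERR-42")
```

The resulting `timeline` lists the record from `proc_200.trc` at time 2.0 followed by the record from `proc_100.trc` at time 3.0. In practice the files differ in format, clocks, and identifiers, which is why the manual correlation described above demands considerable expertise.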
Therefore, there is clearly a need for techniques for tracking, diagnosing, and correcting or circumventing software errors that overcome the shortfalls of the approaches described above.