The present invention generally relates to debugging software programs and, more specifically, to techniques for debugging database systems.
In a database system, an area of system memory is allocated and one or more processes are started to execute one or more transactions. The database server communicates with connected user processes and performs tasks on behalf of the user. These tasks typically include the execution of transactions. The combination of the allocated system memory and the processes executing transactions is commonly termed a database “server” or “instance”.
Like most software systems, a database server has complicated shared memory structures. A shared memory structure contains data and control information for a portion of a database system. Because of software, hardware, or firmware bugs that may exist in a complex database system, shared memory structures may become logically incorrect. When structures become logically incorrect, the database is likely to fail. Database failure is typically discovered in the following ways: by checking consistency of structures; by verifying certain assumptions; or by running into corrupted pointers. Attempting to process corrupted pointers will lead to a “crash,” where normal database operation is no longer possible.
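By way of illustration only, the discovery mechanisms described above (consistency checks, verification of assumptions, and detection of corrupted pointers) can be sketched as follows. The `Node` type, the `check_structure` function, and the doubly linked layout are hypothetical and are chosen only to make the idea concrete; they are not drawn from any particular database system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """One entry of a hypothetical doubly linked shared structure."""
    next: Optional[int]  # key of the following node, or None at the tail
    prev: Optional[int]  # key of the preceding node, or None at the head


def check_structure(nodes: Dict[int, Node], head: Optional[int]) -> List[str]:
    """Walk the structure from head and report every problem found.

    An empty list means that no inconsistency was detected.
    """
    problems: List[str] = []
    seen = set()
    expected_prev: Optional[int] = None
    current = head
    while current is not None:
        # Corrupted pointer: the link refers to something that is not a node.
        if current not in nodes:
            problems.append(f"corrupted pointer: {current} does not refer to a node")
            break
        # Violated assumption: the structure is expected to be acyclic.
        if current in seen:
            problems.append(f"cycle detected at node {current}")
            break
        seen.add(current)
        node = nodes[current]
        # Consistency check: the back link must match the node just visited.
        if node.prev != expected_prev:
            problems.append(f"inconsistent back link at node {current}")
        expected_prev = current
        current = node.next
    return problems
```

In this sketch a non-empty result would correspond to a detected failure event; a real server would perform comparable checks on its own structures.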
A major responsibility of the database administrator is to be prepared for the possibility of hardware, software, network, process, or system failure. When shared structures are presumed to be corrupted, the best course of action for a database administrator is to cease further processing of the database. If a failure occurs such that the operation of a database system is affected, the administrator must usually recover the database and return the database to normal operations as quickly as possible. Recovery should protect the database and associated users from unnecessary problems and avoid or reduce the possibility of having to duplicate work manually.
Recovery processes vary depending on the type of failure that occurred, the structures affected, and the type of recovery that is performed. If no files are lost or damaged, recovery may amount to no more than rebooting the database system. On the other hand, if data has been lost, recovery requires additional steps in order to put the database back into normal working order.
Once the database is recovered or rebooted, the immediate problem is quickly resolved, but because the root cause is still undetermined and therefore unresolved, the error condition may resurface, potentially causing several additional outages. Therefore, it is still important to diagnose the state of the structures and data surrounding the database failure. Such a diagnosis may provide valuable information that can reduce the chance of failure in the future. As a practical matter, diagnosing the failure may lead to determining which vendor's hardware or software is responsible for the database failure. Such information is valuable for a vendor's peace of mind, if nothing else. Thus, competing with the goal of recovering the database as quickly as possible is the goal of determining why the database system failed in the first place.
Unfortunately, even with traditional techniques of diagnosing a database failure, the system administrator is usually unable to obtain enough clues to determine why the failure happened. A deliberate and thorough diagnosis of the failure may require an unacceptable amount of database downtime. For example, any amount of downtime over 30 minutes may be extremely costly for a database that is associated with a highly active web site. Too much downtime may have unduly expensive business ramifications, such as lost revenue and damage to the reputation of the web site owner.
Traditional debugging techniques involve formatting certain parts of the database system and displaying the formatted portion in a human-readable form. This human-readable form can be set aside for later analysis, for example, after the database has been recovered or is no longer down. The entire memory of the database server is not dumped because an average database server is very large, typically between about 200 megabytes and about 100 gigabytes of unformatted binary and data. The portion that is dumped and formatted reflects an educated guess as to which key data structures are potential causes of the problem.
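As a minimal sketch of the traditional approach, the following example formats a selected region of a memory image as hexadecimal and printable text. The function name `format_region` and its parameters are hypothetical, and a plain hex dump is only one of many possible human-readable forms.

```python
def format_region(memory: bytes, start: int, length: int, width: int = 16) -> str:
    """Render memory[start:start + length] as address, hex bytes, and text."""
    region = memory[start:start + length]
    pad = width * 3 - 1  # column width of a full row of hex bytes
    lines = []
    for offset in range(0, len(region), width):
        chunk = region[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as dots in the text column.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start + offset:08x}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(lines)
```

Only the requested region is formatted, which mirrors the limitation described above: whatever lies outside the guessed region is not available for later analysis.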
For the foregoing reasons, what is needed is a method of debugging a software program, such as a database system, that can be performed in a manner that requires minimal downtime, yet allows for a comprehensive assessment of a failure.
In one embodiment, the method of debugging a software program comprises preserving a memory state of a portion of the software program, such as a database system. The memory state is preserved when a failure event is detected in the software program. The preserved memory state portion of the software program is extracted and stored in a storage medium for deferred analysis. Normal database operations are resumed as soon as the memory state is preserved. The deferred analysis is performed by starting a new database instance corresponding to the preserved memory state portion and using the new database instance to extract information for high-level debugging of the software program. Thus, where downtime of a software program must be kept to a minimum, the present invention provides techniques for performing quick diagnostics of the software program.
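For illustration only, the sequence described in this embodiment (preserve the memory state on failure, store it, resume normal operation, and later start a new instance from the preserved state) can be sketched as follows. The `ToyServer` class, the JSON file format, and all function names are assumptions made for this sketch; an actual database instance would preserve raw shared memory rather than a small dictionary.

```python
import json
from pathlib import Path


class ToyServer:
    """Hypothetical stand-in for a database instance and its shared memory."""

    def __init__(self) -> None:
        self.shared = {"sessions": 0, "buffer": []}
        self.running = True

    def failure_detected(self) -> bool:
        # Assumed invariant for the sketch: the session count is never negative.
        return self.shared["sessions"] < 0


def preserve_state(server: ToyServer, path: Path) -> None:
    """Extract the memory state and store it for deferred analysis."""
    path.write_text(json.dumps(server.shared))


def handle_failure(server: ToyServer, path: Path) -> None:
    """Preserve the state, then resume normal operation immediately."""
    server.running = False
    preserve_state(server, path)
    server.shared = {"sessions": 0, "buffer": []}  # fresh state after restart
    server.running = True


def start_analysis_instance(path: Path) -> ToyServer:
    """Start a separate instance over the preserved state for debugging."""
    instance = ToyServer()
    instance.shared = json.loads(path.read_text())
    instance.running = False  # the analysis instance serves no user requests
    return instance
```

In this sketch the production server is unavailable only for the duration of `handle_failure`, while the instance returned by `start_analysis_instance` can be inspected at leisure.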