1. Field of the Invention
The present invention relates, in general, to systems and methods for identifying and resolving problems in computer system software and hardware, and more particularly, to an automated service tool or guru device and method for processing kernel and user core files and other data sources proactively and reactively to identify possible computer system problems or bugs and identify remedial actions to address the identified problem.
2. Relevant Background
Computer system designers and analysts face the ongoing and often difficult task of determining how to fix or improve operation of a computer system that has experienced an unexpected exception or is failing to operate as designed (e.g., is experiencing errors caused by software problems or “bugs”). When a problem or bug in the computer system software is serious enough to stop or interrupt the execution of a running program, this failure is known as a crash. To assist in identifying bugs in the software operating on a computer system, software applications are often configured to create a crash dump or memory dump when an unexpected exception occurs to generate a memory image of the existing state of software executing on the system at the time of the crash or exception. These memory images are sometimes called core files (or dump files).
The system-level commands or programs in the operating system, i.e., the kernel software, are of particular interest to system analysts in correcting bugs in a crashed computer system. For example, in a UNIX®-based system, the kernel is the program that contains the device drivers, the memory management routines, the scheduler, and system calls. Often, fixing bugs begins with analysis of these executables, which have their state stored in a kernel core file. Similarly, user programs or binaries (e.g., binary, machine readable forms of programs that have been compiled or assembled) can have their state stored in user core files for later use in identifying the bugs causing the user applications to crash or run ineffectively.
Instead of writing a new, complete replacement version of the software (that crashed or had bugs), the designer or developer often prepares one or more small additions or fixes to the original software code (i.e., patches) written to correct specific bugs. For example, when a specific bug is identified, a patch is written or obtained from a third party to correct the specific problem and the patch is installed on the computer system. A single patch often contains fixes for many bugs for convenience. However, a particular bug is usually, but not always, fixed by a single patch (i.e., multiple patches usually do not address the same bugs). Typically, system analysts or operators keep or acquire records of previously identified bugs and corresponding patches installed for each identified bug. Then, when a bug is encountered in a system, the system analyst's efforts to fix the problem begin with a search of these records of prior bugs to identify the bug or find a similar, previously identified bug. Once the bug is identified, a relevant patch is selected that may correct the problem, or a new patch may be written similar to or based on the previous patch. Additionally, the analyst may determine if a newer version of the patch is now available.
For example, a bug may be identified that causes an exception, such as causing the computer system to fall into panic when two specific programs are run concurrently. A record of the bug would then be created and stored in a database including a bug identifier (e.g., alpha-numeric identification code) along with descriptive information such as a synopsis describing the problem (for the above example, “system falls into panic while shutdown procedure is executed during writing”) and information describing the results or symptoms of the bug (e.g., a crash, hang, stack trace, type of panic, and the like). Once a fix for the bug is available, a patch may be created containing the bug fix and other bug fixes. A patch record is associated with each patch. The patch record includes identifying information such as a patch identifier (e.g., an alpha-numeric code), references to corrected or addressed bugs, textual description of the purposes of the patch, references to specific software useful with the patch (e.g., a specific user application, kernel software for specific operating systems, and the like), dependent packages, related patches, and other useful identifying and patch-user information.
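The bug and patch records described above, and the lookup an analyst performs against them, can be illustrated with a minimal sketch. All class, field, and function names here are hypothetical and chosen only for illustration, and the integer `revision` field is an assumption used to model a "newer version" of a patch; none of this reflects the layout of any actual bug or patch database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BugRecord:
    """Hypothetical bug record: identifier, synopsis, and observed symptoms."""
    bug_id: str                                         # alphanumeric identification code
    synopsis: str                                       # short description of the problem
    symptoms: List[str] = field(default_factory=list)   # e.g. "panic", "hang"


@dataclass
class PatchRecord:
    """Hypothetical patch record with the fields named in the text."""
    patch_id: str                                        # alphanumeric patch identifier
    revision: int                                        # assumed version counter
    fixed_bugs: List[str] = field(default_factory=list)  # bug_ids the patch addresses
    description: str = ""                                # purpose of the patch
    applies_to: List[str] = field(default_factory=list)  # software the patch is used with
    depends_on: List[str] = field(default_factory=list)  # dependent packages
    related: List[str] = field(default_factory=list)     # related patches


def find_bugs_by_symptom(bugs: List[BugRecord], symptom: str) -> List[BugRecord]:
    """Return the bug records that list the given symptom."""
    return [b for b in bugs if symptom in b.symptoms]


def latest_patches_for_bug(patches: List[PatchRecord], bug_id: str) -> List[PatchRecord]:
    """Return, per patch_id, the highest revision that references bug_id."""
    latest: Dict[str, PatchRecord] = {}
    for p in patches:
        if bug_id not in p.fixed_bugs:
            continue
        current = latest.get(p.patch_id)
        if current is None or p.revision > current.revision:
            latest[p.patch_id] = p
    # Sort by patch_id so the result order is deterministic.
    return sorted(latest.values(), key=lambda p: p.patch_id)
```

In this sketch, an analyst would first narrow the bug records by symptom, then request the newest revision of each patch that references the chosen bug. Even this simplified lookup still depends on the analyst choosing the right symptom and the right bug by hand.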
While providing useful information to a system analyst, these bug and patch files usually grow to a very large, unmanageable size (e.g., 500,000 or more bug entries for widely used operating systems and networks), and the amount of data in these files continues to grow as new bugs and patches are identified, created, and installed. Hence, identifying appropriate patches for an identified bug is difficult, and system analysts often resort to making educated guesses when searching these lengthy patch records.
Existing methods for identifying appropriate patches to correct bugs typically require users to provide important input or make critical choices and do not meet the needs of system analysts. System analysis methods and tools are typically fully or partially manual “search” processes involving manually entering search terms to process the large patch record lists, identifying potentially relevant patches, and then manually selecting one or more patches for installation. The existing systems are heavily interactive and require the system analyst to provide a relatively large amount of knowledge to obtain good results. For example, some system analysis tools require a user to select which problem analysis or resolution tool to use and to select which databases to search. The effectiveness of such a tool is tied to the ability of the user to search a database containing a subset of possible problems with appropriate search terms. When a list of bugs or patches is obtained, the user must again manually select, based on experience, the correct problem and a useful fix for the selected problem. Clearly, the existing “search” systems allow human error to become a problem and are inherently labor-intensive.
In addition, the first step of analyzing a resulting core file to accurately identify a bug causing the problem is an even more difficult task than the above “searching” processes. The core file analysis tools presently available are typically only useful for kernel core files and are difficult to use effectively (e.g., they require extensive training and knowledge of the system being analyzed, which often can be gained only through years of working experience). The tools are generally only used reactively, i.e., once a problem occurs, and are interactive with the user, i.e., are manual, not automatic, tools. Again, these tools are often ineffective as human error can result in an incorrect or inefficient remedy being recommended to correct the computer system operating problems.
Often, the operator is unable to identify a single, specific patch for the problem and is forced to install numerous patches to increase the likelihood that the bug will be corrected. This inaccurate “over” patching is often time-consuming, costly, and disruptive to the computer system, which may not be acceptable to users of the system. Additionally, some patches are not effective or are counterproductive when installed with other patches. Further, some patch tools are available to identify patches that are installed on the computer system and for which new versions are available (in many systems, hundreds of patches at any given time), but these tools do not assist in identifying a particular patch for correcting an identified bug.
In addition, problems that a computer system may encounter are not limited to just software bugs. The problems may include hardware problems, configuration specific issues (hardware or software), performance problems, security issues, firmware bugs, availability issues, functionality problems, and other problems. These problems often have workarounds or procedures that operators need to be aware of and to act on.
Hence, there remains a need for an improved method and system for identifying and resolving current and potential computer system problems of all types. Such a method and system preferably would be configured to be used online and offline and require little or no operator training. Further, the method and system preferably would be useful as a planning tool such as by providing proactive analysis of computer systems.