In a computer (e.g., personal computer (PC) or the like), the abnormal termination of a software process by either the operating system (OS) or an end user indicates the possibility of a defect (bug) in the software. Software typically contains a number of bugs classifiable into two general categories: crashes and hangs. Among the chief concerns for program developers has always been identifying software defects that cause computers to crash. Software crashes are fatal system errors, which usually result in the abnormal termination of a program by a kernel or system thread. Normally, when a crash-causing bug is discovered, the software provider obtains diagnostic data, attempts to reproduce the error, and, depending on the severity of the bug, creates and distributes a fix for the bug.
One way of diagnosing crash-inducing bugs involves examining a log file containing diagnostic data such as commands, events, instructions, program error number, computer processor type, and/or other pertinent diagnostic information. The log file typically is generated right after a crash has been detected. For example, a PC running the Microsoft® Windows operating system loads Watson, a debugging tool that monitors running processes and logs useful diagnostic data when a crash is detected. After a crash, the Watson log file may be sent to the software provider for analysis. In some cases, a log file does not contain enough information to diagnose a problem, and a crash dump may be required to troubleshoot it. A crash dump is generated when the physical contents of memory are written to a predetermined file location. The resulting file is a binary file. Analyzing crash dumps is more complex than analyzing log files because the binary file usually needs to be loaded into a debugger and manually traversed by a troubleshooter.
In an effort to more effectively troubleshoot bugs, some software providers attempt to perform varying degrees of computerized analysis on log and crash files. For example, Microsoft has introduced its Online Crash Analysis (OCA) engine to automate the process of troubleshooting crashes. The OCA engine allows users to submit, through a web browser, a crash log or a crash mini-dump file to Microsoft. The analysis engine compares data from the uploaded file to a database of known issues. If the bug is known and a patch or workaround is available, the user is notified of the solution. Otherwise, the uploaded file is used by troubleshooters to diagnose the bug.
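By way of illustration only, the comparison of uploaded crash data against a database of known issues may be sketched as a simple lookup. The sketch below is not the OCA engine's actual implementation. The names CrashSignature, KNOWN_ISSUES, and lookup_fix, the module name, and the fix text are all hypothetical and are assumed solely for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrashSignature:
    """Hypothetical key extracted from a crash log or mini-dump."""
    module: str       # module in which the fault occurred
    error_code: int   # program error number from the log


# Hypothetical known-issue database: signature -> suggested fix.
KNOWN_ISSUES = {
    CrashSignature("example.dll", 0xC0000005): "Apply hypothetical patch P-001",
}


def lookup_fix(signature: CrashSignature) -> Optional[str]:
    """Return a known fix, or None so the report goes to a troubleshooter."""
    return KNOWN_ISSUES.get(signature)
```

A return value of None corresponds to the case in which the bug is not yet known and the uploaded file is passed on for manual diagnosis.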
A problem with all of the above-mentioned troubleshooting techniques is that they attempt to diagnose crashes only, overlooking hangs, the second major class of bugs. Moreover, these approaches rely heavily on manual analysis of bugs and require the user to send in a report to the software provider, where most of the analysis is performed, wasting the software provider's resources.
In reality, many reported bugs are related to hangs. However, software providers typically expend their debugging efforts fixing crash-inducing bugs, even though, to end users, crashes and hangs often appear to be the same thing. A software hang occurs when a piece of software appears to stop responding or when a software thread looks inactive. Hangs often result in the abnormal termination of a recoverable software process by the end user. Abnormal termination of software by any means, including user-induced termination, may indicate the presence of a bug in the software. For example, a piece of software may normally take 10 or 15 seconds to paint a user interface, but under a given set of circumstances, the user interface thread may call an API that takes a long time to return or, alternatively, the user interface thread may make a network call that requires a response before painting the user interface. Thus, painting the user interface in this instance may take an abnormally long 50 or 60 seconds. Because of the abnormal delay, a user may become frustrated and manually terminate the application after 20 seconds. The fact that the user interface became unresponsive, in this instance, is a bug because it caused the user to abnormally terminate the software.
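By way of illustration only, an unresponsive user interface thread of the kind described above may be detected with a heartbeat check, in which the thread records a timestamp each time it makes progress and an observer treats a stale timestamp as a possible hang. This is a minimal sketch and not a method taken from the techniques discussed above. The class name HangWatchdog, its methods, and the 20-second threshold are assumptions chosen to mirror the example.

```python
import threading
import time
from typing import Callable


class HangWatchdog:
    """Hypothetical heartbeat monitor for a single thread."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock            # injectable so the check can be tested
        self._lock = threading.Lock()
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the monitored thread whenever it makes progress."""
        with self._lock:
            self._last_beat = self._clock()

    def is_hung(self) -> bool:
        """True if no heartbeat has arrived within the timeout."""
        with self._lock:
            return (self._clock() - self._last_beat) > self._timeout_s
```

In use, the user interface thread would call beat() on each pass through its message loop, and a separate thread would poll is_hung(). A threshold of 20 seconds corresponds to the point at which the user in the example gives up.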
Another example of a hang involves a scenario where a software application crashes because of an error in a related dynamic link library (.DLL) file. In this scenario, at the time of the crash, the software application has acquired certain system resources, such as file handles and critical sections, which are not released after the crash. Other threads need access to those acquired resources, but cannot gain access to them because the resources are still marked as locked by the crashed thread. Because of the lock, the other running threads hang. The fact that other threads hung indicates a bug that may need to be diagnosed and fixed.
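By way of illustration only, the scenario above may be reproduced in miniature with a lock that a failing thread acquires and never releases. The sketch uses Python's threading.Lock as a stand-in for a critical section. The function names faulty_worker and try_use_resource are hypothetical, and the "crash" is simulated by the thread simply exiting while it still holds the lock.

```python
import threading


def faulty_worker(lock: threading.Lock) -> None:
    """Acquire the resource, then exit without releasing it (simulated crash)."""
    lock.acquire()
    # The thread ends here; the lock stays marked as held.


def try_use_resource(lock: threading.Lock, timeout_s: float) -> bool:
    """Attempt to use the resource, giving up after timeout_s seconds.

    Without the timeout, this call would block forever, i.e. hang.
    """
    acquired = lock.acquire(timeout=timeout_s)
    if acquired:
        lock.release()
    return acquired
```

A caller that receives False after the timeout has observed the symptom described above: the resource is still locked by a thread that no longer exists.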
One of the difficulties software providers encounter when troubleshooting hangs is that they are hard to identify, diagnose, and reproduce. For example, hangs are usually not as dramatic as crashes: there may not be an obvious “blue screen of death”-type response by a computer to indicate a bug, so users are less likely to report the error. Moreover, crashes are easier to diagnose since they tend to occur after a specific instruction or event has been issued. In contrast, identifying the offending instruction or block of code in a hang may be more difficult, since the bug could be related to another piece of software, to a specific environment on a PC, to an impatient user, or to any number of other issues. Thus, software providers often do not emphasize hangs when fixing bugs.
Therefore, there exists a need for tools to troubleshoot hangs. More specifically, there exists a need for automating the process of diagnosing and troubleshooting software hangs. There also exists a need for client-side tools to aid in the diagnosis of bugs in order to free software provider resources.