From their inception, computers have been complicated machines, involving many parts and circuits that must each operate and interact properly for the computer to function as designed. As new and better computing technologies arise and more complex programs and programming techniques are employed, the complexity of computing systems continues to increase. Moreover, all facets of society and commerce increasingly rely on computing technology in mission-critical scenarios. Thus, the complexity of computing systems and the stakes for failure of those systems continue to increase together, compounding both the probability and the impact of any errors in hardware or coding.
The development of software is a process that typically requires extensive human interaction. Thus, the potential for errors to be introduced is significant. While many of these errors can be identified and corrected before the software of interest is distributed, this is not always possible. Some errors occur only under very specific conditions that the developer of the software has not tested. For example, once a piece of software is distributed to the public, it will be combined with an essentially endless and unforeseeable variety of computer hardware and other programs. The program of interest may interact with any of these components in an unforeseen way.
Due to the complexities of software and the interactions between software entities and/or hardware, it is often prohibitively costly to investigate every error that occurs on a user's computer system. However, if a significant number of users experience a particular error, the amount of user dissatisfaction due to that error justifies an investigation of the problem, regardless of whether the problem is due to the developer's code itself or to an interaction with another party's code or hardware. Thus, it is important for the developer to be able to determine when errors have occurred, and to be able to classify those errors so that repeated occurrences of a specific problem can be recognized.
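By way of illustration only, the classification of errors into repeated occurrences of a specific problem can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the report layout (a "type" string and a list of "frames", innermost first), the function names, the choice of the top three frames, and the use of a SHA-1 digest as the bucket key are all hypothetical.

```python
import hashlib
from collections import Counter


def error_signature(exception_type, frames, depth=3):
    """Build a bucket key from the error type and the innermost stack frames.

    Assumption: two reports with the same type and the same top `depth`
    frames are treated as occurrences of the same underlying problem.
    """
    raw = exception_type + "|" + "|".join(frames[:depth])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]


def bucket_reports(reports):
    """Count how many reports fall into each signature bucket."""
    counts = Counter()
    for report in reports:
        counts[error_signature(report["type"], report["frames"])] += 1
    return counts


def worth_investigating(counts, threshold):
    """Return the signatures reported at least `threshold` times."""
    return sorted(sig for sig, n in counts.items() if n >= threshold)
```

Under these assumptions, a developer could investigate only the buckets that `worth_investigating` returns, rather than every individual report.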
A number of standard approaches exist to verify that a computer process or application (typically consisting of a single process) is operating as expected or, more usually, to diagnose failures to operate as expected. These include, among other techniques: (1) use of live debugging or in-circuit emulation to trap execution when certain conditions are met; (2) use of instructions embedded in code (e.g., assertions or other instrumentation) to trace execution; (3) profiling or otherwise tracing the execution of the process's threads; and (4) recording a dump of process memory, including call stacks, for subsequent analysis. Such solutions usually have significant drawbacks, however, such as requiring changes to the code (instrumentation); requiring diagnostic personnel to be available and on-site when a problem occurs (live debugging); seriously degrading computer performance (profiling, extensive logging, or dumping memory); and/or requiring computer users to send large quantities of data back to the vendor (memory dumps). In addition to these shortcomings, techniques that require changes to the code are impractical for vendors who did not develop the code under analysis.
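As a hedged illustration of technique (2), and of why it requires changes to the code, the following minimal sketch embeds a check in an application function and records the call stack when the check fails. The names `check` and `average` and the layout of the log record are hypothetical; only the standard `traceback.extract_stack` call is assumed.

```python
import traceback


def check(condition, message, log):
    """Embedded instrumentation: record a failed expectation with its call stack."""
    if condition:
        return True
    # extract_stack() lists frames oldest-first; drop the last one (this function).
    stack = [frame.name for frame in traceback.extract_stack()[:-1]]
    log.append({"message": message, "stack": stack})
    return False


def average(values, log):
    """Application code that had to be modified to carry the embedded check."""
    if not check(len(values) > 0, "average() called with no values", log):
        return 0.0
    return sum(values) / len(values)
```

Note that `average` itself had to be edited to call `check`, which is exactly the kind of source change that is unavailable to a vendor who did not develop the code under analysis.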