Understanding the cause of a software failure can lead to improved software product reliability. In the past, improving the reliability of software products was done in part by analyzing failure data (sometimes referred to as a crash dump) that a computer system collects with respect to a program failure, e.g., when the program exits unexpectedly, or freezes and has to be manually terminated in an external, atypical way.
In an operating system such as Microsoft® Windows®, the failure data may be categorized to an extent by product into what are referred to as “Watson” buckets (after the DrWatson mechanism, e.g., application, logs and dump files, used for collecting crash dump data). The Watson buckets contain details about user actions, program state and the like that may have led to the crash, including alerts and asserts (exception error messages). A typical approach for product teams is to fix a certain percentage of the bugs corresponding to their Watson buckets before product release.
However, such a straightforward approach does not always lead to improvement in product reliability. For example, a bug that occurs relatively frequently may be fixed with this approach, while a bug that occurs rarely may not be addressed. In general, this approach does not provide the flexibility needed to better understand the overall reliability picture of a software product.