A multitude of studies on the Total Cost of Operation (TCO) show that almost half of TCO, which in turn is five to ten times the purchase price of the system hardware and software, is spent in resolving problems or preparing for imminent problems in the system. See, for example, David A. Wheeler, “Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!”, available at http://www.dwheeler.com/oss_fs_why.html#tco, revised as of Apr. 12, 2007; and Gillen A., Kusnetzky, McLaron S., “The role of Linux in reducing cost of enterprise computing”, IDC white paper, January 2002. Hence, the cost of problem determination and resolution (PDR) represents a substantial part of operational costs.
Making PDR cost effective, for example, through standardization and asset reuse, has not worked in traditional information technology (IT) environments. See, for example, “WebSphere Application Server V6 Problem Determination for Distributed Platforms”, SG24-6798-00, Redbook, 20 Nov. 2005; and “DB2 Warehouse Management: High Availability and Problem Determination Guide”, SG24-6544-00, Redbook, 22 Mar. 2002. Because IT resources are dedicated to a particular customer and that customer's applications, configurations vary widely among IT environments and applications, which makes it difficult to programmatically reuse scripts, workflows, and lessons learned from one environment in another.
Existing art in the area of problem determination and resolution provides methodologies restricted to particular products, such as in “WebSphere Application Server V6 Problem Determination for Distributed Platforms”, SG24-6798-00, Redbook, 20 Nov. 2005; and “DB2 Warehouse Management: High Availability and Problem Determination Guide”, SG24-6544-00, Redbook, 22 Mar. 2002, which provide problem troubleshooting guidance from the developer perspective for WebSphere™ and DB2™, respectively. These guides, although very informative, address only potential problems that were identified in the product pre-production phase and categorized into error codes integrated in the product. They do not consider the historical troubleshooting knowledge related to fixing uncategorized failures in a production environment at a customer's site.
Oren Laadan, Ricardo A. Baratto, Dan B. Phung, Shaya Potter, and Jason Nieh, “DejaView: A Personal Virtual Computer Recorder,” describes a virtual computer recorder that captures the user's computing experience, which the user can play back, search, and browse. The tool records the visual output, the corresponding checkpoint, and the file system state, and allows the user to annotate particular screenshots and system snapshots for future search. The system checkpoints are tied to visual changes (e.g., no checkpoint is taken if the screen does not change), rather than to system changes.
In Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, “Peer Pressure for Automatic Troubleshooting,” Association for Computing Machinery, Inc., June 2004, the PeerPressure troubleshooting system uses statistics from a large sample set of machine configuration snapshots to identify a healthy machine configuration against which to compare a failing one and evaluate its performance. It leverages only the Windows Registry configuration data. The success rate may be reduced due to heterogeneity issues and false positives on healthy snapshots. The present disclosure, on the other hand, uses system snapshots or the like for comparing system checkpoints.
These existing techniques suffer from one or more of the following shortcomings: (i) they address the detection of problems for a particular application or product, and are not applicable when the application or product is part of a complex distributed system; (ii) they do not provide problem resolution, that is, they do not attempt to go beyond problem determination; (iii) they focus on runtime problem determination only, although the error may be an installation error; (iv) they look for changes in a limited set of data rather than in the whole checkpoint snapshot.