Many complex applications in use today rely on multiple software components communicating with one another to provide desired functionality. Tasks carried out by a computer program may be divided and distributed among multiple software components. For example, multiple processes running on a single computer, or on multiple computers in electronic communication with each other, may each carry out a portion of a task. The classic multi-tiered web application architecture is one such arrangement, in which multiple programs (or processes) on multiple computer systems work cooperatively to carry out a task. Each process may include a plurality of threads, each thread being a stream of instructions executed by a computer processor. A process can have one or more threads executing in a common virtual address space. Each thread may have multiple subcomponents, such as executable objects (in object-oriented programming), subroutines, functions, etc.
Each component, which may be a separate program, a thread, a library or application programming interface (API), executable object, subroutine, or function, etc., is typically called by a calling component to perform some task, and the called component may itself rely on additional called components to perform certain subtasks. For example, an application may need to read a configuration file, and call a “file-read” component that opens the configuration file and reads its contents into a data buffer. The file-read component, in turn, may call another process, e.g., via an API provided by the operating system (OS), to open the file. If the file does not exist, the file-open component of the OS may return an error code to the “file-read” component, which may then return a corresponding error code to the parent application component. In some cases, this chain of call commands, or “call stack,” can be very long, spanning multiple threads and even, through remote procedure calls (RPCs), web service requests, etc., multiple computer systems. Identifying the root causes of errors in these long call stacks can be very difficult for users such as developers, system administrators, and end users.
The poor quality of error logs and messages is a persistent problem for users, especially in the case of large or distributed programs having multiple parts distributed across multiple processes in a single system or across a network of physical or virtual computer systems. Typical error reporting in such systems may be vague or misleading, only describing a small part of the error phenomenon observed. In general, error messages fail to identify or make apparent the root cause of the problem and do not provide any remediation steps.
One common cause for poor error reporting may be referred to as translation loss, which occurs as an error is repeated up the call stack. For example, suppose a first component calls a second component that calls a third component. The third component returns an error code to the second component indicating a specific problem that arose, preventing it from completing its task. The second component receives the error code of the third component, and due to the failure of the third component, cannot complete its own task and therefore returns an error message to the first component, perhaps indicating a failure of the second component but not retaining the specific problem provided by the error code of the third component. Therefore, the specific failure known at the lower levels of the chain of components is lost as the return codes are translated and passed up the chain. At the highest levels of the chain, the error message may be so general as to provide no useful or actionable information.
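The translation loss described above can be illustrated with a minimal sketch. The function names, the file path, and the GENERIC_FAILURE code are hypothetical and chosen only for illustration; the third component is simulated and does not touch the filesystem.

```python
import errno

# Hypothetical generic failure code used by the middle layer (an assumption).
GENERIC_FAILURE = -1


def third_component(path):
    """Lowest layer: simulates a failure and reports a specific problem."""
    return errno.ENOENT  # "no such file or directory"


def second_component(path):
    """Middle layer: collapses any lower-level failure into one generic code."""
    code = third_component(path)
    if code != 0:
        return GENERIC_FAILURE  # the specific ENOENT is discarded here
    return 0


def first_component(path):
    """Top layer: can only report what the middle layer returned."""
    code = second_component(path)
    if code != 0:
        return f"operation failed (code {code})"
    return "ok"


message = first_component("/etc/example.conf")
print(message)
```

In this sketch the top-level message carries only the generic code; neither the specific error code of the third component nor the file involved survives the trip up the chain.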
Another common cause is lack (or loss) of instance information. Instance information is the specific data or parameters being passed to the component or accessed by the component when the error occurred. The instance information can also include context or state information of the component at the time the error occurred. For example, if the error was a connection failure, then instance data may include what entities were being connected and network parameters used in the connection attempt. Typical error reporting schemes do not retain such instance data, which would generally be very helpful in tracking down and correcting errors.
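As a rough illustration of what instance information might look like for the connection-failure example, the following sketch contrasts a bare error code with a record that keeps the parameters of the failed attempt. The ErrorRecord type, its field names, and the host name are assumptions made for this sketch and do not correspond to any particular reporting scheme; the connection failure is simulated.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorRecord:
    """Hypothetical error record that keeps instance data with the code."""
    code: str
    instance: dict = field(default_factory=dict)


def connect(host, port, timeout_s):
    """Simulated connection attempt that always fails (no network access)."""
    return ErrorRecord(
        code="CONNECTION_FAILED",
        instance={"host": host, "port": port, "timeout_s": timeout_s},
    )


# What a typical scheme retains: the code alone.
bare = "CONNECTION_FAILED"

# What would help in tracking down the error: the code plus instance data.
rich = connect("db.example.internal", 5432, 3.0)
print(bare)
print(rich.code, rich.instance)
```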
Another problem is the lack of a global view of the error. Even if the user knew which component first observed the error and all the instance data surrounding it, this information may still not be useful without also knowing, for example, why that component was called in the first place, i.e., what the higher-level components were and what they knew. That is, knowing that a particular port failed may not be helpful without also knowing why the attempt to open the port was made in the first place or to whom the port belonged. The higher-level components may have this information, but correlating the higher-level information with the lower-level information has not been possible, particularly when the higher-level information is held by different threads running on possibly different physical computer systems.
Another problem is the over-reporting of errors. For example, a result may be an “error” at one layer of the system but may not be an error at another layer. For instance, the filesystem of a kernel failing to find a file during an “open” call would be considered an error for the “file-open” function, but if the user-level application expects the open to fail in some cases, such as for an optional configuration file, then the open failure is not considered an error by the user-level application. This makes planning for and potentially enumerating all error messages up front very difficult, because the structure of the software code greatly affects where in the code error messages should be generated.
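The optional configuration file case can be sketched as follows. The function name load_optional_config and the file name are hypothetical; the sketch only shows that the same failed open is an error for the lower layer and an expected outcome for the caller.

```python
import os
import tempfile


def load_optional_config(path, defaults):
    """Return the file contents, or the defaults if the file is absent."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        # The open call failed, which is an error at the lower layer,
        # but the application expects this and simply uses the defaults.
        return defaults


# A freshly created temporary directory is empty, so the file is missing.
with tempfile.TemporaryDirectory() as empty_dir:
    settings = load_optional_config(
        os.path.join(empty_dir, "optional.conf"), "defaults"
    )

print(settings)
```

If the lower layer had already emitted an error message for the failed open, that message would be noise from the point of view of this caller.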
A number of methods of creating and managing error messages are known, some of which attempt to overcome the aforementioned difficulties. The first (in no particular order) and most basic method of producing error messages is to create a static mapping between error code and error message. This is done in Unix environments through the “errno” variable, which holds the error code set by the most recent failed system call. The errno value can then be converted by a mapping to a text string, which can be displayed to the user. Windows has a similar mechanism for converting standard error codes to text strings. However, the set of error codes is typically small, so no specifics about the particular error are given. Instead, just a general category of the error is provided.
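The static mapping can be seen through Python's standard os.strerror function, which converts an errno value to its text string. The helper names and file names below are hypothetical; the sketch attempts to open two different missing files and shows that the resulting message is the same for both, since the mapping knows only the code.

```python
import errno
import os
import tempfile


def message_for(code):
    """Static mapping from an error code to a text string."""
    return os.strerror(code)


def try_open(path):
    """Return 0 on success, or the errno value of the failure."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return 0


# A freshly created temporary directory is empty, so both opens fail.
with tempfile.TemporaryDirectory() as empty_dir:
    code_a = try_open(os.path.join(empty_dir, "app.conf"))
    code_b = try_open(os.path.join(empty_dir, "user.conf"))

message_a = message_for(code_a)
message_b = message_for(code_b)
print(message_a)
print(message_b)
```

Both messages name the same general category of error, and neither says which file was missing.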
The second error reporting scheme involves maintaining a central authority for error message (and even error code) creation. In this scenario, a single “error code book” is kept that maps each error code to a detailed description and remediation steps. However, unless extensive efforts are undertaken, this method often yields error messages that are too generic to be useful, and it carries a high maintenance overhead.
Third, attempts have been made to link a software crash to a knowledge base (KB) article through the use of the symbolic backtrace of the crash. The symbolic backtrace includes function addresses and arguments stored on the stack. However, this approach is only useful if a crash occurs, and then only if there is already a KB mapping in place. Without the KB article, it is very difficult for the average user to glean information from the symbolic backtrace. There is no easily accessible instance information included in the symbolic backtrace, which means that the KB article has to be somewhat generic. Also, symbolic backtraces for the same root cause may differ slightly (e.g., may have an extra function or two in them), meaning that a given backtrace may not be matched successfully to a KB article that describes its root cause even if such an article exists. Symbolic backtraces also easily become obsolete, e.g., when a new version of a particular application or subcomponent (such as a dynamically linked library) is released.
Fourth, some companies build applications that analyze log output from an application and try to correlate and derive information from those logs. Two examples of these are Splunk™ and EMC's Smarts™. Splunk™ has a generic framework for analyzing text log output and specific plug-ins for various software applications. It uses regular expressions and a rule engine to understand the text log output, add that data into its framework, and cross-reference it, at which point the user can search for information in it. Smarts™ polls error paths to try to track errors back to their root cause. In general, these applications are built to compensate for deficiencies in the observed application. In addition, there are commonly many application-specific error conditions that the developers of that application know about but do not expose externally in any way, thus limiting the ability of these tools.
Fifth, in some approaches, human-readable error messages are collected on a stack of messages that is eventually given to the user, or cleared if a function handles the error. The set of error messages is displayed to the user, who must determine the root cause from the messages displayed.
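A minimal sketch of such a message stack follows. The MessageStack class, the three layered functions, and the message texts are hypothetical, and the lowest layer simulates a failure rather than opening a real file.

```python
class MessageStack:
    """Collects human-readable messages as an error travels up the chain."""

    def __init__(self):
        self._messages = []

    def push(self, text):
        self._messages.append(text)

    def clear(self):
        self._messages.clear()

    def dump(self):
        return list(self._messages)


def open_file(stack, path):
    """Lowest layer: simulated failure."""
    stack.push(f"could not open {path}")
    return False


def read_config(stack, path):
    """Middle layer: adds its own message on failure."""
    if not open_file(stack, path):
        stack.push("could not read configuration")
        return False
    return True


def start_app(stack, path, optional):
    """Top layer: clears the stack if it handles the error itself."""
    if not read_config(stack, path):
        if optional:
            stack.clear()  # the error is handled, so nothing is shown
            return True
        stack.push("application startup failed")
        return False
    return True


reported = MessageStack()
started = start_app(reported, "app.conf", optional=False)
handled = MessageStack()
started_optional = start_app(handled, "app.conf", optional=True)
print(reported.dump())
print(handled.dump())
```

In the unhandled case the user receives all three messages and is left to work out which one points at the root cause.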
Faced with the limitations of current approaches, users typically rely on a number of external factors. They may contact the technical support of the application provider, possibly incurring additional cost to the provider. User group sites are set up on the Internet to share ad hoc solutions to common problems. Intra-company best practices are set up in advance to try to predict the possible problems so that there are ready solutions. In short, software users spend an inordinate amount of time and effort to compensate for the low quality of error messages in software applications today.