Complex distributed systems and applications include software running on multiple servers and may include integrated products or software libraries from multiple vendors. Components of the system are located on networked computers that interact and coordinate their actions to work properly. The system may also interact with third-party systems to complete certain transactions.
For example, consider an online shopping application where a customer selects items to purchase and initiates a “submit order” action to buy the items. To the customer, the process appears simple, but behind the scenes it can be very complex. Many function calls may be made to third-party systems and to different types of systems built on different technologies. There may be a function call to a credit card system to verify the customer's credit card and available funds, a call to an inventory database to verify that the purchased items are in stock, a call to a shipping system to get shipping information for delivering the purchased items, and other calls needed to process the order. All of the actions performed to complete the “submit order” action are referred to as a business transaction. The individual systems and the online shopping application need to work together to properly process the business transaction.
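As a rough illustration, the “submit order” business transaction described above might be sketched as follows. This is a minimal sketch only: the function names, the shape of the order record, and the in-process stubs standing in for the credit card, inventory, and shipping systems are all assumptions made for the example, not part of any particular system.

```python
# Hypothetical stand-ins for the remote systems named in the text.
# In a real deployment each would be a call to a separate networked system.

def verify_credit_card(order):
    """Stub for the credit card system: checks available funds."""
    return order["card_limit"] >= order["total"]


def check_inventory(order):
    """Stub for the inventory database: every item must be in stock."""
    return all(quantity > 0 for quantity in order["stock"].values())


def get_shipping_info(order):
    """Stub for the shipping system: returns delivery details."""
    return {"carrier": "example-carrier", "address": order["address"]}


def submit_order(order):
    """Run the steps of the 'submit order' business transaction in sequence.

    A failure in any single step causes the whole transaction to fail.
    """
    if not verify_credit_card(order):
        raise RuntimeError("credit card verification failed")
    if not check_inventory(order):
        raise RuntimeError("item out of stock")
    shipping = get_shipping_info(order)
    return {"status": "accepted", "shipping": shipping}
```

Even in this simplified form, the caller of `submit_order` sees only one action, while the outcome depends on every underlying call succeeding.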
Sometimes an error occurs somewhere during the business transaction, causing the transaction to fail. The distributed structure of the system makes it difficult to locate and identify the root cause of the error. In computing, certain kinds of errors or other failures generate “exceptions.” Exceptions are conditions or events that disrupt the normal flow of executing instructions in a software application and are a common part of many computing environments. Often, a root cause such as an environment, configuration, or security issue, or a bug or unsupported use in a piece of program code, manifests itself in the form of an exception. When an exception occurs during runtime, it can have unpredictable effects, such as causing a business transaction to fail or to take longer or shorter than usual to complete.
Exceptions in distributed applications can also take the form of system or application errors (for example, invalid data in requests, transport-level errors, network failure, inaccurate responses) or business errors (for example, excessive weight of shipment, bad credit for a premier customer). Unfortunately, it is usually the customer (the consumer/user of a distributed application) who experiences exceptions before anyone within the enterprise. Common examples include generic messages on an e-commerce Website (e.g., “Sorry, unable to process request at this time”), delayed orders, and/or lost packages.
Each occurrence of an exception can disrupt the customer's experience up front and may have a direct impact on the business. Therefore, managing exceptions proactively is important to any business.
Typically, programmers attempt to handle foreseeable exceptions by writing program code in the application that performs certain actions when the exception occurs during runtime. Such code is called an exception handler. However, how well an exception is handled depends on how well the programmer writes the exception handling code. Sometimes an exception does not get logged by the respective code at all. Poorly handled exceptions are very difficult to detect when they occur during runtime, which in turn makes it difficult to locate and identify their root cause. Even well-handled exceptions often end up scattered across multiple log files on different machines, in varying formats that are hard to correlate. This makes it difficult to know that an exception happened at all. Even when it is known that an exception happened, it is hard to know where the exception was logged. It is harder still to know which business transaction the exception occurred in, and to determine its criticality and impact on the business. Typically, complex and time-consuming debugging procedures are performed manually to determine this information. In some circumstances, the error conditions that caused the exception cannot be recreated, making it very difficult to identify and correct the error.
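The difference between a poorly handled and a better handled exception can be illustrated with a short sketch. The names `charge_card`, `poorly_handled`, `well_handled`, and the `transaction_id` parameter are hypothetical and exist only for this example. Logging the transaction identifier is one possible convention, assumed here for illustration.

```python
import logging

logger = logging.getLogger("orders")


def charge_card(amount):
    """Hypothetical downstream call that fails for non-positive amounts."""
    if amount <= 0:
        raise ValueError(f"invalid amount: {amount}")
    return "charged"


def poorly_handled(amount):
    # The exception is swallowed: nothing is logged, so the failure
    # leaves no trace that could later be used to find its root cause.
    try:
        return charge_card(amount)
    except Exception:
        return None


def well_handled(amount, transaction_id):
    # The handler logs the exception, including its stack trace, together
    # with an (assumed) identifier of the business transaction it belongs to.
    try:
        return charge_card(amount)
    except ValueError:
        logger.exception("charge failed in transaction %s", transaction_id)
        return None
```

Both handlers return the same value to the caller, but only the second leaves a record of the failure. Even then, that record lands in whatever log the local process writes to, which is the correlation problem described above.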