In distributed systems, it is extremely difficult to discover failures and recover from them. These systems typically log errors by capturing a stack trace and recording it to a file or data source. Because most distributed systems lack task coordination, however, recovery is even more difficult than discovery. Without such coordination, when a failure does occur, such systems cannot recover from the point in the workflow at which the failure occurred. Instead, failure recovery typically necessitates retrying the entire processing chain, which can be expensive depending on how much processing power is required.
FIGS. 1A-1C illustrate an example of this problem. In FIG. 1A, a processing chain is shown for execution by a distributed system. The processing chain has three units of work (A, B, and C) that need to be executed in sequence. FIG. 1B illustrates that when there is a failure while trying to transition from unit B to unit C, the contact points between units B and C are severed. The system modeled in FIG. 1A does not have a way to transition between units of work when a failure severs the processing chain. Advancing from unit B to unit C in this system thus requires the expenditure of engineering resources to deduce the following information: (1) what processing occurs in unit C to determine the proper way to reposition the system such that unit C can run properly; and (2) what data is necessary coming out of unit C to put the system in its proper state thereafter.
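By way of non-limiting illustration only, the problem shown in FIGS. 1A-1B may be sketched in code as follows. The sketch assumes three hypothetical units of work (`unit_a`, `unit_b`, and `unit_c`) whose arithmetic is arbitrary, and hypothetical helpers `run_chain` and `run_with_retry`; none of these names or behaviors is taken from any particular system. Because no state is saved between units, a failure at the transition into unit C leaves the system with no option other than to rerun the chain from unit A.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical units of work; the arithmetic is arbitrary and illustrative only.
def unit_a(data: int) -> int:
    return data + 1

def unit_b(data: int) -> int:
    return data * 2

def unit_c(data: int) -> int:
    return data - 3

Unit = Callable[[int], int]

def run_chain(units: List[Unit], data: int, executed: List[str],
              fail_before: Optional[str] = None) -> int:
    """Run the units in sequence without saving any state between them."""
    for unit in units:
        if unit.__name__ == fail_before:
            # Simulated failure at the transition into this unit.
            raise RuntimeError(f"transition into {unit.__name__} failed")
        data = unit(data)
        executed.append(unit.__name__)
    return data

def run_with_retry(units: List[Unit], data: int,
                   fail_before: Optional[str] = None) -> Tuple[int, List[str]]:
    """Recover from a failure the only way this model allows: start over."""
    executed: List[str] = []
    try:
        result = run_chain(units, data, executed, fail_before)
    except RuntimeError:
        # The intermediate output of the earlier units was never recorded,
        # so the whole chain is retried from the original input.
        result = run_chain(units, data, executed)
    return result, executed
```

In this sketch, invoking `run_with_retry([unit_a, unit_b, unit_c], 5, fail_before="unit_c")` executes units A and B twice before unit C runs once, which reflects the cost of retrying the processing chain rather than resuming from the point of failure.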
Many distributed systems track their states implicitly by generating log or configuration files. In most cases, the information contained in such files is not actionable by the system during recovery, and accessing this information would require the use of engineering resources to write additional executable code. Even then, however, effectively utilizing this information requires the use of auditing, and auditing from inside this executable code lacks the richness of going through the normal processing chain. In this regard, reliance on executable code to access the information contained in these log or configuration files is extremely error prone, because this executable code usually does not record enough information to safely reposition the system. This in turn places a heavy reliance on the ability of engineers to understand the internal makeup of the system and stage data appropriately. Not only that, but this model requires several people working in unison upstream to deploy and verify each fix with customers. Recovering from failures can be a very tedious and costly process given that there may be multiple customers, multiple instances of failure, and multiple failure points within a given distributed system. Accordingly, FIG. 1C represents what recovery looks like in distributed systems that do not employ the concept of task coordination.
Because recovering from failures in this fashion relies heavily on manual effort, tracking of changes is usually sparse to non-existent. Thus, this traditional approach exposes an organization to the potential of violating Federal and/or State statutory or regulatory requirements regarding data integrity, such as those imposed by the Health Insurance Portability and Accountability Act of 1996 (HIPAA). Ad hoc approaches of this nature also expose the system to downstream issues because the data has not passed through normal processing points. In addition, these types of systems do not have a way to deliver actionable information to the user that may facilitate debugging. Finally, failure recovery of this nature does not empower users to recover from incidents unilaterally, and instead requires users to interface with customer support and/or other representatives of the organization managing the distributed system.