A clustered computing environment typically includes a large set of remotely connected computing nodes. Each node often has its own independent compute, memory and storage resources, and the nodes are interconnected by a shared network and often utilize shared resources, such as a file system, a database or another component. A cluster of computing nodes may be utilized to service requests submitted by one or more software applications. Each software application may be composed of several threads that utilize one or more resources to perform certain tasks.
A resource utilized by a thread in a clustered computing environment may be a physical or a virtual computational entity (e.g., a host machine or a virtual machine) of limited availability, which may or may not be immediately accessible to a thread. Because the successful execution of the thread depends on various resources, if a resource utilized by the thread becomes unreachable, unavailable, or otherwise non-functional, the thread may not be able to complete an assigned task until that resource becomes available again.
Examples of resource dependencies include dependency on results generated by other threads and dependency on shared data, stored in a database or file system, which the thread may need to complete the execution of a task. As such, a thread may fail to complete the execution of a task in several situations: for example, if a different thread, which is called synchronously and itself depends on currently unavailable resources, fails to respond; if a database cannot lock a required resource that the thread needs to access; if shared storage on which the thread depends to store data is full; or if the computing node on which the thread is running has insufficient memory to support the completion of the task.
In the above scenarios, a thread may not be able to complete the task within the expected time, but, given some time, may be able to complete it once the missing, delayed or failed resources become available. Depending on implementation, some systems may be designed so that a thread can provide a guarantee to a higher level component indicating that a currently suspended task will be completed in the future, so that the other system components may continue to function properly without having to wait for the particular task to be completed.
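The deferred-completion behavior described above can be illustrated with a minimal Python sketch. All names here (`Worker`, `ResourceUnavailable`, `submit`, `retry_pending`, `write_record`, and the simulated `storage` flag) are hypothetical and chosen only for illustration. The sketch keeps its state in memory and does not attempt to address durability, replication, or failure of the worker itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class ResourceUnavailable(Exception):
    """Raised by a task when a resource it depends on cannot be used."""


@dataclass
class Worker:
    # Suspended tasks that the worker has promised to complete, keyed by task id.
    pending: Dict[str, Callable[[], str]] = field(default_factory=dict)
    # Results of tasks that have been completed, keyed by task id.
    results: Dict[str, str] = field(default_factory=dict)

    def submit(self, task_id: str, task: Callable[[], str]) -> str:
        """Run the task now if possible; otherwise promise to finish it later."""
        try:
            self.results[task_id] = task()
            return "completed"
        except ResourceUnavailable:
            # Record the task so the caller can move on without waiting.
            self.pending[task_id] = task
            return "guaranteed"

    def retry_pending(self) -> None:
        """Re-run suspended tasks; keep the ones whose resources are still missing."""
        for task_id, task in list(self.pending.items()):
            try:
                self.results[task_id] = task()
            except ResourceUnavailable:
                continue
            del self.pending[task_id]


# Simulated shared storage (an assumption for this example only).
storage = {"available": False}


def write_record() -> str:
    if not storage["available"]:
        raise ResourceUnavailable("shared storage is full")
    return "record written"


worker = Worker()
status = worker.submit("t1", write_record)  # storage is unavailable, so the task is deferred
storage["available"] = True                 # the resource becomes available again
worker.retry_pending()                      # the deferred task is now completed
```

In this sketch the higher level caller receives the `"guaranteed"` status immediately and can continue, while the worker remains responsible for completing the task on a later retry.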
The point in time at which a thread guarantees to complete a target task when resources become available is sometimes referred to as a roll forward point. A management mechanism needs to be in place to ensure that the target task will be completed by the responsible thread. State of the art mechanisms used for this purpose typically cannot provide a roll forward point mechanism that is highly reliable or highly available, efficient, and configurable or scalable to support the desired levels of replication for the target task.