Before presenting our transformation, we introduce background on nonblocking shared data structures, nonblocking progress conditions, and asynchronous and semisynchronous models of computation, and briefly describe some previous results that use semisynchronous models to analyze implementations.
Nonblocking Shared Data Structures
Today, almost all concurrent programs rely on blocking constructs such as mutual exclusion locks for synchronizing access to shared data structures. The use of locks introduces numerous problems, including deadlock, performance bottlenecks, and priority inversion. Researchers have investigated nonblocking implementations in the hope of eliminating these problems.
An implementation of a shared data structure in a shared memory system provides a representation of the data structure using base objects in the system and provides techniques for the processes of the system to perform operations on the data structure.
Most nonblocking implementations are based on an optimistic approach to synchronization, in which an operation is attempted but may fail to take effect if another concurrent operation interferes. In this case, the operation is retried. A significant source of difficulty is guaranteeing that an operation is not retried repeatedly without ever completing. Generally, stronger nonblocking progress guarantees are more difficult to achieve, and require techniques that are more complicated and more expensive.
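The optimistic pattern can be illustrated with a minimal Python sketch of a retry loop built on compare-and-swap (CAS). Everything here is an assumption made for illustration: the names AtomicCell and optimistic_increment are hypothetical, and because Python exposes no hardware CAS, a private lock stands in for the atomicity that a base object would provide.

```python
import threading

class AtomicCell:
    """Models a base object that supports an atomic compare-and-swap.

    The private lock only simulates hardware atomicity; it is an artifact
    of the model, not part of the optimistic technique itself.
    """

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def optimistic_increment(cell):
    """Attempt the update; retry whenever a concurrent update interfered."""
    attempts = 0
    while True:
        attempts += 1
        seen = cell.load()
        if cell.compare_and_swap(seen, seen + 1):
            return attempts  # number of tries this operation needed

def run_demo(num_threads=4, increments_per_thread=1000):
    cell = AtomicCell(0)

    def worker():
        for _ in range(increments_per_thread):
            optimistic_increment(cell)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return cell.load()
```

Some attempt always succeeds, but nothing in the loop bounds how often one particular operation is retried, which is exactly the difficulty described above.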
Nonblocking Progress Conditions
A wait-free implementation guarantees that when a process performs an operation, it completes the operation in a finite number of its own steps, regardless of how fast or slowly other processes execute, and even if they stop executing permanently. (See M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, January 1991.) Such strong progress guarantees are attractive, but often very difficult to achieve. Most wait-free techniques in the literature are too complicated and too expensive to be useful in practice.
A lock-free implementation guarantees that, starting from any state in which one or more processes are executing operations, some process will complete its operation within a finite number of steps. This weaker progress condition usually makes lock-free implementations easier to design than wait-free ones. Simple and practical lock-free implementations have been achieved for a small number of important data structures, such as stacks, queues, and work-stealing deques. Lock-freedom has generally been regarded as acceptable because well-known contention management techniques such as backoff are effective at reducing contention when it arises, thereby achieving progress in practice, despite the lack of the strong theoretical guarantee of wait-freedom. (See A. Agarwal and M. Cherian. Adaptive backoff synchronization techniques. In Proceedings of the 16th International Symposium on Computer Architecture, pages 396-406, May 1989.)
Herlihy, Luchangco, and Moir [Herlihy2003A] recently proposed the obstruction-free approach to implementing nonblocking operations for shared data structures. An obstruction-free implementation simply guarantees that a process will complete its operation if it eventually executes enough steps without interference from other processes. Thus, if two or more processes repeatedly interfere with each other, it is possible that none of them completes its operation. The view is that, because contention management techniques are required to achieve acceptable performance when contention arises anyway, it is unnecessary to make any progress guarantees in the case of contention between concurrent operations.
Several examples in the literature suggest that by providing only obstruction-free progress guarantees, significantly simpler implementations can be achieved that are faster in the uncontended case. Furthermore, although an implementation that is obstruction-free but not lock-free will exhibit livelock if contention is ignored, experience shows that livelock can be effectively avoided by using simple contention-management strategies.
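As a concrete picture of such a strategy, the following minimal Python sketch shows randomized exponential backoff, the kind of contention management mentioned earlier. The function names, the default parameters, and the injected wait callback are all illustrative assumptions; they are not taken from any implementation described here.

```python
import random

def backoff_limits(retries, base=1, cap=64):
    """Yield the upper limit of the waiting window before each retry."""
    limit = base
    for _ in range(retries):
        yield limit
        limit = min(cap, limit * 2)  # double the window, up to a cap

def retry_with_backoff(attempt, retries=10, base=1, cap=64, wait=None, rng=None):
    """Call attempt() until it succeeds, backing off after each failure.

    Returns the number of attempts made, or None if every attempt failed.
    """
    rng = rng or random.Random()
    wait = wait or (lambda steps: None)  # placeholder: a real system would idle
    for tries, limit in enumerate(backoff_limits(retries, base, cap), start=1):
        if attempt():
            return tries
        wait(rng.randint(0, limit))  # a random delay makes repeated collisions unlikely
    return None
```

Backoff of this kind makes livelock unlikely in practice but gives no guarantee that any particular operation completes.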
Asynchronous and Semisynchronous Models of Computation
Concurrent techniques are usually required to ensure safety properties regardless of how the steps of concurrent processes are interleaved, and (therefore) regardless of how fast or slowly any process executes. In other words, these techniques should be proved safe in an asynchronous model of computation, in which the steps of processes are scheduled by an adversarial scheduler that can perform many steps of a process consecutively or perform them arbitrarily far apart. In such a model, it is impossible for a process to determine whether another process has crashed (i.e., stopped executing) or is just running very slowly.
Of course, in reality, there are limits to how fast or slowly processes can run. Some techniques exploit assumptions about these limits to improve in various ways on techniques designed for an asynchronous model. Such techniques are analyzed in synchronous or semisynchronous models of computation that embody timing assumptions made about the target execution environment.
In a synchronous model, all processes execute steps at the same rate (until they crash). This means that if a process does not perform a step when it should, other processes can detect that it has crashed. However, if the correctness of a particular technique depends on all (non-crashed) processes performing steps precisely at a given rate, then tiny variations in execution rate, for example due to one processor becoming warmer than another, can cause incorrect behavior. Consequently, such techniques are not generally practical.
Semisynchronous models relax these timing requirements, allowing processes to execute steps at different rates, and even allowing the rate at which a particular process executes to vary over time. However, it is assumed that there is an upper bound on the relative execution rates of any pair of processes. To be more precise, let us define the maximum step time of an execution as the longest time between the completion times of consecutive steps of any process, and the minimum step time, analogously, as the shortest such time. Semisynchronous models assume that there exists a finite R such that, in all executions, the ratio of the maximum step time to the minimum step time is at most R. The evaluation of techniques in semisynchronous models has value for the practitioner because real-world systems satisfy the assumptions of such models, and for the theoretician in understanding the limitations of assumptions on timing.
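The definition can be checked on a small example. The Python sketch below computes the ratio of maximum to minimum step time for one recorded execution; the function names and the sample completion times are made up for illustration.

```python
def step_times(completions):
    """Gaps between the completion times of consecutive steps of one process."""
    return [later - earlier for earlier, later in zip(completions, completions[1:])]

def step_time_ratio(execution):
    """Ratio of the maximum step time to the minimum step time of an execution.

    `execution` maps each process name to the completion times of its steps.
    """
    gaps = [gap for completions in execution.values() for gap in step_times(completions)]
    return max(gaps) / min(gaps)

# Hypothetical execution: p0 runs steadily, p1 stalls once between steps.
execution = {
    "p0": [0.0, 1.0, 2.0, 3.0],
    "p1": [0.0, 0.5, 2.5, 3.0],
}
```

Here the longest gap is 2.0 and the shortest is 0.5, so step_time_ratio(execution) is 4.0. A semisynchronous model asserts that some finite R bounds this ratio over every execution of the system, not just this one.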
In the known-bound model, R is known by all processes. (See [Alur1997] R. Alur, H. Attiya, and G. Taubenfeld. Time-adaptive algorithms for synchronization. SIAM Journal on Computing, 26(2):539-556, 1997.) This implies that a process can wait long enough to guarantee that every other process has taken another step, or has crashed. Some techniques that depend on knowledge of R can violate safety requirements in systems that do not satisfy the assumed bound. Conservative estimates of the bound for a particular system generally translate into worse performance, so designers are faced with a dangerous tradeoff in using such techniques. Thus, such techniques are not easily portable and indeed may violate safety properties in a given system if the system stops satisfying the timing assumptions, for example due to increased temperature.
In the unknown-bound model [Alur1997], R is not known to processes. Thus, in contrast to the synchronous and known-bound models, a process does not know how long to wait to ensure that every other process that has not crashed takes a step. Therefore, it is not possible for a process to detect that another process has crashed. Nonetheless, it is possible for applications to wait for increasingly longer periods, and to exploit the knowledge that eventually all non-crashed processes have taken a step during one of these periods. It has been shown that an application that is correct in this model does not violate any of its safety properties even in an asynchronous model, although progress properties proved in the unknown-bound model may not hold in an asynchronous model.
Applications that are correct in an asynchronous model are nonetheless sometimes analyzed in a synchronous or semisynchronous model, thus allowing the analysis to depend on various timing assumptions. Because contention management techniques such as backoff fundamentally rely on operations waiting for some time before retrying, they cannot be meaningfully analyzed in an asynchronous model of computation, which has no notion of time whatsoever.
In this description, we show how to transform any obstruction-free implementation into one that guarantees that every process performing an operation eventually completes the operation, when analyzed in the unknown-bound model. Thus, the resulting implementation is safe to use in any non-real-time application, and guarantees that every operation eventually completes in any practical system.
Our Transformation
We begin by explaining some simple ways of ensuring progress for each operation under various assumptions and models. These ideas motivate the techniques used in our transformation, and explain why they are needed under the weak assumptions of the unknown-bound model.
First, if we assume that processes never crash, then it is easy to ensure progress, even in an asynchronous model. This is achieved by ordering operations using timestamps, and having each process wait until all earlier operations in this order have completed before performing the steps of its own operation. This ensures that operations do not encounter contention with concurrent operations while executing the original obstruction-free technique, so every operation eventually completes. However, if a process does crash while it has a pending operation, no operations with later timestamps can be executed.
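A minimal Python sketch of this crash-free scheme follows. The class name TimestampOrder and its fields are hypothetical, and a condition variable stands in both for the atomic fetch-and-increment and for the busy-waiting that a shared memory implementation would use.

```python
import threading

class TimestampOrder:
    """Runs operations one at a time, in timestamp order (no crash tolerance)."""

    def __init__(self):
        self._next_timestamp = 0
        self._oldest_pending = 0
        self._cond = threading.Condition()

    def run(self, operation):
        with self._cond:
            # Models fetch-and-increment on a shared counter.
            timestamp = self._next_timestamp
            self._next_timestamp += 1
            # Wait until every operation with an earlier timestamp has completed.
            self._cond.wait_for(lambda: self._oldest_pending == timestamp)
        try:
            # No other operation is running now, so there is no contention.
            return operation(timestamp)
        finally:
            with self._cond:
                self._oldest_pending += 1
                self._cond.notify_all()

def run_demo(num_threads=8):
    order = TimestampOrder()
    log = []
    threads = [threading.Thread(target=order.run, args=(log.append,))
               for _ in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log  # timestamps in the order their operations ran
```

If the thread holding the oldest timestamp simply stopped taking steps inside operation, the counter of completed operations would never advance and every later operation would wait forever, which is the weakness noted above.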
In a synchronous model, if all processes know an upper bound B on the number of consecutive steps that must be taken by a process to ensure that its operation completes, then it is easy to guarantee that each operation completes, even if processes can crash. The idea is, again, to order operations using timestamps and to have processes refrain from executing their operations while operations with earlier timestamps are pending. However, unlike in the asynchronous model, a process can detect if another process crashed while executing its operation: if the operation is not completed within B steps, then the process executing it must have crashed. In this case, a process can execute its operation when every operation with an earlier timestamp has either completed, or will not interfere further because the process executing it has crashed.
A similar approach works in the known-bound model. In this case, a process that is waiting for an earlier operation than its own to complete must conservatively assume that it is executing its steps at the maximum speed allowed by the model relative to the speed of the process executing the earlier operation. Thus, in this model, a process must wait for RB steps in order to be sure that another process has had time to execute B steps, where R is the ratio of the maximum step time to the minimum step time.
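The count RB follows from a short calculation, sketched here under the assumption that \tau_{\min} and \tau_{\max} (symbols introduced only for this illustration) denote the minimum and maximum step times, so that \tau_{\max} \le R\,\tau_{\min}:

```latex
\begin{align*}
\text{time spent by the waiting process over } RB \text{ steps}
  &\ge RB \cdot \tau_{\min} \\
  &\ge B \cdot \tau_{\max}
     && \text{since } \tau_{\max} \le R\,\tau_{\min} \\
  &\ge \text{time needed by the other process to take } B \text{ steps.}
\end{align*}
```

For instance, with the illustrative values R = 4 and B = 100, a waiting process would count 400 of its own steps before concluding that the other process has either completed its operation or crashed.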
However, this technique does not work in the unknown-bound model because the bound R is not known to processes. In fact, in this model, it is impossible for one process to determine that another process has crashed. Nonetheless, ideas similar to those described above can be used to guarantee that each operation executed by a process that does not crash will complete even in the unknown-bound model. The key idea is that, rather than delaying for an amount of time that is known to be long enough to allow another process to take B steps, a process can delay for increasingly long periods of time while an earlier operation has not completed.
Each time a process performs b steps of its operation, for some constant b, it increases a counter. This serves the dual purposes of demonstrating that it has not crashed, and therefore must be deferred to by later operations, as well as increasing the number of steps for which later operations must defer. After a process has waited the required number of steps for an earlier operation that has not been completed and whose counter has not been increased, it assumes that the process performing the earlier operation has crashed. Consequently, it removes the timestamp of that operation from the order of operations under consideration and proceeds.
In case the process executing the earlier operation has, in fact, not crashed, it reinstates its operation into the order (using its original timestamp). With this arrangement, if a process does crash while executing an operation, then it is removed from consideration and does not prevent progress by other operations. On the other hand, if an operation fails to complete because others did not wait long enough, then they will wait longer next time, so the bound provided by the model ensures that eventually they will wait long enough and the operation will complete.
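The effect of waiting longer each time can be seen in a small discrete-time simulation, given here as a rough Python sketch rather than a model of the full transformation. It assumes a waiting process that takes one step per time unit and a slower process that takes one step every hidden_ratio time units and increases its counter after every b steps; the function name and both simplifications are illustrative assumptions.

```python
def simulate_waiter(hidden_ratio, b, horizon):
    """Return the wait lengths after which the waiter wrongly suspected a crash.

    The waiter never learns hidden_ratio; it only sees the counter value.
    """
    def counter_at(time):
        # The slow process starts at 1 and adds 1 after every b of its steps,
        # each of which takes hidden_ratio time units.
        return 1 + time // (b * hidden_ratio)

    now = 0
    false_suspicions = []
    while now < horizon:
        a = counter_at(now)        # read the activity counter
        now += a                   # wait a steps, one time unit each
        if counter_at(now) == a:   # counter unchanged: suspect a crash
            false_suspicions.append(a)
    return false_suspicions
```

With hidden_ratio = 3 and b = 2, the slow process increases its counter every 6 time units, and every false suspicion occurs while the wait is still shorter than 6. Once the counter reaches 6, each wait spans an increase and the waiter never suspects the slow process again.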
It is important to note that the worst-case bound R for a particular system might be very high, because a process might occasionally take a very long time between two steps. However, the application has no knowledge of the bound, so the bound does not affect the performance of the application; only the particular execution behavior does. Furthermore, even if an unlikely sequence of events causes progress to take a long time, this has no bearing on how the application behaves in the future. In practice, processes run at approximately the same speed most of the time. Therefore, the effective bound will generally be small, even if, in theory, the actual bound is very large.
This description captures the key idea about how we transform implementations to provide progress guarantees in the unknown-bound model. However, because this strategy essentially amounts to eliminating concurrency, it would not be practical if simply used as described. Therefore, our transformation does not employ this strategy until some process determines that it has executed the original operation too long without making progress.
The application produced by applying our transformation to an obstruction-free application OFApp (which may include a contention manager) is shown in Table 1. We now describe the transformed application in more detail. The PANIC flag is used to regulate when the strategy to ensure progress should be used. Referring to both Table 1 and the flow chart illustrated in FIG. 2, when a process invokes an operation, it first checks this flag (N1; step 204) and, if it is not set (i.e., has the value FALSE), executes up to B steps of its original application (N2; step 208), where B is a parameter of the transformation. If these steps are sufficient to complete its operation, the process simply returns (N3-N4). Observe that if every operation completes within B steps, then the PANIC flag remains unset (FALSE), so the transformed application behaves exactly like the original one, except that it must read one variable, which is likely to be cached. Thus, by choosing B appropriately, we ensure that our transformation introduces very little overhead, if the original contention manager is effective.
TABLE 1
invoke(op)
N1:   if ¬PANIC
N2:     execute up to B steps of OFApp
N3:     if op is complete
N4:       return response
N5:   PANIC ← TRUE
      // panic mode
P1:   t ← fetch-and-increment(C)
P2:   A[i] ← 1
      repeat
P3:     T[i] ← t
        // find minimum time stamp; reset all others
P4:     m ← t
        k ← i
P5:     for each j ≠ i
P6:       s ← T[j]
P7:       if s < m
P8:         T[k] ← ∞
P9:         m ← s
            k ← j
          else
P10:        if (s < ∞) T[j] ← ∞
P11:    if k = i
          repeat
P12:        execute up to b steps of OFAlg
P13:        if (op is complete)
P14:          T[i] ← ∞
P15:          PANIC ← FALSE
P16:          return response
P17:        A[i] ← A[i] + 1
P18:        PANIC ← TRUE
P19:      until (T[i] = ∞)
        else
          repeat
P20:        a ← A[k]
P21:        wait a steps
P22:        s ← T[k]
P23:      until a = A[k] or s ≠ m
P24:      if (s = m) T[k] ← ∞
P25:  until (op is complete)
If its operation fails to complete within B steps, the process sets the PANIC flag (to have the value TRUE) (N5; step 206). Thereafter, until the flag is reset, all new operations see that the PANIC flag is set (to have the value TRUE) and begin to participate in the strategy to ensure progress (P1-P25).
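The fast path (N1-N5) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the transformation itself: the operation is modeled as a generator in which each call to next() is one step and the return value is the response, and the names Shared and invoke_fast_path are hypothetical.

```python
class Shared:
    """Holds the shared PANIC flag, initially unset."""

    def __init__(self):
        self.panic = False

def invoke_fast_path(shared, operation, B):
    """Lines N1-N5. Returns (completed, response)."""
    if not shared.panic:                    # N1: one read of a shared flag
        for _ in range(B):                  # N2: up to B steps of the original
            try:
                next(operation)
            except StopIteration as done:   # N3: the operation is complete
                return True, done.value     # N4: return its response
    shared.panic = True                     # N5: give up on the fast path
    return False, None                      # caller continues in panic mode

def three_step_op():
    """A stand-in operation that completes on its third step."""
    yield
    yield
    return "done"
```

When every operation finishes within B steps, the only added cost is the read at N1, which matches the observation that the transformed implementation then behaves like the original.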
A process pi participating in this strategy first acquires a timestamp (P1; step 214), initializes its activity counter A[i] (P2; step 216), and then repeats loop P3-P25 until its operation is complete. In each iteration of this loop, the process first determines which process pk has the minimum timestamp and sets all other timestamps to ∞ (step 218). To accomplish this, process pi announces its timestamp in T[i] (P3) and then searches for the minimum (i.e., oldest) timestamp announced by any process. All timestamps that are not ∞, but are larger than the minimum timestamp it observes, are replaced by ∞ (P4-P10).
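The scan in P3-P10 can be traced with a short Python sketch. It runs the scan sequentially on a plain list, whereas in the transformation each read and write of T is a separate shared-memory step that may interleave with other processes; the function name and the sample values are illustrative.

```python
import math

INF = math.inf

def announce_and_scan(T, i, t):
    """Lines P3-P10 for process i with timestamp t.

    Returns (m, k): the minimum timestamp observed and its owner.
    """
    T[i] = t                  # P3: announce own timestamp
    m, k = t, i               # P4: best candidate so far is itself
    for j in range(len(T)):   # P5: for each j != i
        if j == i:
            continue
        s = T[j]              # P6
        if s < m:             # P7: found an older operation
            T[k] = INF        # P8: retire the previous candidate
            m, k = s, j       # P9
        elif s < INF:         # P10: newer than the minimum, so reset it
            T[j] = INF
    return m, k

# Hypothetical state: process 0 arrives with timestamp 4.
T = [INF, 7, 3, 5]
result = announce_and_scan(T, 0, 4)
```

In this trace result is (3, 2) and T ends as [∞, ∞, 3, ∞]: process 0 has retired its own entry along with the two newer ones, and will defer to process 2.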
If pi determines that it has the minimum timestamp (P11), then it repeatedly takes up to b steps of the original application (P12; step 222) (where the constant b is a parameter of the transformation), increases its activity counter (P17; step 225), and sets the PANIC flag (to have the value TRUE) (P18; step 226). Note that the PANIC flag may have been unset (to have the value FALSE) because some other process completed its operation (P15). Setting the PANIC flag (to have the value TRUE) ensures that new operations continue to participate in the strategy to ensure progress. Process pi repeats these steps until either its operation finishes (P13-P16) or some other process overwrites its timestamp with ∞ (P19). The latter case indicates that this other process has read an older timestamp (P8, P10) or thinks that pi may have crashed (P24). On the other hand, if pi finishes its operation, it sets T[i] to ∞ (P14; step 227), unsets the PANIC flag (to have the value FALSE) (P15; step 228), and returns from the invocation.
If process pi determines that some other process pk has the minimum timestamp (P11), then pi enters loop P20-P23. During each iteration of this loop, pi reads pk's activity counter A[k] (P20) and waits for the indicated number of steps (P21; step 230). If pk's timestamp is overwritten during this period of time, then either pk has completed its operation, another process thought that pk had crashed, or another process saw an operation with a smaller timestamp. In this case, pi exits the loop (P23). If pk's timestamp is not overwritten by another value and pk does not increase its activity counter while pi is waiting at line P21, then pk may have crashed, so pi exits the loop (P23) and overwrites pk's timestamp with ∞ (P24; step 232). Process pi then returns to step 218 and, in doing so, again sets T[i] to its timestamp. Note that each process remembers its own timestamp in a local variable (t in Table 1), and this copy is not erased when other processes set T[i] to ∞. Process pi uses this remembered timestamp to restore T[i] when it transitions from step 232 to step 218.
We now informally explain why our strategy ensures a guarantee of progress. Consider a process pi that does not crash and whose pending operation has the minimum timestamp. Eventually, in every iteration of loop P3-P25, process pi enters loop P12-P19. Meanwhile, other processes determine that pi's timestamp is the minimum and wait for a number of steps indicated by pi's activity counter A[i]. If pi does not complete its operation within b steps, then it increases its activity counter A[i]. Eventually, no process resets T[i] to ∞, and A[i] becomes large enough that each process executing loop P20-P23 waits long enough at P21 for pi to increase A[i] during this period. Thus, eventually, all other active processes remain in loop P20-P23, so no process except pi executes steps of the original application. Hence, obstruction freedom guarantees that pi eventually completes its operation.
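The claim that A[i] eventually becomes large enough can be made concrete with a rough calculation. The symbols \tau_{\min}, \tau_{\max}, and c are assumptions of this sketch: \tau_{\min} and \tau_{\max} are the minimum and maximum step times, so that \tau_{\max} \le R\,\tau_{\min}, and c is a constant bounding the bookkeeping steps (P13 and P17-P19) in one iteration of loop P12-P19.

```latex
\begin{align*}
\text{time between consecutive increases of } A[i]
  &\le (b + c)\,\tau_{\max} \le (b + c)\,R\,\tau_{\min}, \\
\text{time a process waits at P21 after reading } a = A[i]
  &\ge a\,\tau_{\min}.
\end{align*}
```

Hence, provided pi remains in loop P12-P19, once a \ge (b + c)R every wait at P21 spans at least one increase of A[i], and the test at P23 no longer leads a waiting process to conclude that pi has crashed. Because A[i] grows in every iteration, this threshold is reached after finitely many iterations, even though no process knows R.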
On the other hand, if pi crashes, then the other processes will no longer see A[i] change, will stop waiting for pi, and will overwrite its timestamp with ∞. Then the way is clear for the next operation in timestamp order (if any) to make progress.
An important feature of the transformed implementation is that, if the original contention manager is occasionally ineffective, causing the PANIC flag to be set, the PANIC flag will be reset and normal execution will resume, provided the original contention manager does not remain ineffective. To see this, recall that every operation by a non-crashed process eventually completes, and note that each operation either sees that PANIC is unset (has the value FALSE) and does not set it (N1), or unsets PANIC (to have the value FALSE) before returning (P15). Furthermore, PANIC is set (to have the value TRUE) only by an operation that has executed either B or b steps of the original application (including the original contention manager) without completing. Thus, with appropriate choices for B and b, we ensure that our mechanism continues to be invoked only if the original contention manager continues to be ineffective.
Note that a number of aspects of the exemplary code that appears in Table 1 can be modified. For example, the choice of specific values for B and b can depend on the particular operation being executed (and the process executing it). Hence, B and b need not be constant parameters of the transformation. The only requirement is that there is some upper bound for B over all operations. Note that a process can raise the PANIC flag earlier if it finds that it is encountering too much contention, so B is just a worst-case bound, and should be chosen conservatively. Furthermore, it is desirable for the value of b to be constant for each operation. That is, once b is chosen for a particular operation, the operation should use the same value of b throughout. Again, however, b is only an upper bound: the process may choose to execute fewer steps of OFAlg, as long as it executes at least one step.
We can also allow A[i] to be initialized (on line P2) with any value, and the choice of what value to use may depend on the operation. Also, at line P17, the process may increase its activity counter by any amount, as long as it increases it. (In the case where the activity counter is an integer, the process has to increase by at least 1.)
Concluding Remarks
We have shown that any obstruction-free application can be transformed into a new application that is wait-free when analyzed in the unknown-bound semisynchronous model of computation. Our transformation can be applied to an obstruction-free implementation, together with any valid contention manager, and the transformed implementation behaves like the original as long as the chosen contention manager is effective. Because real-world systems satisfy the assumptions of the model we consider, our result shows that obstruction-free applications and ad hoc contention managers can be used in practice without sacrificing the strong progress guarantees of wait-freedom.
Our result can easily be made stronger from both practical and theoretical points of view. First, as presented, our transformation introduces the need to know the maximum number of processes that use the implementation. However, this disadvantage can easily be eliminated using results of Herlihy, Luchangco and Moir. (See M. Herlihy, V. Luchangco, and M. Moir. Space- and time-adaptive nonblocking algorithms. In Proceedings of Computing: The Australasian Theory Symposium (CATS), 2003.) From a theoretical point of view, our use of fetch-and-increment can be eliminated by using standard timestamping techniques based on an array of single-writer-multiple-reader registers. Thus, our transformation is applicable in a wide range of shared memory systems, as it does not depend on any special support for synchronization.
Furthermore, because we can replace the fetch-and-increment with time-stamping techniques based on registers, it is possible to apply our transformation to truly distributed systems without shared memory. Note that existing techniques make it possible to implement shared memory on a distributed, message-passing system as long as not more than half the processes are faulty.
The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.