1. Field of the Invention
The present invention relates, in general, to multi-threaded program execution, and, more particularly, to software, systems and methods for barrier synchronization in multi-threaded applications.
2. Relevant Background
An executing software application comprises one or more “processes” where each process is relatively independent of other processes. In general, each process is allocated its own resources, data structures, memory, and the like so that it executes as an atomic unit with little risk of interfering with other processes and little risk of being interfered with by other processes. The collection of computational resources allocated to a process is referred to as “context”. In environments where the context can be dynamically switched, multiple processes can run concurrently creating an effect similar to multiple programs running simultaneously. Additionally, by breaking a complex software application down into multiple independent processes the resulting application is often easier to design and implement, and a more robust application results. However, switching between processes requires a significant amount of overhead as processing resources and memory are de-allocated from one process and re-allocated to a new process.
Computer environments often support processes with multiple threads of execution (i.e., threads) that can work together on a single computational task. The term “thread” in a general sense refers merely to a simple execution path through application software and the kernel of an operating system executing with the computer. Multithreading is a technique that allows one program to do multiple tasks concurrently by implementing each task as a separate thread. Threads share an address space, open files, and other resources but each thread typically has its own stack in memory. One advantage of using threads instead of a sequential program is that several operations may be carried out concurrently, and thus events can be handled more efficiently as they occur. Another advantage of using a thread group over using multiple processes is that context switching between threads is much faster than context switching between processes. Also, communication between threads is usually more efficient and easier to implement than communications between processes.
Threads typically execute asynchronously with respect to each other. That is to say, the operating environment does not usually enforce a completion order on executing threads, so that threads normally cannot depend on the state of operation or completion of any other thread. One of the challenges in using multithreading is to ensure that threads can be synchronized when necessary. For example, array and matrix operations are used in a variety of applications such as graphics processing. Matrix operations can be efficiently implemented by a plurality of threads where each thread handles a portion of the matrix. However, the threads must stop and wait for each other frequently so that faster threads do not begin processing subsequent iterations before slower threads have completed computing the values that will be used as inputs for later operations.
Barriers are constructs that serve as synchronization points for groups of threads that must wait for each other. A barrier is often used in iterative processes such as manipulating an array or matrix to ensure that all threads have completed a current round of an iterative process before being released to perform a subsequent round. The barrier provides a “meeting point” for the threads so that they synchronize at a particular point such as the beginning or ending of an iteration. Each iteration is referred to as a “generation”. A barrier is defined for a given number of member threads, sometimes referred to as a thread group. This number of threads in a group is typically fixed upon construction of the barrier. In essence, a barrier is an object placed in the execution path of a group of threads that must be synchronized. The barrier halts execution of each of the threads until all threads have reached the barrier. The barrier determines when all of the necessary threads are waiting (i.e., all threads have reached the barrier), then notifies the waiting threads to continue.
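By way of illustration only, the iterative use of a barrier described above can be sketched in Python with the standard library's `threading.Barrier`. The smoothing computation, the function names `smooth_parallel` and `smooth_sequential`, and the double-buffer layout are assumptions chosen for the example and do not come from the text; each pass of the outer loop corresponds to one generation.

```python
import threading


def smooth_sequential(values, rounds):
    """Reference version: replace each interior cell by the mean of its neighbourhood."""
    cur = list(values)
    for _ in range(rounds):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def smooth_parallel(values, rounds, workers):
    """Same computation, with each thread handling one slice of the array."""
    cur = list(values)
    nxt = list(values)
    n = len(cur)
    barrier = threading.Barrier(workers)  # group size is fixed at construction

    def work(tid):
        nonlocal cur, nxt
        lo = 1 + tid * (n - 2) // workers
        hi = 1 + (tid + 1) * (n - 2) // workers
        for _ in range(rounds):  # one round == one generation
            for i in range(lo, hi):
                nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
            # No thread may start the next round until every slice is written.
            if barrier.wait() == 0:
                cur, nxt = nxt, cur  # exactly one thread swaps the buffers
            # Second meeting point: nobody reads `cur` until the swap is done.
            barrier.wait()

    threads = [threading.Thread(target=work, args=(t,)) for t in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cur
```

Without the two meeting points, a fast thread could read neighbouring cells that a slower thread has not yet updated for the current round.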
A conventional barrier is implemented using a mutual exclusion (“mutex”) lock, a condition variable (“cv”), and variables to implement a counter, a limit value and a generation value. When the barrier is initialized for a group of threads of number “N”, the limit and counter values are initialized to N, while the variable holding the generation value is initialized to zero. By way of analogy, using a barrier is akin to organizing a group of hikers to wait at a particular place (e.g., wait at the Long's Peak trailhead) until a certain circumstance has occurred (e.g., until all hikers have arrived). The cv is essentially the name of the place at which each of the threads waits, but is not otherwise manipulated by the threads using the barrier. The limit variable represents the total number of threads in the group, while the counter value represents the number of threads that have not yet reached the waiting point.
A thread “enters” the barrier and acquires the barrier lock. Each time a thread reaches the barrier, it checks how many other threads have previously arrived by examining the counter value, and determines from that value whether it is the last to arrive thread. Each thread that determines it is not the last to arrive (i.e., the counter value is greater than one) will decrement the counter and then execute a “cond_wait” instruction to place the thread in a sleep state. Each waiting thread releases the lock and waits in an essentially dormant state.
Essentially, the waiting threads remain dormant until signaled by the last thread to enter the barrier. In some environments, threads may spontaneously awake before receiving a signal from the last to arrive thread. In such a case the spontaneously awaking thread must not behave as or be confused with a newly arriving thread. Specifically, it cannot test the barrier by checking and decrementing the counter value.
One mechanism for handling this is to cause each waiting thread to copy the current value of the generation variable into a thread-specific variable called, for example, “mygeneration”. For all threads except the last thread to enter the barrier, the mygeneration variable will represent the current value of the barrier's generation variable (e.g., zero in the specific example). While its mygeneration variable remains equal to the barrier's generation variable the thread will continue to wait. The last to arrive thread will change the barrier's generation variable value. In this manner, a waiting thread can spontaneously awake, test the generation variable, and return to the cond_wait state without altering barrier data structures or function.
When the last to arrive thread enters the barrier, the counter value will be equal to one. The last to arrive thread signals the waiting threads using, for example, a cond_broadcast instruction, which signals all of the waiting threads to resume. It is this nearly simultaneous awakening that leads to contention as the barrier is released. The last to arrive thread may also execute instructions to prepare the barrier for the next iteration, for example by incrementing the generation variable and resetting the counter value to equal the limit variable. Expressed in pseudocode, the above steps may be represented as shown in Table 1.
TABLE 1

Initialize barrier for N thread usage
  counter = N                        /* N threads in group */
  limit = N                          /* N threads in group */
  generation = 0

wait
  acquire lock
  if counter == 1                    /* detect last to arrive thread */
    generation++                     /* prepare for next iteration */
    counter = limit                  /* prepare for next iteration */
    cond_broadcast                   /* awaken waiting threads */
  else
    mygeneration = generation        /* copy generation variable */
    counter--                        /* decrement counter */
    while mygeneration == generation
      cond_wait                      /* wait until next iteration */
  release lock
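The pseudocode of Table 1 can be rendered as a minimal runnable sketch in Python, in which `threading.Condition` stands in for the mutex and condition variable pair and `notify_all` plays the role of cond_broadcast. The class name `ConventionalBarrier` and the helper `run_rounds` are hypothetical names introduced for this illustration; Python also ships a ready-made `threading.Barrier`, so the class below is for exposition only.

```python
import threading


class ConventionalBarrier:
    """Mutex + condition variable + counter/limit/generation, as in Table 1."""

    def __init__(self, n):
        self.cv = threading.Condition(threading.Lock())
        self.limit = n        # N threads in group
        self.counter = n      # threads still expected in this generation
        self.generation = 0

    def wait(self):
        with self.cv:                          # acquire lock
            if self.counter == 1:              # last to arrive thread
                self.generation += 1           # prepare for next iteration
                self.counter = self.limit      # prepare for next iteration
                self.cv.notify_all()           # awaken waiting threads
            else:
                my_generation = self.generation
                self.counter -= 1
                # Re-test after every wakeup so that a spontaneous wakeup
                # simply goes back to sleep without touching the counter.
                while my_generation == self.generation:
                    self.cv.wait()             # releases the lock while asleep
        # leaving the `with` block releases the lock


def run_rounds(n_threads, rounds):
    """Each thread records how many arrivals it observes after each round."""
    barrier = ConventionalBarrier(n_threads)
    arrived = [0] * rounds
    observed = []
    guard = threading.Lock()

    def worker():
        for r in range(rounds):
            with guard:
                arrived[r] += 1
            barrier.wait()
            with guard:
                observed.append(arrived[r])

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed
```

Every value returned by `run_rounds` should equal the group size, since no thread gets past `wait` for a round until all members have arrived for that round. Note that each awakened thread must reacquire the single lock inside `cv.wait()` before it can leave.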
Before leaving the barrier, each of the awakened threads must acquire the barrier's lock; however, only one thread can own the lock at any time. The awakened threads will attempt to acquire the lock as many times as necessary. Because they are all trying to acquire the lock concurrently, most of the threads will have to make multiple attempts to acquire the lock. After each failed attempt, the thread will go back into a wait state, idle for several clock cycles, then attempt to reacquire the lock. When a large number of threads are using a barrier (e.g., more than eight threads), the delay incurred by the last to leave thread can be significant.
When exiting the barrier, the threads have been explicitly synchronized, so contention for the mutex lock necessarily exists. Consider a group of N threads: although one thread will leave the barrier on its first attempt, each of the other threads will be required to make multiple attempts. The last thread to leave the barrier will have made N−1 attempts before it is able to acquire the mutex lock and leave the barrier. In some cases, the first thread or threads to leave the barrier may complete the next iteration and arrive back at the barrier before all of the previous generation threads have managed to leave the barrier. While this “lapping” phenomenon can be controlled with appropriate coding, it demonstrates the limitations of conventional barrier structures.
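The attempt counts above can be totaled under a simple worst-case assumption that is not stated in the text: all N−1 waiting threads are awakened together, and every thread still waiting makes exactly one attempt each time the lock is released, so that the k-th thread to leave succeeds on its k-th attempt. The symbol A_total is introduced here only for illustration.

```latex
A_{\text{total}} \;=\; \sum_{k=1}^{N-1} k \;=\; \frac{N(N-1)}{2}
```

Under this assumption, a group of N = 8 threads makes 28 acquisition attempts to release one generation, and a group of N = 64 makes 2016, so the total grows roughly with the square of the group size.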
When the number of threads using a barrier becomes large, a single mutex becomes a critical resource. As the number of threads grows, the overhead created by this contention increases non-linearly and can negatively affect performance. As a result, conventional barrier implementations do not scale well. This contention has a negative impact on application performance as time and processor resources are consumed in arbitrating for control over the mutex lock rather than executing application programs. Therefore, a need exists for an efficient method and apparatus for synchronizing threads.
A semaphore is another type of synchronization construct. A semaphore is typically implemented as a mutex, condition variable and counter. A semaphore is used to manage access to a limited resource. A physical example of a semaphore is a queue at a bank waiting for a teller. When there are X tellers, there cannot be more than X customers served concurrently. The duration of any particular service request is variable. In order to get service, a customer needs one of the X tellers. In the following example assume that the number of tellers (X) is 2. The first customer to arrive at the bank notices that there is a free teller and begins a transaction. Before the transaction completes, a second customer arrives, notices that there is a free teller, and begins a transaction. Before the first and second customers are serviced, a third customer arrives and notices that there are no free tellers. The third customer waits in queue. At this point it does not make any difference which of the two tellers becomes available first; the first available teller will service customer 3. If a fourth customer arrives before the first two customers are serviced, the fourth customer will wait in queue with customer 3.
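The teller scenario can be sketched with Python's built-in `threading.Semaphore`, where `acquire` corresponds to a customer waiting for a free teller and `release` to the teller becoming free again. The function name `run_bank`, the statistics dictionary, and the fixed 10 ms service time are assumptions made for this illustration.

```python
import threading
import time


def run_bank(tellers=2, customers=4):
    """Serve `customers` threads with at most `tellers` served concurrently."""
    free_tellers = threading.Semaphore(tellers)
    guard = threading.Lock()
    stats = {"active": 0, "max_active": 0, "served": 0}

    def customer():
        free_tellers.acquire()          # block while no teller is free
        try:
            with guard:
                stats["active"] += 1
                stats["max_active"] = max(stats["max_active"], stats["active"])
            time.sleep(0.01)            # the transaction (fixed here; variable in practice)
            with guard:
                stats["active"] -= 1
                stats["served"] += 1
        finally:
            free_tellers.release()      # the teller is free for the next customer

    threads = [threading.Thread(target=customer) for _ in range(customers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats
```

With two tellers and four customers, all four customers are eventually served, and the recorded `max_active` never exceeds two.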
The semaphore consists of three operations: initialize, post and wait. The initialize operation sets the initial value of the semaphore; in the teller example the value would be two (2). Note that this is only the initial value of the semaphore. The value of the semaphore can never be less than zero (0). The post operation increases the value of the semaphore by one (1) and wakens a thread (via a cond_signal) if the value of the semaphore was zero (0). A wait operation tests the value of the semaphore. If the value is zero (0), the thread will block waiting (via a cond_wait) for it to become non-zero. If the value of the semaphore is non-zero, the thread decrements the value by one (1) and continues.
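The three operations can be sketched from the same building blocks (a mutex, a condition variable and a counter), again using Python's `threading.Condition`. The names `CountingSemaphore` and `demo` are hypothetical. One deliberate deviation from the description above: this sketch signals on every post rather than only when the value was zero, a more conservative choice that avoids leaving a second waiter asleep when two posts arrive back to back.

```python
import threading


class CountingSemaphore:
    """Semaphore built from a mutex, a condition variable and a counter."""

    def __init__(self, initial):
        # initialize: set the starting value (2 in the teller example)
        if initial < 0:
            raise ValueError("semaphore value can never be less than zero")
        self._cv = threading.Condition(threading.Lock())
        self.value = initial

    def post(self):
        with self._cv:
            self.value += 1
            # Assumption of this sketch: signal unconditionally (see lead-in).
            self._cv.notify()           # cond_signal: wake one waiter, if any

    def wait(self):
        with self._cv:
            while self.value == 0:      # re-test after every wakeup
                self._cv.wait()         # cond_wait: sleep until signalled
            self.value -= 1


def demo():
    sem = CountingSemaphore(2)
    sem.wait()
    sem.wait()                          # both tellers busy, value is now 0
    served = threading.Event()

    def third_customer():
        sem.wait()
        served.set()

    t = threading.Thread(target=third_customer)
    t.start()
    blocked = not served.wait(timeout=0.1)   # no teller free yet
    sem.post()                               # a teller becomes free
    t.join(timeout=2)
    return blocked, served.is_set(), sem.value
```

In `demo`, the third customer stays blocked until the post occurs, after which the customer is served and the value returns to zero.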
Returning to the teller example, another teller (e.g., a third teller in the particular example) may open to service customers (e.g., upon detecting a long queue). In the semaphore case, this would be analogous to a post operation. A teller may also close upon detecting that there are idle tellers (e.g., too few customers). In the semaphore case, this would be analogous to a wait operation. Note that in these two examples the post and wait are performed by the tellers (i.e., resources) and not by customers (i.e., consumers).
The only place where the analogy is not strong is that a semaphore with waiters does not implement a queue; instead, it takes a “free-for-all” approach. When a post operation occurs on a semaphore that was zero (0) and has waiters, the waiting threads are woken and can attempt to acquire the resource.