A thread is a basic unit of program execution that carries out a certain program function. Threads are a way for a program to split a process into two or more sub-processes that can be run simultaneously. Certain operating systems can run multiple threads of a multi-threaded program in parallel on a number of different processors, allowing for fast execution of the program. Consider, for example, a program employed to numerically evaluate a definite integral $\int_a^b (f(x)+g(x))\,dx$, where f(x) and g(x) are continuous on the interval [a, b]. The program can split the evaluation into two threads, where the first thread is run on a first processor to evaluate the integral $\int_a^b f(x)\,dx$, and the second thread is run simultaneously on a second processor to evaluate the integral $\int_a^b g(x)\,dx$. The results of the first and second threads are summed to obtain the value of the integral $\int_a^b (f(x)+g(x))\,dx$.
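The two-thread evaluation described above can be sketched as follows. This is a minimal illustration only: the trapezoid rule, the helper names `trapezoid` and `integrate_sum`, and the choice of sin and cos as the integrands are assumptions made for the example. Note also that in standard CPython the two threads are typically interleaved rather than run on separate processors, so the sketch shows the structure of the split, not the speedup.

```python
import math
import threading

def trapezoid(func, a, b, steps=10_000):
    """Approximate the integral of func over [a, b] with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for i in range(1, steps):
        total += func(a + i * width)
    return total * width

def integrate_sum(f, g, a, b):
    """Integrate f and g in two separate threads, then sum the results."""
    results = [0.0, 0.0]  # one slot per thread, so no lock is needed

    def worker(slot, func):
        results[slot] = trapezoid(func, a, b)

    threads = [
        threading.Thread(target=worker, args=(0, f)),
        threading.Thread(target=worker, args=(1, g)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until both partial integrals are available
    return results[0] + results[1]

# Hypothetical integrands: the exact value of the summed integral is 2.
value = integrate_sum(math.sin, math.cos, 0.0, math.pi / 2)
```

Each thread writes to its own slot of `results`, so the only synchronization needed is the final `join` before the two partial results are summed.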
Multi-threaded programs are intrinsically composed of multiple concurrent threads of control. In order to achieve the intended behavior, two problems must be solved. One problem involves spatial synchronization of system resources, whereby mutually exclusive access of threads to certain shared resources is guaranteed. This problem can be solved with locks. A lock is a synchronization mechanism for ensuring that, at any one time, no more than one thread is using or modifying a given resource in an environment where several threads of execution must use or modify several shared resources. The other problem involves temporal synchronization of thread phases. This problem is typically solved with barriers. In parallel computing, barriers are employed to synchronize execution of a set of threads. The processing of each thread in the set must stop at the barrier and cannot continue until all of the threads have reached the barrier. In other words, the barrier ensures that all participating threads reach the barrier before any thread advances past it.
Because a barrier is most often a form of global synchronization, the time taken to process the barrier adds directly to the program's critical path. Large delays between the time the last thread reaches the barrier and the time all threads receive notification that the barrier has been released can profoundly lengthen a program's execution time. Numerous issues exacerbate the problem. For example, complex programs may have many barriers operating at any given time, and as the number of threads and thread sets increases, the number of barriers and the number of individual barrier participants also increase. Any shared physical resources that support these barriers will further increase barrier processing latency by reducing the number of barriers that the hardware can support in parallel.
Hardware-based barrier solutions may only support a limited number of simultaneously outstanding barriers, which may further increase barrier processing latency. Additional problems due to the global nature of barrier processing involve the delay and the power consumed in signaling over long signal paths.
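As an illustration of the two synchronization mechanisms discussed above, the following sketch uses the `threading.Lock` and `threading.Barrier` classes from the Python standard library. The thread count, the shared `counter`, and the two-phase `worker` function are hypothetical names chosen for the example.

```python
import threading

NUM_THREADS = 4
counter = 0                                 # shared resource
counter_lock = threading.Lock()             # spatial synchronization
barrier = threading.Barrier(NUM_THREADS)    # temporal synchronization
phase_one_totals = []

def worker():
    global counter
    # Phase 1: the lock ensures only one thread updates the counter at a time.
    for _ in range(1000):
        with counter_lock:
            counter += 1
    # No thread proceeds until all NUM_THREADS threads have arrived here.
    barrier.wait()
    # Phase 2: every thread now observes the completed phase-1 total.
    with counter_lock:
        phase_one_totals.append(counter)

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the barrier, a fast thread could read `counter` in phase 2 while slower threads were still incrementing it in phase 1.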
Software-based barrier solutions can be used to synchronize thread phases in a number of different applications, especially those that employ counters. However, their performance ultimately depends on the underlying computer architecture's structure and behavior. Physical components, such as the communication network and memory hierarchy, and architectural policies, such as cache coherence and memory consistency, can play a large and often unpredictable role in the performance of software-based barriers.
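A counter-based software barrier of the kind mentioned above might look like the following sketch. It is one common construction (a central counter plus a generation number), not a design taken from any particular system; the class name `CounterBarrier` and the event log used to exercise it are assumptions made for the example.

```python
import threading

class CounterBarrier:
    """A reusable barrier built from a shared arrival counter."""

    def __init__(self, parties):
        self.parties = parties      # number of threads that must arrive
        self.count = 0              # arrivals in the current phase
        self.generation = 0         # incremented each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_generation = self.generation
            self.count += 1
            if self.count == self.parties:
                # Last arrival: reset the counter and release everyone.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                # Earlier arrivals sleep until the generation changes.
                while my_generation == self.generation:
                    self.cond.wait()

events = []
events_lock = threading.Lock()
barrier = CounterBarrier(3)

def worker(name):
    with events_lock:
        events.append(("arrive", name))
    barrier.wait()
    with events_lock:
        events.append(("leave", name))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every arrival and the final release pass through the same shared counter, which is why the memory hierarchy and coherence behavior of the underlying machine can dominate the cost of a barrier of this kind.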
On the other hand, electrical hardware-based solutions requiring a dedicated barrier interconnect network have been proposed and built. See “Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters,” by S. Shang, IEEE Trans. on Parallel and Distributed Systems, Vol. 6, No. 6, June 1995. Hardware barriers built from binary trees and AND gates often perform better than their software counterparts. However, each leaf of the tree corresponds to a processing element (“PE”), and the expense of building the tree topology, either embedded in another existing interconnect or in an entirely separate interconnect, can be prohibitive.
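The logical function of a tree-based hardware barrier can be modeled in software as an AND reduction over per-PE arrival flags. The sketch below is an illustrative model only, under the assumption that each tree node is a two-input AND gate; it does not reproduce the circuit of the cited work, and the function name `and_tree` is hypothetical.

```python
def and_tree(arrived):
    """Reduce per-PE arrival flags through a binary tree of AND gates.

    Returns (released, depth): whether the root output is asserted, and
    the number of gate levels between the leaves and the root.
    """
    if not arrived:
        raise ValueError("at least one processing element is required")
    level = list(arrived)
    depth = 0
    while len(level) > 1:
        # Pair up neighbouring signals; an unpaired signal passes through.
        level = [all(level[i:i + 2]) for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth

all_arrived = and_tree([True] * 8)
one_missing = and_tree([True] * 7 + [False])
```

In this model the root is asserted only when every leaf is asserted, and the number of gate levels grows with the logarithm of the number of PEs, while the number of leaves and wires grows with the number of PEs themselves.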
An optical-based barrier solution has been proposed as an alternative to electrical hardware-based barrier solutions because lightwaves suffer from significantly less signal loss and distortion over longer distances than do electrical signals. For example, in “A Distributed Hardware Barrier in an Optical Bus-Based Distributed Shared Memory Multiprocessor,” by M. H. Davis Jr. and U. Ramachandran, Proc. 1992 Int'l Conf. Parallel Processing, Vol. 1, pp. 228-231, August 1992, the authors proposed that each PE broadcast onto a designated optical waveguide bus when it reaches a barrier. All participating PEs snoop the buses, using the PE-to-bus assignment to identify the broadcasting PE and fill in a PE-private electrical barrier tree. However, the hardware requirements of this approach may be prohibitive. For a given barrier instance, each PE requires its own barrier tree and broadcast waveguide, and the barrier function is still performed electronically.
Electrical engineers and computer scientists have recognized a need for optical-based barrier solutions that reduce the delays between the time the last thread reaches a barrier and when all processing elements receive notification that the barrier has been released.