Referring to FIG. 1, a multiprocessor 1 is a machine containing more than one data processor (e.g., P0-P3). The data processors may be connected to each other by a bus or by a cross bar switch 2. Each of the processors may have an associated cache memory (C0-C3). The processors P0-P3 share a common system memory 3 through the bus or cross bar switch 2 and the associated cache (if provided). Each processor may also have a private memory (PM) that is not accessible to the other processors.
Each of the processors P0-P3 of the multiprocessor 1 may execute an associated task. For example, an audio application or task may run on one processor while a video application may run on another processor. In this case each processor executes its task in a substantially independent manner, without any strong interaction with the tasks running on the other processors.
In other cases, of most interest to this invention, a single task is partitioned into sub-tasks that are then executed cooperatively on two or more processors by assigning one processor to one sub-task. When several processors cooperate in this manner to execute a single task, they typically need to share, in a fair manner, common resources such as the memory 3, as well as buffers, printers, and other peripherals (not shown). In addition, the processors typically need to communicate with one another so as to share information needed at checkpoints, to wait for other processors to complete a certain routine, to signal to other processors that the processor has completed its assigned sub-task, etc.
A "thread" is the analog of a process in an environment where several tasks can be spawned by a single process. More specifically, a thread is one of a set of subprocesses that share a single address space. In this case off-stack (global) variables are shared among all the threads of a given program. Each thread executes on a separate call stack having its own separate local variables. All threads within a given process share system resources, such as a process ID, a process group ID, a session membership, real, effective and saved set-user-IDs, real, effective and saved set-group-IDs, supplementary group IDs, a current working directory, a root directory, a file mode creation mask, and file descriptors. The foregoing list of system resources is exemplary; a given application may use fewer or more resources than those listed. A thread that is the only member of its subprocess group is equivalent to a process.
A kernel thread refers to the execution of the thread in a kernel space, typically considered in the art to be a privileged space not accessible to user applications. A user thread refers to the execution of the thread in user space. In a threaded environment, m user threads may be mapped onto n kernel threads.
A thread-safe library is one which contains thread-safe functions. A thread-safe function is one which may be safely invoked concurrently by multiple threads. A reentrant function in a thread-safe environment is a function whose effect, when called by two or more threads, is guaranteed to be as if the threads executed the function one after another in an undefined order, even if the actual execution is interleaved. Library functions must be reentrant for the library to be considered thread-safe.
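By way of a non-limiting illustration, the following C sketch shows one common way of making a function reentrant: all working state is held in variables local to the caller, so that interleaved calls from different threads cannot disturb one another. The function name `count_tokens`, the comma delimiter and the 128-byte scratch buffer are assumptions made for this example only. The sketch relies on the POSIX function `strtok_r`, which keeps its scan position in a caller-supplied pointer, whereas `strtok` keeps it in hidden shared state.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/* Hypothetical helper: counts the comma-separated tokens in text.
   Every piece of state (the scratch copy and the scan position) lives
   on the caller's stack, so concurrent calls do not interfere.
   Returns -1 if text does not fit in the scratch buffer. */
int count_tokens(const char *text)
{
    char copy[128];     /* private scratch copy; strtok_r modifies its input */
    char *save = NULL;  /* per-call scan position used by strtok_r */
    int n = 0;

    if (strlen(text) >= sizeof copy)
        return -1;
    strcpy(copy, text);

    for (char *tok = strtok_r(copy, ",", &save);
         tok != NULL;
         tok = strtok_r(NULL, ",", &save))
        n++;

    return n;
}
```

Because no static or global variable is written, two threads calling `count_tokens` at the same time obtain the same results as if the calls had been made one after the other.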
Currently available thread software packages typically have functions to create a thread and to begin the execution of some function. A newly created thread finishes when the function it executes finishes, or when the thread is explicitly terminated. Thread packages also typically provide a variety of synchronization primitives, such as primitives used for mutual exclusion (e.g., mutexes, condition variables and semaphores), for waiting for events to be posted from other threads, for posting events to other threads, etc. Specific details of these thread-related concepts may be obtained from "Operating Systems Principles", Prentice Hall 1973, by Per Brinch Hansen, or from "Cooperating Sequential Processes", Technical Report Technological University 1965, by E. W. Dijkstra.
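As an illustrative sketch only, thread creation and termination of the kind described above may be expressed with the POSIX Pthreads calls `pthread_create` and `pthread_join`. The names `square_worker` and `square_in_thread` are hypothetical and are introduced solely for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work function: squares the integer it is handed. */
static void *square_worker(void *arg)
{
    int *value = arg;
    *value = (*value) * (*value);
    return NULL;  /* the thread finishes when this function returns */
}

/* Creates one thread, waits for it to finish, and returns its result.
   Returns -1 if the thread could not be created or joined. */
int square_in_thread(int n)
{
    pthread_t tid;
    int value = n;

    if (pthread_create(&tid, NULL, square_worker, &value) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return value;
}
```

The creating thread reads `value` only after `pthread_join` returns, at which point the new thread is known to have finished.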
It should be noted that while creating and destroying threads is less computationally expensive than creating and destroying processes, it is still not efficient to create and destroy threads at a fine granularity, wherein small pieces of work or tasks are executed in parallel and may require a high degree of synchronization and communication.
A synchronization operation is implied when two or more threads have to share a resource. For example, assume that a thread A is inserting work items into a work buffer that is processed by a thread B. After inserting a work item, thread A increments a count of the work items in the buffer. Similarly, after processing a work item, thread B decrements the count of the work items in the buffer. Assume for this example that the buffer can hold 100 work items, and that the counter is currently at 58. Assume now further that thread A begins to increment the count from 58 to 59, and at the same time thread B begins to decrement the count from 58 to 57. Each thread reads the value 58 before the other thread has written its result. If thread B finishes the decrement operation later, the counter is at 57; if thread A finishes the increment operation later, the counter is at 59. Neither counter value is correct, as the correct value after one increment and one decrement is 58. This problem occurs because both thread A and thread B are allowed to operate on the counter at the same time. This is referred to in the art as a synchronization problem. The solution to this problem is to disallow thread B from modifying the counter when thread A is modifying the counter, and vice-versa. Traditional solutions to this problem have resorted to the use of mutual exclusion primitives provided by the operating system. One drawback to this technique is that it involves a system call operation, which can require several tens of processor cycles to execute. As a result, the use of mutual exclusion primitives is not suitable when the work item is small, since the overhead of using the mutual exclusion primitives negates any benefit that can be obtained by using two threads to perform the work.
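The traditional mutual exclusion solution to the counter example may be sketched as follows, using a Pthreads mutex. This is a minimal illustration under stated assumptions: the names `work_count`, `inserter`, `remover` and `run_counter_demo`, and the number of updates per thread, are hypothetical, and the sketch models only the counter, not the 100-item buffer itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define UPDATES_PER_THREAD 100000  /* hypothetical workload size */

static long work_count;  /* shared count of work items */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Plays the role of thread A: increments the count under the lock. */
static void *inserter(void *unused)
{
    (void)unused;
    for (int i = 0; i < UPDATES_PER_THREAD; i++) {
        pthread_mutex_lock(&count_lock);
        work_count++;
        pthread_mutex_unlock(&count_lock);
    }
    return NULL;
}

/* Plays the role of thread B: decrements the count under the lock. */
static void *remover(void *unused)
{
    (void)unused;
    for (int i = 0; i < UPDATES_PER_THREAD; i++) {
        pthread_mutex_lock(&count_lock);
        work_count--;
        pthread_mutex_unlock(&count_lock);
    }
    return NULL;
}

/* Starts at 58, runs both threads to completion, returns the final count.
   Returns -1 if a thread could not be created. */
long run_counter_demo(void)
{
    pthread_t a, b;

    work_count = 58;
    if (pthread_create(&a, NULL, inserter, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, remover, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return work_count;
}
```

With the lock in place, equal numbers of increments and decrements leave the count at its starting value of 58. If the lock and unlock calls are removed, the two read-modify-write sequences may interleave and the final value becomes unpredictable. Every update pays for a lock and an unlock, which is the overhead discussed above.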
FIG. 2 conceptually depicts the overall structure of an exemplary application executing a task in parallel, wherein a main thread and a child thread perform the necessary work in a cooperative manner. The main thread gathers work from the application and stores it into work buffers (Task Buffers A-C), and the child thread executes the work items stored in the work buffers. If all of the work buffers are filled, so that no buffer is available to receive new work, the main thread assists the child thread by selecting a work buffer and executing the work items in the selected work buffer. This approach ensures that all processors in the system are utilized efficiently, since the processor to which the main thread is assigned is not required to idle until a work buffer becomes available. Since the main thread and the child thread may attempt to access a work buffer at the same time, a situation arises that requires synchronization. That is, some mechanism must be provided to ensure that each work buffer is processed only once, either by the main thread or by the child thread, but not by both threads. In addition, it is important to also ensure that work is performed in a finite amount of time, i.e., there should be no situation wherein the main thread assumes that the child thread will process the work buffer, and vice-versa, as the occurrence of such a situation may cause the work items in the work buffer to never be processed.
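The two requirements stated above, namely that each work buffer is processed exactly once and that no filled buffer is left unprocessed, may be illustrated by the following C sketch. The sketch uses a conventional Pthreads mutex to arbitrate claims and is offered only to make the requirements concrete; it is not a description of any particular synchronization mechanism for FIG. 2. The names `task_buffer`, `claim_buffer`, `drain` and `run_buffer_demo` are assumptions, and the execution of work items is replaced by a simple counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_BUFFERS 3  /* Task Buffers A-C */

enum buf_state { BUF_FULL, BUF_CLAIMED };

struct task_buffer {
    enum buf_state state;
    int times_processed;  /* stand-in for executing the work items */
};

static struct task_buffer buffers[NUM_BUFFERS];
static pthread_mutex_t claim_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the index of a full buffer now owned by the caller,
   or -1 if every buffer has already been claimed. */
static int claim_buffer(void)
{
    int claimed = -1;

    pthread_mutex_lock(&claim_lock);
    for (int i = 0; i < NUM_BUFFERS; i++) {
        if (buffers[i].state == BUF_FULL) {
            buffers[i].state = BUF_CLAIMED;
            claimed = i;
            break;
        }
    }
    pthread_mutex_unlock(&claim_lock);
    return claimed;
}

/* Run by both threads: keep claiming buffers until none is left. */
static void *drain(void *unused)
{
    int i;

    (void)unused;
    while ((i = claim_buffer()) >= 0)
        buffers[i].times_processed++;  /* only the claiming thread touches it */
    return NULL;
}

/* Fills all buffers, lets the main and child threads drain them, and
   returns 1 if each buffer was processed exactly once, 0 if not,
   or -1 if the child thread could not be created. */
int run_buffer_demo(void)
{
    pthread_t child;
    int ok = 1;

    for (int i = 0; i < NUM_BUFFERS; i++) {
        buffers[i].state = BUF_FULL;
        buffers[i].times_processed = 0;
    }
    if (pthread_create(&child, NULL, drain, NULL) != 0)
        return -1;
    drain(NULL);  /* the main thread assists instead of idling */
    pthread_join(child, NULL);

    for (int i = 0; i < NUM_BUFFERS; i++)
        if (buffers[i].times_processed != 1)
            ok = 0;
    return ok;
}
```

Because a buffer changes from full to claimed only while the lock is held, at most one thread can own it, and because neither thread stops while a full buffer remains, no buffer is left waiting on the other thread.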
Traditionally, synchronization is accomplished by using synchronization primitives provided in the thread library. One example of a thread library is known in the art as the POSIX Pthreads library (see IEEE Standards Project: Draft Standard for Information Technology--Portable Operating System Interface (POSIX) Amendment 2: Threads Extension [C Language] Tech Report P1003.4a Draft 7, IEEE Standards Department, Apr. 23, 1993).
Before claiming a resource a thread must typically first obtain a lock on the resource. By definition, when obtaining the lock the thread knows that no other thread owns the lock for the resource, and that the thread is thus free to use the resource. If a second thread desires to claim the resource, it must wait to obtain the lock until the first thread is finished using the resource. When the first thread finishes using the resource it releases the lock for the resource, thereby allowing other threads to access the resource.
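The locking discipline described above may be observed with the following C sketch, in which a second thread probes a lock with the Pthreads call `pthread_mutex_trylock` rather than blocking on it. The names `resource_lock`, `second_thread` and `probe_lock` are hypothetical and serve only this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t resource_lock = PTHREAD_MUTEX_INITIALIZER;

/* The second thread attempts to take the lock without waiting and
   records the outcome: 0 on success, EBUSY if the lock is held. */
static void *second_thread(void *arg)
{
    int *result = arg;

    *result = pthread_mutex_trylock(&resource_lock);
    if (*result == 0)
        pthread_mutex_unlock(&resource_lock);  /* done with the resource */
    return NULL;
}

/* Returns what the second thread observed. If first_holds_lock is
   nonzero, the calling (first) thread owns the lock during the probe.
   Returns -1 if the second thread could not be created. */
int probe_lock(int first_holds_lock)
{
    pthread_t tid;
    int result = -1;

    if (first_holds_lock)
        pthread_mutex_lock(&resource_lock);    /* first thread claims the resource */

    if (pthread_create(&tid, NULL, second_thread, &result) == 0)
        pthread_join(tid, NULL);

    if (first_holds_lock)
        pthread_mutex_unlock(&resource_lock);  /* release so others may proceed */

    return result;
}
```

While the first thread holds the lock, the probe reports `EBUSY`; once the lock is free, the probe succeeds. A second thread calling `pthread_mutex_lock` instead would simply wait until the first thread released the lock.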
One drawback to the use of this technique is that the lock functions defined in the thread library, which are typically slow, must be executed. Moreover, in actual implementations, the execution of a lock function requires that a request be made for the services of the operating system, which can be a very slow process. Such time penalties are magnified when the work to be performed with the critical resource is itself not very time consuming. Thus, for an application that requires the use of fine-grained synchronization, it is typically not cost effective to use the synchronization primitives provided with the thread library.