1. Field of the Invention
The present invention relates to the design of multiprocessor systems. More specifically, the present invention relates to a method and an apparatus that facilitate inter-processor communication and synchronization through a hardware message buffer.
2. Related Art
As increasing semiconductor integration densities allow more transistors to be integrated onto a microprocessor chip, computer designers are investigating different methods of using these transistors to increase computer system performance. Some computer designers have begun to incorporate multiple processors into a single microprocessor chip. This can potentially speed up the execution of computational tasks by allowing a given computational task to be divided into subtasks that can be performed by multiple processors executing in parallel. Furthermore, by locating the processors on the same semiconductor chip, the performance-limiting effects of inter-processor communication delays can be significantly reduced.
Thus, multiple processors within a single semiconductor chip can be used to execute multi-threaded applications, wherein the multiple processors execute threads that operate on independent subtasks of a workload. However, many computational tasks cannot be efficiently partitioned into independent subtasks because of data dependencies.
For example, some loops can be parallelized by performing loop unrolling and software pipelining. In this way, a first processor can work on a given iteration of a loop while a second processor works on a subsequent iteration of the loop. However, data dependencies can cause synchronization problems because a given loop iteration may write to a data value that is used in a subsequent loop iteration. Hence, the subsequent loop iteration cannot proceed until the given loop iteration performs the write operation.
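The distinction between independent and dependent loop iterations can be illustrated with a minimal sketch in C. The function names `scale` and `prefix_sum` are hypothetical and are used here only for illustration. In the first loop each iteration touches only its own element, so iterations could be distributed across processors freely. In the second loop, iteration i reads the value written by iteration i-1, which is the kind of dependency that forces a subsequent iteration to wait.

```c
#include <stddef.h>

/* Hypothetical example: every iteration is independent, because
 * a[i] depends only on b[i]. Iterations could run on different
 * processors in any order. */
void scale(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

/* Hypothetical example: a loop-carried dependency. Iteration i reads
 * a[i - 1], which is written by iteration i - 1, so iteration i cannot
 * proceed until the previous iteration has performed its write. */
void prefix_sum(double *a, const double *b, size_t n)
{
    if (n == 0)
        return;
    a[0] = b[0];
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
```

If the iterations of `prefix_sum` were assigned to different processors, each processor would need some mechanism to learn that the preceding write has completed before reading `a[i - 1]`.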
These dependencies can be handled by synchronizing processors through inter-processor locks or memory barriers. However, using inter-processor locks or memory barriers can be prohibitively expensive because they often require various processor structures, such as load queues and store queues, to be flushed. Furthermore, the process of acquiring a lock variable may involve expensive cache coherence operations. Note that the overhead of using locks or memory barriers may be acceptable in loosely coupled parallel tasks that use locks infrequently. However, for more tightly coupled parallel applications, with more frequent data dependencies, the cost of using locks or memory barriers can largely negate the performance benefits derived from parallel execution.
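As a rough illustration of lock-based synchronization, the following sketch uses POSIX threads to hand a single value from one thread to another. The structure `handoff`, the functions `producer`, `consumer`, and `run_handoff`, and the value 42 are all hypothetical and chosen only for this example. It is not a measurement of the costs discussed above; it only shows that even a single dependent value requires a lock acquisition and release on both sides.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state for handing one value between two threads. */
struct handoff {
    pthread_mutex_t lock;
    pthread_cond_t  ready_cv;
    int ready;  /* nonzero once the producer has written value */
    int value;
};

/* Plays the role of the earlier loop iteration: performs the write. */
static void *producer(void *arg)
{
    struct handoff *h = arg;
    pthread_mutex_lock(&h->lock);
    h->value = 42;
    h->ready = 1;
    pthread_cond_signal(&h->ready_cv);
    pthread_mutex_unlock(&h->lock);
    return NULL;
}

/* Plays the role of the later iteration: waits for the write, then uses it. */
static void *consumer(void *arg)
{
    struct handoff *h = arg;
    pthread_mutex_lock(&h->lock);
    while (!h->ready)
        pthread_cond_wait(&h->ready_cv, &h->lock);
    h->value += 1;
    pthread_mutex_unlock(&h->lock);
    return NULL;
}

/* Runs both threads to completion and returns the final value. */
int run_handoff(void)
{
    struct handoff h;
    pthread_t p, c;

    pthread_mutex_init(&h.lock, NULL);
    pthread_cond_init(&h.ready_cv, NULL);
    h.ready = 0;
    h.value = 0;

    pthread_create(&c, NULL, consumer, &h);
    pthread_create(&p, NULL, producer, &h);
    pthread_join(p, NULL);
    pthread_join(c, NULL);

    pthread_mutex_destroy(&h.lock);
    pthread_cond_destroy(&h.ready_cv);
    return h.value;
}
```

In a tightly coupled loop, a hand-off of this kind would be repeated for every dependent iteration, so the per-iteration locking overhead accumulates.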
What is needed is a method and an apparatus that facilitate inter-processor communication and synchronization without the performance problems associated with using locks or memory barriers.