As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
The increased parallelism of processor architectures, however, also places unique demands on the software that executes within such architectures, as the communication costs associated with passing data between different hardware threads in a multithreaded architecture can become a bottleneck on overall system performance.
One technique that has been used to facilitate the communication of data between different devices or components implemented within a multithreaded architecture relies on a shared data structure known as a push buffer. For example, a push buffer may be used to convey commands from one device or component (hereinafter referred to as a “host device”) to another device or component (hereinafter referred to as a “target device”). A push buffer has the benefit of requiring a relatively modest amount of shared memory space, along with relatively little synchronization required between the host and target devices.
A push buffer may be implemented, for example, as a circular queue, with head and tail pointers used to point to entries in the queue. The head pointer typically points to the first entry in the queue awaiting processing by the target device, and the tail pointer typically points to the last entry in the queue awaiting processing by the target device, or alternatively the next unused entry. A host device thus places a command on the push buffer in the unused entry nearest the last pending entry, and then updates the tail pointer to reflect the addition of the new command. The target device, on the other hand, retrieves commands from the push buffer based upon the head pointer, and updates the head pointer as commands are pulled from the push buffer. Whenever a head or tail pointer reaches the end of the memory space allocated to the push buffer, the pointer rolls over to the start of the memory space, hence the circular nature of the push buffer.
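The circular-queue mechanics described above can be sketched as follows. This is a minimal single-process illustration only; the class and method names are assumptions for the sketch, and a hardware implementation would additionally need memory synchronization between the host and target devices.

```python
class PushBuffer:
    """Minimal sketch of a circular push buffer with head and tail pointers."""

    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.capacity = capacity
        self.head = 0  # first entry awaiting processing by the target device
        self.tail = 0  # next unused entry, written by the host device

    def push(self, command):
        """Host side: place a command in the next unused entry, then advance tail."""
        next_tail = (self.tail + 1) % self.capacity  # roll over at end of memory space
        if next_tail == self.head:
            raise BufferError("push buffer full")
        self.entries[self.tail] = command
        self.tail = next_tail

    def pull(self):
        """Target side: retrieve the oldest pending command, then advance head."""
        if self.head == self.tail:
            return None  # buffer empty: the target device is starved for work
        command = self.entries[self.head]
        self.head = (self.head + 1) % self.capacity  # roll over at end of memory space
        return command
```

One entry is kept unused in this sketch so that an empty buffer (head equals tail) can be distinguished from a full one, a common convention for circular queues that avoids a separate count shared between the two devices.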
While push buffers provide a relatively low overhead mechanism for passing commands between independently operating host and target devices, it has been found that push buffers have drawbacks in many applications. First, if a host device does not place commands on a push buffer as quickly as a target device pulls those commands from the push buffer, the target device may become starved for work and execute below its maximum capabilities. Second, in some applications, a host device may need to communicate a variable amount of data to a target device in connection with a given command. Push buffers typically operate most efficiently when entries are fixed in size, so selecting an entry size that is too small for a particular command may require command data to be passed via several commands and entries, lowering performance, while selecting an entry size that is too large to accommodate some commands wastes valuable memory space for the commands having lower memory requirements.
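The fixed-entry-size trade-off can be made concrete with a small sketch: a payload larger than one entry must be fragmented across several entries (and reassembled by the target), while a payload smaller than the entry size leaves the remainder of its entry unused. The helper below is purely illustrative and not drawn from any particular implementation.

```python
def split_payload(payload, entry_size):
    """Split a variable-length command payload into fixed-size chunks,
    one chunk per push-buffer entry.

    len(chunks) > 1 is the multi-entry cost for oversized payloads;
    entry_size - len(chunks[-1]) is the space wasted in the final entry.
    """
    if entry_size <= 0:
        raise ValueError("entry size must be positive")
    return [payload[i:i + entry_size] for i in range(0, len(payload), entry_size)]
```

For example, an 8-byte payload with a 3-byte entry size occupies three entries, with one byte of the third entry left unused; the same payload with a 16-byte entry size fits in one entry but wastes half of it.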
Therefore, a need exists in the art for an improved manner of communicating commands from a host device to a target device via a push buffer.