Complex, performance-oriented processor subsystems consist of a processor core and some number of coprocessors. The core off-loads tasks to the coprocessors, which typically perform specialized functions for which they are optimized. The type and number of coprocessors depend on many system aspects, such as performance requirements, the type of processing tasks that need to be off-loaded from the core, and power and size considerations.
One technique for connecting a core to its associated coprocessors is a direct connection, in which the core connects to the coprocessor through a coprocessor interface. The interface typically consists of signals from the core to the coprocessor that indicate the function the coprocessor is to perform, along with the arguments on which the function is to be performed. Signals from the coprocessor to the core convey the results of the function. This type of connection may be used for a coprocessor that has a relatively shallow processing pipeline. Typically, the core thread that transfers work to the coprocessor is stalled until the coprocessor returns a result to the core. The coprocessor has few execution threads, and may be single-threaded, so that it executes only one unit of work at a time. An example of this type of coprocessor is a floating point unit. Such an interface has no facility for handling a backlog or for deferring coprocessor work. This technique may be suboptimal because it can leave silicon dark (idle) and reduce system throughput.
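The stall behavior of a directly connected coprocessor can be illustrated with a small cycle-stepped model. The following is a minimal sketch under stated assumptions, not a description of any particular hardware: the class `DirectCoprocessor`, the function `run_core`, the opcodes `fadd` and `fmul`, and the fixed three-cycle latency are all hypothetical names and values chosen for illustration.

```python
import operator

# Hypothetical opcodes for a floating-point-style coprocessor (assumption).
OPS = {"fadd": operator.add, "fmul": operator.mul}


class DirectCoprocessor:
    """Single-threaded coprocessor model: one unit of work at a time, no queue."""

    def __init__(self, latency_cycles=3):
        self.latency_cycles = latency_cycles  # assumed fixed latency per operation
        self.pending = None                   # the single in-flight operation, if any
        self.remaining = 0

    def issue(self, opcode, a, b):
        # The interface has no facility for a backlog: a second request
        # while one is in flight cannot be deferred.
        if self.pending is not None:
            raise RuntimeError("coprocessor busy; work cannot be queued")
        self.pending = (opcode, a, b)
        self.remaining = self.latency_cycles

    def tick(self):
        """Advance one cycle; return the result on the final cycle, else None."""
        if self.pending is None:
            return None
        self.remaining -= 1
        if self.remaining > 0:
            return None
        opcode, a, b = self.pending
        self.pending = None
        return OPS[opcode](a, b)


def run_core(coprocessor, program):
    """Issue each operation and stall the core thread until its result returns."""
    results, stalled_cycles = [], 0
    for opcode, a, b in program:
        coprocessor.issue(opcode, a, b)
        result = None
        while result is None:          # the core does no other work here
            result = coprocessor.tick()
            stalled_cycles += 1
        results.append(result)
    return results, stalled_cycles


if __name__ == "__main__":
    cop = DirectCoprocessor(latency_cycles=3)
    print(run_core(cop, [("fadd", 1.5, 2.0), ("fmul", 2.0, 4.0)]))
```

In this model every cycle the coprocessor spends computing is a cycle the issuing thread spends stalled, and a second `issue` while work is in flight is rejected rather than deferred, which is the limitation noted above.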
Another technique involves indirect connectivity between the core and the coprocessor. Function invocations or other work units for the coprocessor are placed in a common memory to which both the core and the coprocessor have access. The core builds a work queue for the coprocessor in the common memory and starts the coprocessor executing the work queue by means of a configuration register access to the coprocessor. The coprocessor executes the functions on the work queue and returns results to a dedicated memory in the coprocessor to which the core has access, to a common shared memory, or directly to the core. An example of this type of coprocessor is a direct memory access (DMA) engine. However, this technique requires the core to initialize configuration registers, the coprocessor to update the configuration registers to indicate completion status, and the core to monitor the configuration registers, perhaps by polling. The technique may be suboptimal due to the complexity of the coordination logic in the core and the coprocessor and due to increased power consumption.
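The coordination steps of the indirect technique (build a work queue in common memory, start the coprocessor through a configuration register, poll for completion) can be sketched as follows. This is an illustrative model only: the class `QueuedDmaEngine`, the function `core_offload`, the bit values `CTRL_START` and `STATUS_DONE`, the `(src, dst, length)` descriptor layout, and the rate of one descriptor per cycle are assumptions made for the example and do not correspond to any real device.

```python
from collections import deque

CTRL_START = 0x1   # hypothetical control-register bit: begin executing the queue
STATUS_DONE = 0x1  # hypothetical status-register bit: queue fully processed


class QueuedDmaEngine:
    """DMA-style coprocessor model that drains a work queue held in common memory."""

    def __init__(self, memory, work_queue):
        self.memory = memory          # common memory shared with the core
        self.work_queue = work_queue  # descriptors: (src, dst, length)
        self.ctrl = 0
        self.status = 0

    def write_ctrl(self, value):
        # Configuration register write from the core; starting clears DONE.
        self.ctrl = value
        if value & CTRL_START:
            self.status &= ~STATUS_DONE

    def read_status(self):
        return self.status

    def tick(self):
        """Advance one cycle: process at most one descriptor (assumed rate)."""
        if not (self.ctrl & CTRL_START):
            return
        if self.work_queue:
            src, dst, length = self.work_queue.popleft()
            self.memory[dst:dst + length] = self.memory[src:src + length]
        if not self.work_queue:
            # The coprocessor updates its registers to indicate completion.
            self.ctrl &= ~CTRL_START
            self.status |= STATUS_DONE


def core_offload(engine, work_queue, descriptors):
    """Core side: build the queue, start the engine, then poll until done."""
    work_queue.extend(descriptors)
    engine.write_ctrl(CTRL_START)
    polls = 0
    while not (engine.read_status() & STATUS_DONE):
        engine.tick()   # stands in for time passing between status reads
        polls += 1
    return polls


if __name__ == "__main__":
    memory = bytearray(b"abcdefgh" + bytes(8))
    queue = deque()
    engine = QueuedDmaEngine(memory, queue)
    print(core_offload(engine, queue, [(0, 8, 4), (4, 12, 4)]), bytes(memory))
```

Even in this reduced form, the core and the coprocessor each carry register-handling logic, and the core spends cycles reading the status register while the work completes, which reflects the coordination and power costs described above.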