The present invention relates to microprocessors, and in particular, to an apparatus and method for the programmable coupling between a central processing unit (CPU) and one or more co-processors.
Microprocessors based on the ARM architecture typically allow for only a single thread of instruction for any thread or process that is executing at a particular time. Frequently, an ARM microprocessor is implemented to utilize single-instruction issue logic for dispatching instructions down a single processing pipeline. Accordingly, when there are one or more co-processors present, the primary ARM processor (referred to herein as the main core) and the co-processors work serially on the same thread of instruction. This mode of operation is generally referred to as a “coupled” mode of operation, indicating that the one or more co-processors are tightly coupled with the primary processor, or main core. FIG. 1 illustrates an example implementation of a co-processor 10 and an ARM main core 12 configured to operate in coupled mode. As illustrated in FIG. 1, the co-processor is completely dependent upon the main core for receiving instructions (e.g., via instruction path 14) and data (e.g., via load/store path 16).
With single-instruction issue logic, only one instruction gets issued to an instruction pipeline per instruction cycle. With multi-instruction issue logic, it is possible to issue multiple instructions, and hence, more than one processing pipeline may be issued an instruction during a single instruction cycle. However, the nature of a typical application is such that an instruction thread is more likely to occupy one of the co-processors more than the other co-processor(s) or the main core. For instance, consider an ARM main core coupled with a SIMD (single instruction, multiple data) integer co-processor, such as a Wireless MMX™ co-processor. While executing the instructions of a video-intensive application, the instructions for performing the video processing are generally executed on the Wireless MMX™ co-processor. Accordingly, most of the instructions execute on the Wireless MMX™ co-processor, and in most of the instruction cycles the main core pipeline is empty or used for loading data to the Wireless MMX™ co-processor. Each instruction cycle for which the main core pipeline has an empty instruction slot (referred to as an idle slot, or stall cycle) represents a processing inefficiency.
FIG. 2 provides a simplified timing diagram to illustrate the general nature of the problem. In FIG. 2, the line designated with reference number 18, going from left to right, represents the passage of time. The line designated with reference number 20 represents the processing of a single instruction thread over a period of time. Specifically, the line designated with reference number 20 indicates whether the main core or the co-processor is actively executing instructions of a particular instruction thread at any given moment in time. For instance, from the beginning (time T=0), the instruction thread (represented by line 20) is being processed by the main core. However, when a particular instruction for the co-processor is encountered, processing of the instruction thread eventually passes to the co-processor. For instance, in FIG. 2 processing passes from the main core to the co-processor at time T=1. During the time that the co-processor is processing the instruction thread, the main core is idle (as indicated by the dotted line designated with reference number 22). Eventually, when the co-processor has completed processing its portion of the instruction thread, processing of the instruction thread will pass back to the main core. For instance, in FIG. 2 processing passes from the co-processor back to the main core at time T=2. As illustrated in FIG. 2, at any particular moment in time, either the main core or the co-processor is idle, thereby introducing inefficiency into the system.
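By way of illustration only, the serial hand-off shown in FIG. 2 can be modeled with a short sketch. The segment durations, the unit names, and the helper `idle_time` are hypothetical and are not taken from the figures; the sketch assumes only that exactly one unit is active on the instruction thread at any moment, as described above.

```python
def idle_time(segments, units=("main_core", "co_processor")):
    """Return how long each unit sits idle while one thread runs serially.

    `segments` is a list of (unit, duration) pairs in program order.
    In coupled mode only one unit is busy during each segment.
    """
    total = sum(duration for _, duration in segments)
    busy = {unit: 0 for unit in units}
    for unit, duration in segments:
        busy[unit] += duration
    return {unit: total - busy[unit] for unit in units}


# Hypothetical timeline loosely following FIG. 2: the main core runs from
# T=0 to T=1, the co-processor from T=1 to T=2, then the main core resumes.
# The length of the final segment (one time unit) is an assumption.
segments = [("main_core", 1), ("co_processor", 1), ("main_core", 1)]
idle = idle_time(segments)
print(idle)
```

Under these assumptions, the idle times of the two units always sum to the total elapsed time, which restates the observation that one of the two units is idle at every moment.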
FIGS. 3 and 4 illustrate tables showing examples of the idle instruction slots that are introduced into a main core pipeline during the processing of a video-intensive application. As illustrated in the table of FIG. 3, each table entry in the column with heading “ARM” represents an instruction slot of a main core pipeline for an instruction cycle corresponding with the particular row of the table entry. Similarly, in the table of FIG. 3, each table entry in the column with heading “Co-Processor” represents an instruction slot of the main core pipeline during an instruction cycle corresponding with the particular row of the table entry. For example, as illustrated in the table of FIG. 3, the row labeled as row 1 (representing instruction cycle 1) indicates that the instruction slot of the main core pipeline corresponding with the main core is empty—indicating a stall cycle—while the instruction slot of the main core pipeline corresponding with the co-processor contains an instruction, “WLDRD wR0, [r0]”. From the table shown in FIG. 3, it can be seen that fifty percent of the instruction slots are empty. As such, a video-intensive application executing on an ARM core coupled with a co-processor leaves much to be desired in terms of processing efficiency.
The problem is aggravated even further when the main core is coupled with multiple co-processors. FIG. 4 illustrates a table showing examples of the idle instruction slots for a main core pipeline of a main core coupled with two co-processors. In particular, a second co-processor representing a data management unit or stream control unit has been added. In this case, the stream control unit processes instructions that “feed” data to the SIMD co-processor, thereby relieving the main core of this task. As a result, the main core pipeline has even more idle slots. As illustrated in FIG. 4, seventy-eight percent of the instruction cycles of the main core have an idle slot. Again, these idle slots represent a processing inefficiency.
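The idle-slot percentages discussed in connection with FIGS. 3 and 4 are simple ratios of empty slots to total slots in a column. The following sketch shows one way such a ratio might be computed. The table contents are hypothetical and do not reproduce the figures: only the instruction “WLDRD wR0, [r0]” appears in the description above, and the remaining entries (`arm_op_a`, `coproc_op_a`, `coproc_op_b`) are placeholders, as is the helper name `idle_fraction`.

```python
def idle_fraction(table, column):
    """Fraction of instruction cycles in which `column` has an empty slot.

    Each row of `table` is one instruction cycle; None marks an idle slot.
    """
    slots = [row[column] for row in table]
    return sum(slot is None for slot in slots) / len(slots)


# Hypothetical four-cycle table in the style of FIG. 3 (not the actual data).
table = [
    {"ARM": None,       "Co-Processor": "WLDRD wR0, [r0]"},
    {"ARM": "arm_op_a", "Co-Processor": None},
    {"ARM": None,       "Co-Processor": "coproc_op_a"},
    {"ARM": None,       "Co-Processor": "coproc_op_b"},
]

arm_idle = idle_fraction(table, "ARM")
coproc_idle = idle_fraction(table, "Co-Processor")
print(f"ARM idle: {arm_idle:.0%}, co-processor idle: {coproc_idle:.0%}")
```

A table with an additional co-processor column, as in FIG. 4, could be handled by the same function by adding a further key to each row.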