The present invention relates generally to data processors, and more particularly to methods and apparatus for grouping data processor instructions and an instruction system for use therewith.
Many different types of data processors are available. Some data processors have multiple execution units that may be used concurrently. Scheduling of instructions for such data processors can be either dynamic or static. Both types of systems operate on a sequential instruction stream which has been prepared for execution using conventional program preparation software tools, including optimizing assemblers and compilers. In general, dynamic systems require significantly more hardware in the data processor, while static systems require more sophisticated program preparation software techniques. The common goal, however, is to identify and exploit instruction level parallelism inherent in the instruction stream while maintaining the appearance of sequentiality of execution.
In a dynamic instruction scheduling system, special hardware within the data processor maintains a sliding window of visibility into the sequential instruction stream. Each instruction dispatch cycle, the scheduling hardware selects as many of the visible instructions as can be dispatched together without violating instruction serial constraints. Additional hardware maintains a record of each instruction while in flight and, depending upon system conditions, either aborts or retires the instruction appropriately. An example of a dynamically scheduled data processor is the Motorola MPC604 microprocessor.
In a static instruction scheduling system, the program preparation software tool, after it has generated and, perhaps, optimized the serial instruction stream, reexamines that stream and, based upon information describing the hardware configuration and operating characteristics of the target data processor, groups together those instructions that can safely be executed in parallel. Due to the difficulty of predicting the actions of certain program constructs, such as indirect or computed memory references, it is not possible to guarantee optimal scheduling in advance. To accommodate such non-predictable constructs, some hardware interlocks will usually be provided. An example of a statically scheduled data processor was the Multiflow Trace 7/1428. The compiler for the Trace machine was commonly referred to as the Bulldog compiler, the name given it by its original authors while at Yale University.
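The grouping step described above can be illustrated with a minimal sketch. It is not the Bulldog compiler's algorithm; it is a conservative greedy pass written for illustration only. The Instr record, the group_static function, the register names and the default group width of 4 are all hypothetical, and memory references (the hard case noted above) are deliberately ignored.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    # Hypothetical instruction record: one destination register, any sources.
    name: str
    dest: str
    srcs: tuple

def group_static(stream, width=4):
    """Greedily split a serial stream into groups that are safe to run in parallel.

    A new group starts when the next instruction has a register dependency
    on the current group, or when the (assumed) group width is reached.
    """
    groups, current = [], []
    written, read = set(), set()
    for ins in stream:
        conflict = (
            any(s in written for s in ins.srcs)  # read-after-write
            or ins.dest in written               # write-after-write
            or ins.dest in read                  # write-after-read (conservative)
        )
        if current and (conflict or len(current) == width):
            groups.append(current)
            current, written, read = [], set(), set()
        current.append(ins)
        written.add(ins.dest)
        read.update(ins.srcs)
    if current:
        groups.append(current)
    return groups

stream = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("mul", "r4", ("r5", "r6")),
    Instr("sub", "r7", ("r1", "r4")),  # needs the results of add and mul
    Instr("or",  "r8", ("r9", "r9")),
]
names = [[i.name for i in g] for g in group_static(stream)]
print(names)  # [['add', 'mul'], ['sub', 'or']]
```

Because the pass never reorders instructions, the appearance of sequential execution is preserved; a real program preparation tool would typically also reorder independent instructions to fill more slots.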
In grouping instructions for the Trace, which was a Very Long Instruction Word (VLIW) machine, the Bulldog compiler was constrained to a VLIW having either 7, 14 or 28 fixed function instruction slots, depending upon the machine model. Any instruction slot for which the compiler could not find useful work was simply filled with a no-operation (NOP) instruction (i.e., all zeroes). Rather than store these useless NOPs in memory, the compiler squashed out the NOPs and preceded the set of useful instruction words comprising each VLIW with a bit map which indicated the location of the squashed NOPs (or, viewed conversely, the useful instruction words). At prefetch time, the Trace cache/memory controller used the information in the bit map word to regenerate the NOPs so that the cache was filled with fully populated VLIWs. Each bit map was discarded once the corresponding VLIW was regenerated during prefetch, and no part of the instruction dispatch or execution hardware was even aware of its existence. This mechanism, even though it increased by one word the logical length of every VLIW in memory, generally tended to reduce the physical length of the stored VLIWs due to the inability of the compiler to fill all of the instruction slots in every VLIW with useful instruction words. On the other hand, for well-designed code, in which the compiler fills most or all of the instruction slots, the added bit map word could significantly increase the actual code size in memory.
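The NOP-squashing scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the actual Trace memory format: the function names squash, expand and stored_words are hypothetical, a set bit is assumed to mark a useful instruction word, and bit 0 is assumed to correspond to slot 0.

```python
NOP = 0  # the text above describes a NOP as all zeroes

def squash(vliw):
    """Return (bitmap, useful_words) for one fully populated VLIW."""
    bitmap = 0
    useful = []
    for slot, word in enumerate(vliw):
        if word != NOP:
            bitmap |= 1 << slot  # assumed polarity: set bit = useful word
            useful.append(word)
    return bitmap, useful

def expand(bitmap, useful, slots):
    """Regenerate the fully populated VLIW, as the prefetch step would."""
    words = iter(useful)
    return [next(words) if (bitmap >> slot) & 1 else NOP
            for slot in range(slots)]

def stored_words(vliw):
    """Words occupied in memory: one bit map word plus the useful words."""
    _, useful = squash(vliw)
    return 1 + len(useful)

sparse = [0x11, 0, 0, 0x22, 0, 0, 0x33]  # 7-slot VLIW with four NOPs
bitmap, useful = squash(sparse)
assert expand(bitmap, useful, len(sparse)) == sparse
print(stored_words(sparse))                 # 4 words instead of 7
print(stored_words([1, 2, 3, 4, 5, 6, 7]))  # 8 words instead of 7
```

The two printed values show both sides of the trade-off: a sparsely filled VLIW shrinks in memory, while a fully populated one grows by the bit map word.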
A more recent example of a statically scheduled data processor is the Texas Instruments TMS320C62xx microprocessor family ('C62). In the 'C62, every instruction word includes a dedicated "P" bit which, if set by the program preparation software tool, indicates to the dispatch hardware that the instruction word can be dispatched in parallel with the following instruction word. Thus, a simultaneously dispatchable "execution packet" is comprised of an instruction word having a clear P bit and up to a maximum number of preceding instruction words, each having a set P bit. U.S. Pat. No. 5,560,028 discloses a variation on this mechanism in which the sense of the parallel dispatch control bit is toggled between each set of parallel-dispatchable instruction words. In the above statically scheduled systems, by dedicating a bit of each instruction to the grouping function, a significant portion of the instruction is not useable for other functions, such as encoding data processing operations.
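The P-bit grouping rule described above can be sketched as follows. This is a simplified illustration rather than a model of the actual dispatch hardware: the function name execution_packets is hypothetical, the P bit is assumed to be the least significant bit of each word, the maximum packet size of 8 is an assumed parameter, and a packet that reaches that maximum is simply closed.

```python
P_BIT = 0x1  # assumed position of the P bit within an instruction word

def execution_packets(words, max_packet=8):
    """Split a sequence of instruction words into execution packets.

    A word with its P bit set is dispatched in parallel with the word
    that follows it; a word with a clear P bit ends the current packet.
    """
    packets, current = [], []
    for word in words:
        current.append(word)
        if not (word & P_BIT) or len(current) == max_packet:
            packets.append(current)
            current = []
    if current:  # trailing words whose last P bit was set
        packets.append(current)
    return packets

# P bits of these six words, in order: 1, 1, 0, 0, 1, 0
words = [0x11, 0x21, 0x30, 0x40, 0x51, 0x60]
packets = execution_packets(words)
print([[hex(w) for w in p] for p in packets])
# [['0x11', '0x21', '0x30'], ['0x40'], ['0x51', '0x60']]
```

In this sketch the grouping information costs one bit in every word, which is the encoding overhead the paragraph above points out.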
A premium is placed on execution speed for processing data and instructions using such a multiple execution unit data processor. However, as the complexity of the data processor architecture is increased using multiple execution units, the computer instruction code size tends to increase. In many applications, increased code size is undesirable due to the increased cost and space needed for additional memory. One such application is a class of data processors known as digital signal processors (DSPs). DSPs are used in many applications, such as cellular phones, where a premium is placed on small size and low power. It would be desirable for a multiple execution unit data processor to provide faster instruction processing without significantly expanding instruction code size.
Accordingly, there is a need for improved methods and apparatus for grouping computing instructions and for an improved instruction system.