Communications products require increased computational performance to process digital signals in software on a real-time basis. Increases in performance have come through improvements in process technology and in microprocessor design. Increased parallelism, higher clock rates, and increased densities, coupled with improved design tools and compilers, have made this more practical. However, many of these improvements incur additional overhead in memory and latency because the necessary bandwidth closely coupled to the computational units is lacking.
The performance level of a processor, and particularly a general purpose processor, can be estimated from the product of a plurality of interdependent factors: clock rate, gates per clock, number of operands, operand and data path width, and operand and data path partitioning. Clock rate is largely influenced by the choice of circuit and logic technology, but is also influenced by the number of gates per clock. Gates per clock is the number of gates in a pipeline that may change state in a single clock cycle. This number can be reduced by inserting latches into the data path: when the number of gates between latches is reduced, a higher clock rate is possible. However, the additional latches produce a longer pipeline, and thus come at a cost of increased instruction latency. The number of operands is straightforward; for example, by adding with carry-save techniques, three values may be added together with little more delay than is required for adding two values. Operand and data path width defines how much data can be processed at once; wider data paths can perform more complex functions, but generally this comes at a higher implementation cost. Operand and data path partitioning refers to the efficient use of the data path as width is increased, with the objective of maintaining substantially peak usage.
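The carry-save technique mentioned above can be illustrated with a minimal software model. This is only a sketch: the 64-bit width and the names MASK, carry_save_add and add3 are assumptions chosen for illustration and do not come from the text. It shows that three operands can be reduced to two by purely bitwise logic, with no carry rippling between bit positions, so that only one carry-propagating addition remains.

```python
MASK = (1 << 64) - 1  # assumed 64-bit data path width (illustrative only)


def carry_save_add(a, b, c):
    """Reduce three operands to a (sum, carry) pair without carry propagation.

    Each bit position is handled independently, as in a row of full adders.
    """
    partial_sum = a ^ b ^ c                      # per-bit sum, carries ignored
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, moved to its weight
    return partial_sum & MASK, carry & MASK


def add3(a, b, c):
    """Add three values modulo 2**64 using one carry-propagating addition."""
    partial_sum, carry = carry_save_add(a, b, c)
    return (partial_sum + carry) & MASK          # the only step where carries ripple
```

In hardware the reduction step costs roughly one full-adder delay regardless of operand width, which is why the three-operand addition is only slightly slower than a two-operand one.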
The last factor, operand and data path partitioning, is treated extensively in commonly-assigned U.S. Pat. Nos. 5,742,840, 5,794,060, 5,794,061, 5,809,321, and 5,822,603, herein incorporated by reference in their entirety, which describe systems and methods for enhancing the utilization of a general purpose processor by adding classes of instructions. These classes of instructions use the contents of general purpose registers as data path sources, partition the operands into symbols of a specified size, perform operations in parallel, catenate the results and place the catenated results into a general-purpose register. These patents, all of which are assigned to the same assignee as the present invention, teach a general purpose microprocessor which has been optimized for processing and transmitting media data streams through significant parallelism.
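The partition, operate and catenate sequence described above can be sketched in software as follows. This is an illustrative model under stated assumptions, not the instruction semantics of the cited patents: the function name partitioned_add, the 64-bit register width, and the choice of wrapping (modular) addition per symbol are all hypothetical.

```python
def partitioned_add(x, y, symbol_bits, register_bits=64):
    """Add two register values symbol by symbol and catenate the results.

    Each symbol wraps around independently, so no carry crosses a symbol
    boundary. Hardware would process all symbols at once; the loop here
    only models the result.
    """
    if register_bits % symbol_bits != 0:
        raise ValueError("symbol size must divide the register width")
    symbol_mask = (1 << symbol_bits) - 1
    result = 0
    for shift in range(0, register_bits, symbol_bits):
        a = (x >> shift) & symbol_mask            # partition the first operand
        b = (y >> shift) & symbol_mask            # partition the second operand
        result |= ((a + b) & symbol_mask) << shift  # operate, then catenate
    return result
```

The same pair of register values gives different results for different symbol sizes. With 16-bit symbols, 0x00FF0001 plus 0x00010001 yields 0x01000002, whereas with 8-bit symbols the byte 0xFF wraps to 0x00 without disturbing its neighbor, yielding 0x00000002.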
While the foregoing patents offered significant improvements in utilization and performance of a general purpose microprocessor, particularly for handling broadband communications such as media data streams, other improvements are possible.
Many general purpose processors have general registers to store operands for instructions, with the register width matched to the size of the data path. Processor designs generally limit the number of accessible registers per instruction because the hardware to access these registers is relatively expensive in power and area. While the number of accessible registers varies among processor designs, it is often limited to two, three or four registers per instruction when such instructions are designed to operate in a single processor clock cycle or a single pipeline flow. Some processors, such as the Motorola 68000, have instructions to save and restore an unlimited number of registers, but require multiple cycles to perform such an instruction.
The Motorola 68000 also attempts to overcome a narrow data path combined with a narrow register file by taking multiple cycles or pipeline flows to perform an instruction, thus emulating a wider data path. However, such multiple precision techniques offer only marginal improvement in view of the additional clock cycles required. The width and accessible number of the general purpose registers thus fundamentally limit the amount of processing that can be performed by a single instruction in a register-based machine.
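The cost of such multiple precision techniques can be seen in a small model. The sketch below is a generic illustration and is not a description of how the Motorola 68000 is implemented: the function name multiprecision_add and the default widths are assumptions. It adds two wide values using only a narrow adder, one pass per slice, and reports how many passes were needed.

```python
def multiprecision_add(x, y, path_bits=16, total_bits=64):
    """Add two total_bits-wide values using an adder only path_bits wide.

    Returns (result, carry_out, passes). Each pass stands for one trip
    through the narrow data path, so passes grows as total_bits / path_bits.
    """
    limb_mask = (1 << path_bits) - 1
    result, carry, passes = 0, 0, 0
    for shift in range(0, total_bits, path_bits):
        limb = ((x >> shift) & limb_mask) + ((y >> shift) & limb_mask) + carry
        result |= (limb & limb_mask) << shift   # keep the low path_bits of this slice
        carry = limb >> path_bits               # hand the carry to the next pass
        passes += 1
    return result, carry, passes
```

Because each pass depends on the carry from the previous one, the passes cannot overlap: a 64-bit addition on a 16-bit path takes four sequential passes, which is the additional clock-cycle cost the paragraph above refers to.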
Existing processors may provide instructions for which one or more operands are read from a general purpose processor's memory system. However, as these memory operands are generally specified by register operands, and the memory system data path is no wider than the processor data path, the width and accessible number of general purpose operands per instruction per cycle or pipeline flow is not enhanced.
The number of general purpose register operands accessible per instruction is generally limited by logical complexity and instruction size. For example, it might be possible to implement certain desirable but complex functions by specifying a large number of general purpose registers, but substantial additional logic would have to be added to a conventional design to permit simultaneous reading and bypassing of the register values. While dedicated registers have been used in some prior art designs to increase the number or size of source operands or results, explicit instructions load or store values into these dedicated registers, and additional instructions are required to save and restore these registers upon a change of processor context.
The size of an execution unit result may be constrained to that of a general register so that no dedicated or other special storage is required for the result. Specifying a large number of general purpose registers as a result would similarly require substantial additional logic to be added to a conventional design to permit simultaneous writing and bypassing of the register values.
When the size of an execution unit result is constrained, it can limit the amount of computation which can reasonably be handled by a single instruction. As a consequence, algorithms must be implemented in a series of single instruction steps in which all intermediate results can be represented within the constraints. By eliminating this constraint, instruction sets can be developed in which a larger component of an algorithm is implemented as a single instruction, and the representation of intermediate results is no longer limited in size. Further, some of these intermediate results are not required to be retained upon completion of the larger component of an algorithm, so a processor freed of these constraints can improve performance and reduce operating power by not storing and retrieving these results from the general register file. When the intermediate results are not retained in the general register file, processor instruction sets and implemented algorithms are also not constrained by the size of the general register file.
There has therefore been a need for a processor system capable of efficient handling of operands and results of greater width than either the memory system or any accessible general purpose register. There is also a need for a processor system capable of efficient handling of operands and results of greater overall size than the entire general register file.