The present invention pertains to computing systems and the like. More specifically, the present invention relates to reducing the number of instruction transactions in a microprocessor.
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. In contrast, superpipelined microprocessors include a large number of pipeline stages; executing an instruction is carried out as a number of steps that are, in turn, subdivided into substeps of as nearly equal duration as possible. Therefore, in order to execute an instruction completely, all of the substeps must be executed sequentially. FIG. 1 illustrates a conventional executable instruction dataword 100. The instruction dataword 100 is typically formed of an opcode field 102, an operand specifier A, and an associated operand specifier B. An execution result specifier field C is used to identify where the result of the executed instruction is stored. The opcode field 102 defines the specific operation to be performed on the operands A and B. Typical operations include, for example, addition, multiplication, branching, looping, and shifting. The result of such an operation is stored at the location specified by the execution result specifier C and is then made available for subsequent executable instructions.
FIG. 2A illustrates a conventional computing system 200 arranged to perform desired calculations based upon a user supplied program. The typical program used by the computing system 200 is generally formed of an ordered list of executable instructions, referred to as code, each of which is associated with a particular program counter (PC). The computing system 200 includes a programming memory 202 configured to store the executable instructions that form the program at memory locations corresponding to the program counters. Typically, the programming memory 202 is connected to a central processing unit (CPU) 204 by way of a bi-directional programming bus 206. The program counters are, in turn, used to point to the location within the memory 202 at which the corresponding executable instruction is stored.
By way of example, a typical instruction 220 executed by the computing system 200 is shown in FIG. 2B. The line of code 220 includes a program counter (PC) that points to a location 10 in the programming memory 202 where the instruction (composed of the opcode ADD and the respective operand specifiers 20 and 30) to be executed is to be found. In this case, the CPU 204 will add the values stored in locations 20 and 30 and store the result in memory location 100 as indicated by the RESULT specifier field.
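The instruction of FIG. 2B can be sketched as follows. This is a minimal illustrative model, not the actual encoding used by the computing system 200; the `Instruction` tuple, the dictionary-based memory, and the stored operand values are assumptions chosen for exposition.

```python
# Toy model of the instruction at PC 10: ADD the values at locations
# 20 and 30, storing the result at location 100 (per FIG. 2B).
from collections import namedtuple

Instruction = namedtuple("Instruction", ["opcode", "operand_a", "operand_b", "result"])

memory = {20: 7, 30: 5}                           # example operand values (assumed)
program = {10: Instruction("ADD", 20, 30, 100)}   # PC 10 -> ADD 20 30 -> 100

insn = program[10]                                # fetch the instruction at PC 10
if insn.opcode == "ADD":
    memory[insn.result] = memory[insn.operand_a] + memory[insn.operand_b]

print(memory[100])  # 12
```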
Referring again to FIG. 2A, conventional instruction processing generally includes decoding the instruction, executing the instruction, and storing the execution result in the memory location in the programming memory 202 or in a register of the register file identified by the instruction. More specifically, during what is referred to as a fetch stage, the instruction 220 is fetched by a fetching unit 208 from the memory 202 based upon the memory address indicated by the program counter. At this point the fetching unit 208 parses the instruction 220 into the opcode data field 102 and the operand specifiers A and B. The opcode data field 102 is then conveyed by way of an issued instruction bus 210 to a decoder 212. Meanwhile, the operands at A and B are read from the register file 214.
Once received, the decoder 212 uses the opcode data field 102 to select a function unit (FU) such as, for example, function unit 216, arranged to perform the function corresponding to the opcode included in the opcode data field 102. By way of example using the line of code above, the FU 216 is an arithmetic logic unit (ALU) arranged to perform an ADDing operation upon the respective operands A and B stored in the register file 214. At this point, the FU 216 is ready to execute the instruction. It should be noted that the FU 216 can, in fact, be any appropriate function unit capable of executing the function indicated by the instruction opcode. Such function units include, for example, an arithmetic logic unit (ALU), a shifter, a multiplier, and so on. Once executed, the FU 216 outputs the execution result to the destination specified by C in the register file 214, where it is stored until such time as the value C is passed to the memory.
Operations related to accessing instructions within the programming memory 202 are a major factor limiting the overall performance of the computing system 200, and more particularly the performance of the CPU 204. Such situations occur, for example, with large memories having long data access times, or in cases where the memory 202 is remotely located from the CPU 204, incurring long transmission delays. In these cases, the performance of the CPU 204, measured in the number of instructions executed per second, is limited by the ability to retrieve instructions from the programming memory 202.
Conventional approaches to increasing microprocessor performance (i.e., increasing the number of instructions executed per second) include adding a cache memory 218 for storing instructions. Even though the cache memory 218 is shown to be internal to the memory 202, it should be noted that the cache memory 218 can also be an external cache memory located outside the main memory 202 in close proximity to the CPU 204. Typically, the cache memory 218 is accessed more quickly than the memory 202. Generally, the cache memory 218 stores instructions from the memory 202 in what are referred to as cache lines. A cache line is formed of a plurality of contiguous bytes which are typically aligned such that the first of the contiguous bytes resides at an address having a certain number of low order bits set to zero. The certain number of low order bits is sufficient to uniquely identify each byte in the cache line. The remaining bits of the address form a tag which is used to refer to the entire cache line.
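The address split described above can be sketched as follows. The 64-byte line size is an illustrative assumption; with it, the six low order bits uniquely identify a byte within the line and the remaining bits form the tag.

```python
# Sketch of cache-line addressing: low-order bits select a byte within
# the line, the remaining bits form the tag identifying the line.

LINE_SIZE = 64                              # bytes per line (assumed, power of two)
OFFSET_BITS = LINE_SIZE.bit_length() - 1    # 6 low-order bits for a 64-byte line

def split_address(addr):
    offset = addr & (LINE_SIZE - 1)   # byte position within the cache line
    tag = addr >> OFFSET_BITS         # remaining bits refer to the entire line
    return tag, offset

tag, offset = split_address(0x1A47)
print(hex(tag), offset)  # 0x69 7
```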
Even though including larger cache memories may increase the performance of the CPU 204 by making instructions readily available, the larger caches have commensurately longer cache access times. Longer cache access times restrict system performance by limiting the number of instructions per second available for the CPU 204 to execute, regardless of the inherent clock cycle time of the CPU 204. As used in this discussion, the term cache access time refers to the interval of time required from the presentation of an address to the cache until the corresponding bytes are available for use by the CPU 204. As an example, a set associative cache access time includes time for indexing the cache storage, time for comparing the tags to the access address in order to select a row, and time for conveying the selected data from the cache.
Increasing cache access time is particularly deleterious to instruction caches used with high frequency microprocessors. By increasing the cache access time, the bandwidth of the issued instruction bus 210 is substantially reduced, particularly when the cache access time becomes longer than the clock cycle time of the CPU 204.
In view of the foregoing, it should be apparent that increasing instruction issue bus bandwidth without resorting to increasing cache memory size would be desirable.
An improved system for increasing the performance of a microprocessor is described. More specifically, the system is arranged to increase the number of instructions executed by a microprocessor by selectively storing instructions in a cache memory associated with a corresponding function unit in the microprocessor. In one embodiment of the invention, a method for reducing the number of issued instructions in a computer system is described. In one embodiment, if a fetched instruction program counter (PC) matches a cached instruction tag, then an opcode and the associated instruction are injected directly into the function unit identified by the opcode without fetching from memory. In this way, the issued instruction bus is bypassed, which increases the effective bandwidth of the issued instruction bus.
In another embodiment, an apparatus for reducing the number of instructions carried by an issued instruction bus and a program bus in a computing system having a central processor unit (CPU) is disclosed. The CPU is connected to a program memory by way of the program bus and includes a fetching unit connected to the program bus for fetching instructions from the program memory as directed by the CPU. The CPU also contains a plurality of function units each capable of configurably performing a specified operation as directed by the CPU based upon an opcode included in an issued instruction. Each of the function units is connected to the fetching unit by way of the issued instruction bus and receives appropriate issued instructions based upon the opcode. The apparatus includes a plurality of tag program counter (PC) cache memory devices each being associated with one of the function units. Each of the plurality of tag PC cache memory devices stores a corresponding tag PC and a target PC. An injector unit couples each of the plurality of tag PC cache memory devices to each of their respective function units such that when a fetched instruction includes a PC that matches a tagged PC stored in the tag PC cache memory devices, the PC is changed to the target PC and the actual instruction, rather than being fetched from the memory, is injected into the function unit by the tagged cache associated with that function unit.
In another embodiment of the invention, a method for reducing the number of instructions carried by an issued instruction bus and a program bus in a computing system having a central processor unit (CPU) is disclosed. The CPU is connected to a program memory by way of the program bus and includes a fetching unit connected to the program bus. The CPU also includes a plurality of function units each of which is connected to the fetching unit by way of the issued instruction bus. The CPU also includes a plurality of tag program counter (PC) cache memory devices each being associated with one of the function units. The CPU further includes an injector unit coupling each of the plurality of tag PC cache memory devices to each of their respective function units such that each of the plurality of tag PC cache memory devices stores a corresponding tagged PC, a target PC, and a target instruction. The target instruction has an associated target opcode used to select the function unit associated with the respective tag PC cache memory device. The method is performed by the following operations. First, an instruction is fetched from the program memory based upon a program counter (PC) associated with the instruction. If the instruction PC matches a tag PC, the instruction PC is updated to the target PC associated with the tagged PC entry in the tag PC cache memory. The target opcode and the target instruction are then injected directly into the function unit corresponding to the target opcode. If it is determined that the instruction PC does not match any entry in the tag PC cache memory, the instruction PC is incremented by the corresponding issue group size. In either case, the instruction is then executed.
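The operations above can be sketched as a single fetch step. This is a minimal model under stated assumptions: each tag PC cache entry is modeled as a mapping from a tagged PC to a (target PC, target opcode, target instruction) triple, the issue group size is taken as one, and all names are illustrative rather than taken from the disclosed apparatus.

```python
# Sketch of one fetch step: on a tag PC hit, the cached instruction is
# injected directly into its function unit (bypassing the issued
# instruction bus) and the PC is updated to the target PC; on a miss,
# the PC advances by the issue group size.

ISSUE_GROUP_SIZE = 1   # assumed issue group size

def fetch_step(pc, tag_pc_cache, function_units):
    """Return the next PC, injecting the cached instruction on a hit."""
    if pc in tag_pc_cache:                       # instruction PC matches a tag PC
        target_pc, opcode, instruction = tag_pc_cache[pc]
        function_units[opcode](instruction)      # inject directly into the FU
        return target_pc                         # PC updated to the target PC
    return pc + ISSUE_GROUP_SIZE                 # miss: increment by issue group size

executed = []
units = {"ADD": executed.append}                 # FU modeled as a callable
cache = {10: (11, "ADD", ("ADD", 20, 30, 100))}  # tagged PC 10 -> target PC 11

next_pc = fetch_step(10, cache, units)           # hit: injects, jumps to PC 11
print(next_pc, executed)
```

On a miss (e.g., a PC with no matching tag), `fetch_step` simply advances the PC and leaves the function units untouched.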
In still another embodiment, a method for executing an instruction in a central processing unit (CPU) is described, the CPU comprising a fetching unit, a plurality of function units, an instruction bus coupled between the fetching unit and the plurality of function units, a plurality of caches each corresponding to one of the plurality of function units, and an injector unit coupling each of the caches to its corresponding function unit. The instruction is fetched from memory associated with the CPU and it is then determined whether the instruction corresponds to any entries in any of the caches. If the instruction corresponds to an entry in one of the caches, the instruction is injected directly into a corresponding one of the function units via the injector unit, thereby bypassing the instruction bus. However, where the instruction does not correspond to an entry in one of the caches, the instruction is transmitted to an appropriate function unit via the instruction bus.