Many different types of computing systems have attained widespread use around the world. These computing systems include personal computers, servers, mainframes and a wide variety of stand-alone and embedded computing devices. Sprawling client-server systems exist, with applications and information spread across many PC networks, mainframes and minicomputers. In a distributed system connected by networks, a user may access many application programs, databases, network systems, operating systems and mainframe applications. Computers provide individuals and businesses with a host of software applications including word processing, spreadsheet, accounting, e-mail, voice over Internet protocol telecommunications, and facsimile.
Users of digital processors such as computers continue to demand greater and greater performance from such systems for handling increasingly complex and difficult tasks. In addition, processing speed has increased much more quickly than main memory access speed. As a result, cache memories, or caches, are used in many systems to increase performance in a relatively cost-effective manner. Many modern computers also support “multi-tasking” or “multi-threading” in which two or more programs run concurrently, with various resources in the processor pipeline allocated to two different threads on any given cycle.
Modern computers include at least a first level cache L1 and typically a second level cache L2, for increasing the speed of memory access by the processor. This dual cache memory system enables storing frequently accessed data and instructions close to the execution units of the processor to minimize the time required to transmit data to and from memory. L1 cache is typically on the same chip as the execution units. L2 cache is external to the processor chip but physically close to it. Ideally, as the time for execution of an instruction nears, instructions and data are moved to the L2 cache from a more distant memory. When the time for executing the instruction is imminent, the instruction and its data, if any, are advanced to the L1 cache.
A common architecture for high performance, single-chip microprocessors is the reduced instruction set computer (RISC) architecture characterized by a small simplified set of frequently used instructions for rapid execution. Thus, in a RISC architecture, a complex instruction comprises a small set of simple instructions that are executed in steps very rapidly. These steps are performed in execution units adapted to execute specific simple instructions. These execution units typically comprise load/store units, integer Arithmetic/Logic Units, floating point Arithmetic/Logic Units, and Graphical Logic Units. In an architecture with multiple execution units, instructions can be issued to two or more of these units to be executed in parallel.
The Arithmetic/Logic Unit (ALU) performs arithmetic operations and logic operations on operands provided to it. For example, FIG. 1 shows a typical architecture of an arithmetic/logic unit in a digital processor. Two N:1 multiplexers, 102 and 104, receive operands from different sources. One such source is the result of instructions that are just finishing execution, received from 4:1 multiplexer 118. Operands A and B are latched in latches 106 and 108, respectively. The latch contents are forwarded to the execution macros: an adder 110, a rotator 112, a logical unit 114, and other functions unit 116.
Adder 110 adds the operands, A and B, received from latches 106 and 108. To perform the addition, the adder must perform a Generate function 120 and a Propagate function 122. The Generate function is the bitwise logical AND of the two operands. The Propagate function is the bitwise logical OR of the two operands. Rotator 112 receives an operand and rotates it. Logical unit 114 performs various logical functions such as AND, OR, XOR, etc. In a PowerPC architecture there are 8 basic types of logical operations provided by the Instruction Set Architecture (ISA). These are: AND, NAND, OR, NOR, ANDC, ORC, XOR, and EQV. Their values are given in Table 1.
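The role of the Generate and Propagate functions in an addition can be illustrated with a minimal sketch. The text does not say how adder 110 combines these signals; the bit-serial (ripple) carry chain below is an assumption chosen for clarity, and the function name, the `width` parameter, and the `carry_in` parameter are hypothetical.

```python
def add_with_generate_propagate(a: int, b: int, width: int = 8, carry_in: int = 0):
    """Add two width-bit operands using bitwise Generate and Propagate signals.

    Returns (sum modulo 2**width, carry_out). The ripple carry chain is an
    illustrative assumption, not the structure of adder 110.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    generate = a & b   # Generate: bitwise AND of the operands
    propagate = a | b  # Propagate: bitwise OR of the operands

    carry = carry_in & 1
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        # Sum bit is the XOR of both operand bits and the incoming carry.
        total |= (a_i ^ b_i ^ carry) << i
        # Carry out of bit i: generated here, or propagated from below.
        g_i = (generate >> i) & 1
        p_i = (propagate >> i) & 1
        carry = g_i | (p_i & carry)
    return total, carry
```

For instance, `add_with_generate_propagate(0b1011, 0b0110, width=4)` returns `(0b0001, 1)`, which corresponds to 11 + 6 = 17 with the carry out representing the fifth bit.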
TABLE 1

A  B  A and B  A nand B  A or B  A nor B  A andc B  A orc B  A xor B  A eqv B
0  0  0        1         0       1        0         1        0        1
0  1  0        1         1       0        0         0        1        0
1  0  0        1         1       0        1         1        1        0
1  1  1        0         1       0        0         1        0        1
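The eight operations of Table 1 can be written out as bitwise expressions, which makes the ANDC and ORC columns explicit: ANDC is A AND (NOT B), and ORC is A OR (NOT B). The following is a minimal sketch; the names `LOGICAL_OPS`, `logical`, and `truth_table` are hypothetical, and the `width` parameter is an assumption used to keep complemented results within a fixed word size.

```python
# Each entry takes operands a, b and a mask m that limits results to the word width.
LOGICAL_OPS = {
    "and":  lambda a, b, m: a & b,
    "nand": lambda a, b, m: ~(a & b) & m,
    "or":   lambda a, b, m: a | b,
    "nor":  lambda a, b, m: ~(a | b) & m,
    "andc": lambda a, b, m: a & ~b & m,    # A AND (NOT B)
    "orc":  lambda a, b, m: (a | ~b) & m,  # A OR (NOT B)
    "xor":  lambda a, b, m: a ^ b,
    "eqv":  lambda a, b, m: ~(a ^ b) & m,  # complement of XOR
}


def logical(op: str, a: int, b: int, width: int = 1) -> int:
    """Apply one of the eight logical operations to width-bit operands."""
    mask = (1 << width) - 1
    return LOGICAL_OPS[op](a & mask, b & mask, mask)


def truth_table():
    """Return rows (A, B, and, nand, or, nor, andc, orc, xor, eqv) for single bits."""
    rows = []
    for a in (0, 1):
        for b in (0, 1):
            rows.append((a, b) + tuple(logical(op, a, b) for op in LOGICAL_OPS))
    return rows
```

Calling `truth_table()` yields four rows in the same column order as Table 1, and `logical` applies the same operations across a wider word when `width` is increased.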
FIG. 2 shows an embodiment of a logical unit 114. The operands and their complements are input to certain ones of three-input NAND gates 202, 204, 206, and 208 as shown. The outputs of these NAND gates are input to a 4-input NAND gate 210. Selectors C1, C2, C3, and C4 determine the operation performed. By appropriate selection, all eight of the logical operations of Table 1 can be performed. This logic configuration requires a considerable amount of circuitry for its implementation, and thus more surface area on the processor chip. Further, the logic is slow because the data flows through two stages of multiple-input gates.
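A two-stage NAND-NAND structure of this kind can be modeled for a single bit as follows. FIG. 2 is not reproduced here, so the assignment of operand polarities and selectors to gates 202 through 208 is an assumption: each first-stage gate is taken to receive one selector and one of the four combinations of A, B, and their complements. The names `nand`, `logical_unit_bit`, and `SELECTORS` are hypothetical.

```python
def nand(*inputs: int) -> int:
    """Multi-input NAND on single-bit values."""
    return 0 if all(inputs) else 1


def logical_unit_bit(a: int, b: int, c1: int, c2: int, c3: int, c4: int) -> int:
    """One bit of a NAND-NAND logical unit (gate wiring is assumed, not from FIG. 2)."""
    na, nb = 1 - a, 1 - b
    n202 = nand(c1, a, b)    # assumed: selects the term A AND B
    n204 = nand(c2, a, nb)   # assumed: selects the term A AND (NOT B)
    n206 = nand(c3, na, b)   # assumed: selects the term (NOT A) AND B
    n208 = nand(c4, na, nb)  # assumed: selects the term (NOT A) AND (NOT B)
    # Second stage: NAND of the four outputs equals the OR of the selected terms.
    return nand(n202, n204, n206, n208)


# Selector settings (C1, C2, C3, C4) under the assumed wiring above.
SELECTORS = {
    "and":  (1, 0, 0, 0),
    "nand": (0, 1, 1, 1),
    "or":   (1, 1, 1, 0),
    "nor":  (0, 0, 0, 1),
    "andc": (0, 1, 0, 0),
    "orc":  (1, 1, 0, 1),
    "xor":  (0, 1, 1, 0),
    "eqv":  (1, 0, 0, 1),
}
```

Under this assumed wiring, each selector enables one row of the truth table, so any of the eight operations of Table 1 is obtained by choosing which rows produce a 1. The sketch also shows why every result passes through two levels of multiple-input gates.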
Thus, there is a need for a logic implementation in the arithmetic/logic unit of a digital processor that increases speed and requires less circuitry to implement.