A computer system typically contains a CPU, a main memory, and one or more input/output (I/O) devices. FIG. 1 is a simplified diagram of a computer system 100. The CPU 101 fetches instructions from the main memory 102, and then executes these instructions. Main memory 102 is a memory storage device that stores blocks of instructions and data copied from an external disk memory 111 that is accessed via the I/O devices 103. I/O devices 103 are used to access external devices such as disk memory 111, user input devices 112 (e.g., keyboards), and display devices 113 (e.g., monitors).
Memory access times play an important role in determining the operating speed of a computer system. Accesses to disk memory are much slower than accesses to main memory, because the instructions and data must be provided through an I/O device. Therefore, the main memory is provided to reduce the frequency of accesses to disk memory. However, instructions that require accessing main memory are still significantly slower than instructions that can be carried out entirely within the CPU.
FIG. 2 shows a first type of CPU having an “accumulator-based” CPU architecture. Accumulator-based CPU 200 includes an instruction register 201, an accumulator 202, and an operation block 203. Instruction register 201 is a register in which the currently-executing instruction is stored. Accumulator 202 is a special register that provides one of the values on which the current instruction operates, and for some instructions (e.g., when the instruction provides a numerical result) is also used to store the result of the instruction. Operation block 203 is a control and execution circuit that can include, for example, an Arithmetic Logic Unit (ALU), a program counter register containing an address pointer to the main memory location in which the next instruction is stored, a parallel port providing access to the main memory, and so forth.
Accumulator-based CPUs were among the earliest-developed CPUs. They are best used in architectures having a relatively small instruction size, e.g., 8–16 bits. To reduce the instruction size, only one source address is included in the instruction, and no destination address is included. Instead, the value in the accumulator is always used as one of the operands, and the destination address is always the accumulator. Thus, at most one memory address is included in the instruction, that of the second operand.
Because only one operand is specified in each instruction, accumulator-based CPUs allow efficient instruction encoding and decoding, which decreases the cycle time of the CPU.
As an example of accumulator-based operation, the following sequence of pseudo-code instructions performs the function “a=b+c+d” in an accumulator-based CPU. The letters “a”, “b”, “c”, and “d” are addresses in main memory. The term “Acc” refers to the accumulator. Note that four memory accesses are required: three to fetch the operands, and one to store the result. Each of these memory accesses has an associated latency, which is added to the latency of the arithmetic (e.g., addition) operation.
(1) load b     // Acc ← b
(2) add c      // Acc ← Acc + c
(3) add d      // Acc ← Acc + d
(4) store a    // a ← Acc
In step (1), the value at memory location “b” is loaded into the accumulator. In step (2), the value at memory location “c” is added to the value in the accumulator. In step (3), the value at memory location “d” is added to the value in the accumulator. In step (4), the value in the accumulator is stored in memory location “a”.
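Steps (1)–(4) can be traced with a short Python sketch. It is offered only as an illustration: the function name run_accumulator, the tuple encoding of the instructions, and the sample values stored at “b”, “c”, and “d” are assumptions that do not appear in the figures.

```python
def run_accumulator(program, memory):
    """Run (opcode, address) pairs on a hypothetical one-accumulator machine.

    Returns the number of data accesses made to main memory.
    """
    acc = 0
    accesses = 0
    for op, addr in program:
        if op == "load":        # Acc <- memory[addr]
            acc = memory[addr]
        elif op == "add":       # Acc <- Acc + memory[addr]
            acc += memory[addr]
        elif op == "store":     # memory[addr] <- Acc
            memory[addr] = acc
        else:
            raise ValueError(f"unknown opcode: {op}")
        accesses += 1           # every instruction here touches main memory
    return accesses


# Sample values are assumptions chosen for the illustration.
memory = {"a": 0, "b": 2, "c": 3, "d": 4}
program = [("load", "b"), ("add", "c"), ("add", "d"), ("store", "a")]
accesses = run_accumulator(program, memory)
print(memory["a"], accesses)
```

In this model every instruction carries exactly one memory address, so the access count equals the instruction count.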
FIG. 3 shows another CPU architecture called a “load-store” architecture. A load-store architecture does not include an accumulator; instead, a register file 304 is used. (Other portions of CPU 300 are similar to those of FIG. 2; therefore, they are not further described here.) Register file 304 includes several registers that can be used as source registers and destination registers for instructions executed by the operation block.
For example, the following sequence of pseudo-code instructions performs the function “a=b+c+d” in a load-store CPU. In this CPU, the register file includes at least five registers, R1–R5.
(5) load R1, b        // R1 ← b
(6) load R2, c        // R2 ← c
(7) load R3, d        // R3 ← d
(8) add R4, R1, R2    // R4 ← R1 + R2
(9) add R5, R4, R3    // R5 ← R4 + R3
(10) store a, R5      // a ← R5
In step (5), the value at address “b” is stored in register R1. In step (6), the value at address “c” is stored in register R2. In step (7), the value at address “d” is stored in register R3. In step (8), the values stored in registers R1 and R2 are added, and the result is stored in register R4. In step (9), the values stored in registers R4 and R3 are added, and the result is stored in register R5. In step (10), the value stored in register R5 is stored in address “a” of the main memory.
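The same computation can be sketched for the load-store sequence of steps (5)–(10). As before, this is a hedged illustration: the function name run_load_store, the tuple encoding, the eight-entry register file, and the sample memory values are assumptions, not details taken from FIG. 3.

```python
def run_load_store(program, memory, num_regs=8):
    """Run a hypothetical load-store program.

    Returns the final register file and the number of main-memory accesses.
    """
    regs = [0] * num_regs
    accesses = 0
    for instr in program:
        op = instr[0]
        if op == "load":            # ("load", rd, addr): rd <- memory[addr]
            _, rd, addr = instr
            regs[rd] = memory[addr]
            accesses += 1
        elif op == "add":           # ("add", rd, rs1, rs2): rd <- rs1 + rs2
            _, rd, rs1, rs2 = instr
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "store":         # ("store", addr, rs): memory[addr] <- rs
            _, addr, rs = instr
            memory[addr] = regs[rs]
            accesses += 1
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs, accesses


# Sample values are assumptions chosen for the illustration.
ls_memory = {"a": 0, "b": 2, "c": 3, "d": 4}
ls_program = [
    ("load", 1, "b"),
    ("load", 2, "c"),
    ("load", 3, "d"),
    ("add", 4, 1, 2),
    ("add", 5, 4, 3),
    ("store", "a", 5),
]
ls_regs, ls_accesses = run_load_store(ls_program, ls_memory)
print(ls_memory["a"], ls_accesses, ls_regs[1:6])
```

Only the load and store instructions touch main memory in this model; the two add instructions operate entirely on the register file.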
Comparing the two instruction sequences shows that the same number of memory accesses is required, i.e., three memory reads to load the values stored at locations “b”, “c”, and “d”, and one memory write to store the result at location “a”. However, in the load-store sequence (steps (5)–(10)), the memory accesses (i.e., the load and store commands) have been separated from the add instructions. This separation allows for simpler instructions (e.g., a simpler operation block) and a consequently faster CPU cycle time.
Additionally, separating memory accesses from execution instructions such as the add instruction allows compilers to produce highly optimized code. For example, the values of “b”, “c”, “d”, “b+c”, and “b+c+d” remain in the register file, and can be reused by the program at a later time without fetching the values from memory or recalculating the addition results. Thus, the total number of memory accesses is typically reduced. Because memory accesses often make a significant contribution to the overall execution time of a program, a load-store CPU can execute some types of code significantly faster than an accumulator-based CPU. However, load-store architectures typically require a larger instruction size, in order to specify two operands and a destination address.
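The reuse argument can be made concrete with a small tally. The sketch below assumes a hypothetical second statement, “e=b+c”, that follows “a=b+c+d”; neither that statement nor the address “e” appears in the examples above, and the (text, touches-memory) encoding is likewise an assumption made for illustration.

```python
# Each entry is (instruction text, True if it accesses main memory).
accumulator_program = [
    ("load b", True), ("add c", True), ("add d", True), ("store a", True),
    # e = b + c: the accumulator was overwritten, so b and c are fetched again.
    ("load b", True), ("add c", True), ("store e", True),
]

load_store_program = [
    ("load R1, b", True), ("load R2, c", True), ("load R3, d", True),
    ("add R4, R1, R2", False), ("add R5, R4, R3", False),
    ("store a, R5", True),
    # e = b + c: the sum is still held in R4, so only the store is needed.
    ("store e, R4", True),
]


def memory_accesses(program):
    """Count the instructions that access main memory."""
    return sum(1 for _text, touches_memory in program if touches_memory)


acc_count = memory_accesses(accumulator_program)
ls_count = memory_accesses(load_store_program)
print(acc_count, ls_count)
```

Under these assumptions the accumulator-based sequence makes seven memory accesses and the load-store sequence makes five, because the intermediate sum “b+c” is retained in register R4.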
Another type of CPU architecture combines the architectural features of the accumulator-based and load-store CPUs. FIG. 4 shows a first such architecture, a load-store CPU with a fixed accumulator. CPU 400 includes both an accumulator 402 and a register file 404. Values are loaded from main memory to the accumulator, stored into main memory from the accumulator, and moved back and forth between the accumulator and the register file. The accumulator also provides one operand and serves as the destination address for instructions. Thus, the register file essentially provides a “local memory” for the accumulator.
Following is an exemplary sequence of instructions that executes the function “a=b+c+d” in the accumulator-based load-store architecture of FIG. 4.
(11) load b      // Acc ← b
(12) movea R1    // R1 ← Acc
(13) load c      // Acc ← c
(14) movea R2    // R2 ← Acc
(15) load d      // Acc ← d
(16) add R2      // Acc ← Acc + R2
(17) add R1      // Acc ← Acc + R1
(18) store a     // a ← Acc
In step (11), the value at address “b” is stored in the accumulator. In step (12), the value in the accumulator is stored in register R1. In step (13), the value at address “c” is stored in the accumulator. In step (14), the value in the accumulator is stored in register R2. In step (15), the value at address “d” is stored in the accumulator. In step (16), the value in register R2 is added to the accumulator. In step (17), the value in register R1 is added to the accumulator. In step (18), the value in the accumulator is stored in address “a” of the main memory.
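Steps (11)–(18) can be modeled in the same way. The sketch below is illustrative only: the function name run_fixed_accumulator, the tuple encoding, the register-file size, and the sample memory values are assumptions rather than details of FIG. 4.

```python
def run_fixed_accumulator(program, memory, num_regs=4):
    """Run a hypothetical load-store program with one fixed accumulator.

    Returns the accumulator, the register file, and the memory-access count.
    """
    acc = 0
    regs = [0] * num_regs
    accesses = 0
    for op, operand in program:
        if op == "load":        # Acc <- memory[operand]
            acc = memory[operand]
            accesses += 1
        elif op == "store":     # memory[operand] <- Acc
            memory[operand] = acc
            accesses += 1
        elif op == "movea":     # R[operand] <- Acc
            regs[operand] = acc
        elif op == "add":       # Acc <- Acc + R[operand]
            acc += regs[operand]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return acc, regs, accesses


# Sample values are assumptions chosen for the illustration.
fa_memory = {"a": 0, "b": 2, "c": 3, "d": 4}
fa_program = [
    ("load", "b"), ("movea", 1),
    ("load", "c"), ("movea", 2),
    ("load", "d"),
    ("add", 2), ("add", 1),
    ("store", "a"),
]
fa_acc, fa_regs, fa_accesses = run_fixed_accumulator(fa_program, fa_memory)
print(fa_memory["a"], fa_accesses, fa_regs)
```

Each instruction names a single operand, yet the copies of “b” and “c” remain in R1 and R2 after the sequence completes. The accumulator itself is overwritten by every load and add.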
The accumulator-based load-store CPU of FIG. 4 has the advantage that small instruction sizes can be used, because only one operand is required, as in the accumulator-based CPU of FIG. 2. However, any operation performed changes the value in the accumulator. This makes it difficult for a compiler to optimize the code.
FIG. 5 shows another CPU architecture that more successfully combines the virtues of the accumulator-based and load-store architectures, a load-store CPU with a moveable accumulator. CPU 500 includes a register file 504 in which any one of the registers can act as an accumulator. An accumulator pointer 505 selects one of the registers in register file 504 and designates that register as the accumulator. The value of the accumulator pointer can be changed using a “set” instruction. By setting the location of the accumulator prior to executing another instruction, operations can be performed in any register in the register file, and the results can be left in the register file for later use, minimizing accesses to main memory.
For example, the following pseudo-code implements the function “a=b+c+d” in the accumulator-based load-store architecture of FIG. 5.
(19) set 1      // Acc = R1
(20) load b     // R1 ← b
(21) set 2      // Acc = R2
(22) load c     // R2 ← c
(23) set 3      // Acc = R3
(24) load d     // R3 ← d
(25) add R2     // R3 ← R3 + R2
(26) add R1     // R3 ← R3 + R1
(27) store a    // a ← R3
In step (19), register R1 of the register file is selected to act as the accumulator. In step (20), the value at address “b” is stored in register R1. In step (21), register R2 of the register file is selected to act as the accumulator. In step (22), the value at address “c” is stored in register R2. In step (23), register R3 of the register file is selected to act as the accumulator. In step (24), the value at address “d” is stored in register R3. In step (25), the value in register R2 is added to the value stored in register R3. In step (26), the value in register R1 is added to the value stored in register R3. In step (27), the value in register R3 is stored in address “a” of the main memory.
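The moveable-accumulator sequence of steps (19)–(27) can be sketched by adding an accumulator pointer to the model. This is a minimal sketch under stated assumptions: the function name run_moveable_accumulator, the tuple encoding, the register-file size, the initial pointer value of 0, and the sample memory values are all hypothetical.

```python
def run_moveable_accumulator(program, memory, num_regs=4):
    """Run a hypothetical program on a register file with an accumulator pointer.

    Returns the register file, the final pointer value, and the
    number of main-memory accesses.
    """
    regs = [0] * num_regs
    acc_ptr = 0                 # index of the register acting as accumulator
    accesses = 0
    for op, operand in program:
        if op == "set":         # designate R[operand] as the accumulator
            acc_ptr = operand
        elif op == "load":      # R[acc_ptr] <- memory[operand]
            regs[acc_ptr] = memory[operand]
            accesses += 1
        elif op == "add":       # R[acc_ptr] <- R[acc_ptr] + R[operand]
            regs[acc_ptr] += regs[operand]
        elif op == "store":     # memory[operand] <- R[acc_ptr]
            memory[operand] = regs[acc_ptr]
            accesses += 1
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs, acc_ptr, accesses


# Sample values are assumptions chosen for the illustration.
ma_memory = {"a": 0, "b": 2, "c": 3, "d": 4}
ma_program = [
    ("set", 1), ("load", "b"),
    ("set", 2), ("load", "c"),
    ("set", 3), ("load", "d"),
    ("add", 2), ("add", 1),
    ("store", "a"),
]
ma_regs, ma_ptr, ma_accesses = run_moveable_accumulator(ma_program, ma_memory)
print(ma_memory["a"], ma_accesses, ma_regs)
```

Every instruction still carries one operand, but the loads go directly into R1, R2, and R3, so no separate move instructions are needed and the operands stay available in the register file.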
As described above, the accumulator-based load-store CPU architecture shown in FIG. 5 successfully combines the advantages of accumulator-based and load-store architectures. Only a single operand is included in each instruction, so the instruction size can be small. However, the moveable accumulator permits a compiler to retain the operands of previous instructions in the register file, which can significantly reduce the number of memory accesses.
The use of programmable logic devices (PLDs) to implement CPUs is increasing rapidly. PLDs are now available that include dedicated on-board CPUs, such as the Virtex®-II Pro family of field programmable gate arrays (FPGAs) from Xilinx, Inc. However, some PLD users prefer to implement “soft processors” in their PLDs, i.e., microprocessors built from the fabric of programmable logic blocks traditionally included in PLDs, and configured using a configuration bitstream. Because a “soft” PLD implementation generally uses more silicon area than a processor designed using dedicated transistors (a “hard” processor), these soft processors preferably have a small instruction size.
Therefore, it is desirable to provide a PLD implementation of an accumulator-based load-store CPU architecture that promotes the efficient use of PLD resources and the rapid execution of CPU instructions.