1. Field Of The Invention
The present invention relates generally to a memory access scheme for computer systems, and more particularly to a memory access scheme which is achieved using a memory address register and a register-indirect memory accessing mode.
2. Related Art
In many conventional computer systems, memory address generation is performed as part of memory access instructions, such as memory load and store instructions. An example of a conventional computer system with multiple functional units having simultaneous multiple operations is shown in FIG. 1. Specifically, FIG. 1 illustrates a computer subsystem 102 where memory address generation is performed as part of memory access instructions.
The computer subsystem 102 contains a processing unit 104 and a memory unit 134. The processing unit 104 contains multiple general purpose registers (GPRs) 106, each of which is connected to an adder 108 for address generation and to an arithmetic-logic unit (ALU) 110. An output 126 from the ALU 110 can be transferred (or written back) to the GPRs 106 via line 142. Addresses generated by the adder 108 are transferred to a memory 136 contained in the memory unit 134. Data is transferred between the GPRs 106 and the memory 136 via the ALU 110 side of the processing unit 104. That is, data is transferred from the GPRs 106 to the memory 136 via the ALU 110. Data is transferred from the memory 136 to the GPRs 106 via a bus 112, an ALU out register 138 (which is associated with the ALU 110), and the line 142.
FIG. 2 illustrates a pipelined timing diagram of the computer subsystem 102 where a memory load instruction 202 starts at time t.sub.0 and a register-to-register instruction 212 starts at time t.sub.1.
The memory load instruction 202 operates as follows. After an instruction fetch cycle 204, a memory address is calculated in the adder 108 during a decode/address generation cycle 206. The memory address is transferred to the memory 136. During a memory access cycle 208, the memory 136 retrieves data (according to the memory address) and places the data in the ALU out register 138. During a write back cycle 210, the data in the ALU out register 138 is written back to the GPRs 106 via the line 142.
The register-to-register instruction 212 operates as follows. After an instruction fetch cycle 214, data from the GPRs 106 is manipulated in the ALU 110 to produce an arithmetic/logic result during a decode/execute cycle 216. During a write back cycle 218, the arithmetic/logic result is written back to the GPRs 106 via the line 142.
Because the memory load instruction 202 contains an additional memory access cycle 208, an uneven pipeline is created. That is, the memory load instruction 202 (and memory access instructions in general) takes one more cycle to execute than the register-to-register instruction 212. As a result of the uneven pipeline, both the memory load instruction 202 and the register-to-register instruction 212 attempt to write back to the GPRs 106 via the line 142 during the same cycle (that is, the cycle beginning at t.sub.3). Therefore, "write back" collisions occur in computer systems where memory address generation is performed as part of the memory access instructions. Such write back collisions degrade processor performance.
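The collision described above can be reproduced with a minimal sketch in Python. The stage lists follow the description of FIG. 2; the function names and the convention that cycle index k begins at t.sub.k are assumptions made for illustration only and do not describe any actual hardware.

```python
# Stage sequences as described for FIG. 2 (one stage per cycle).
LOAD_STAGES = ["fetch", "decode/address generation", "memory access", "write back"]
REG_STAGES = ["fetch", "decode/execute", "write back"]


def write_back_cycle(start, stages):
    """Return the cycle index in which an instruction issued at `start` writes back."""
    return start + stages.index("write back")


def find_collisions(program):
    """Return the write-back cycles that are claimed by more than one instruction."""
    by_cycle = {}
    for name, start, stages in program:
        by_cycle.setdefault(write_back_cycle(start, stages), []).append(name)
    return {cycle: names for cycle, names in by_cycle.items() if len(names) > 1}


# Memory load 202 starts at t0; register-to-register 212 starts at t1.
program = [
    ("memory load 202", 0, LOAD_STAGES),
    ("register-to-register 212", 1, REG_STAGES),
]
collisions = find_collisions(program)
print(collisions)
```

Under these assumptions, both instructions claim the single write-back path in the cycle beginning at t.sub.3, which is the collision the text describes.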
In a first prior solution to the write back collision problem, register instructions are extended by 1 cycle. This is illustrated by the timing diagram in FIG. 3, wherein a register-to-register instruction 302 contains a wait cycle 304, in addition to the instruction fetch cycle 214, the decode/execute cycle 216, and the write back cycle 218. The addition of the wait cycle 304 eliminates write back collisions by eliminating the uneven pipeline. However, the addition of wait cycles 304 increases pipeline penalties. Consequently, this solution is flawed because it decreases system performance.
In a second prior solution to the write back collision problem, memory access instructions are executed as 2 cycle instructions. This is illustrated by the timing diagram in FIG. 4, where the start of a register-to-register instruction 402 is delayed by one cycle to t.sub.2. While eliminating write back collisions, this solution causes the memory access instructions to effectively execute in 2 cycles, rather than 1 cycle. Thus, this solution is flawed because it decreases system performance.
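The cost of the first and second prior solutions can be illustrated with a short Python sketch. The stage lists are taken from the descriptions of FIGS. 2 through 4; the placement of the wait cycle immediately before write back is an assumption (the text does not state its position), and all names are hypothetical.

```python
def write_back_cycle(start, stages):
    """Return the cycle index in which an instruction issued at `start` writes back."""
    return start + stages.index("write back")


LOAD = ["fetch", "decode/address generation", "memory access", "write back"]
REG = ["fetch", "decode/execute", "write back"]
# Assumed position of the wait cycle 304; the source does not specify it.
REG_WITH_WAIT = ["fetch", "decode/execute", "wait", "write back"]

load_wb = write_back_cycle(0, LOAD)                # memory load issued at t0
unmodified_wb = write_back_cycle(1, REG)           # FIG. 2: collides with the load
first_fix_wb = write_back_cycle(1, REG_WITH_WAIT)  # FIG. 3: wait cycle added
second_fix_wb = write_back_cycle(2, REG)           # FIG. 4: start delayed to t2

for label, wb in [("unmodified", unmodified_wb),
                  ("first solution", first_fix_wb),
                  ("second solution", second_fix_wb)]:
    print(label, "collides" if wb == load_wb else "no collision",
          "delay:", wb - unmodified_wb)
```

In this sketch, both solutions avoid the collision only by completing the register-to-register instruction one cycle later than it would otherwise finish, which is the performance loss the text identifies.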
In a third prior solution to the write back collision problem, an additional GPR write port is added for use during write back cycles. Referring to FIG. 1, an additional GPR write port is added to facilitate a separate path between the GPR 106 and the memory 136. The separate path, working in conjunction with the existing path formed by line 142, eliminates write back collisions. However, this solution significantly increases the hardware costs in computer systems with multiple functional units. Specifically, if a system has N functional units and W write ports per functional unit, then W*N.sup.2 input ports to the GPR 106 are required (if the GPR 106 is partitioned and replicated). Thus, this solution is flawed because it increases the cost of computer systems.
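The port count stated above can be tabulated with a short Python sketch. The function name and the sample values of N and W are hypothetical. The breakdown in the comment (N register file copies, each accepting W writes from each of N functional units) is one reading of "partitioned and replicated", not something the source spells out.

```python
def gpr_input_ports(n_units, writes_per_unit):
    """Input ports required for a partitioned and replicated GPR: W * N**2.

    Assumed reading: each of the N replicated copies must accept
    W write ports from each of the N functional units.
    """
    return writes_per_unit * n_units ** 2


# Hypothetical sizes, chosen only to show the quadratic growth in N.
for n in (2, 4, 8):
    before = gpr_input_ports(n, 1)
    after = gpr_input_ports(n, 2)  # one extra write port per functional unit
    print(f"N={n}: {before} ports -> {after} ports (+{after - before})")
```

Under this formula, adding one write port per functional unit adds N.sup.2 input ports, so the hardware cost of the third solution grows quadratically with the number of functional units.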
In a fourth prior solution to the write back collision problem, memory address generation is performed in separate instructions, apart from memory access instructions. Performing memory address generation in separate instructions does not have an adverse effect on system performance in computer systems with multiple functional units. The fourth solution is illustrated by the timing diagram in FIG. 5, where an address generation instruction 502 begins at t.sub.0, a memory load instruction 510 begins at t.sub.1, and a register-to-register instruction 518 begins at t.sub.2.
The operation of the address generation instruction 502 and the register-to-register instruction 518 is similar to the operation of the register-to-register instruction 212. In particular, the address generation instruction 502, after generating an address during a decode/execute cycle 506, writes back the address to the GPR 106 during a write back cycle 508.
The operation of the memory load instruction 510 is similar to the operation of the memory load instruction 202. Unlike the memory load instruction 202, however, the memory load instruction 510 does not perform address generation. Instead, the memory load instruction 510 accesses the GPR 106 for the address generated by the address generation instruction 502. The memory load instruction 510 then uses the address to access the memory 136. The memory load instruction performs these two operations during a decode/memory access (GPR) cycle 514 (the "(GPR)" indicates that the address comes from the GPR 106).
Because address generation is not performed in memory access instructions 510, the fourth solution eliminates uneven pipelines and thus eliminates write back collisions.
However, the fourth solution is inefficient because the path length associated with the decode/memory access (GPR) cycle 514 is much longer than the path length associated with the decode/execute cycle 522. In other words, the fourth solution creates an unbalanced pipeline partition because some cycles take significantly longer to execute than other cycles. Thus, the fourth solution is flawed because it results in a much longer cycle time (since cycle time is determined by the longest pipeline path).
Specifically, the decode/memory access (GPR) cycle 514 involves (1) processing time to access the GPR 106 to retrieve the address generated during the address generation instruction 502, (2) propagation delay to send the address from the GPR 106 to the memory 136, (3) memory read latency delay associated with reading the memory 136, and (4) propagation delay to send data from the memory 136 to the ALU out register 138.
In contrast, the decode/execute cycle 522 involves only (1) processing time to access the GPR 106 to retrieve data for the register-to-register instruction 518, and (2) processing time of the ALU 110.
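The effect of this imbalance on cycle time can be shown with a small Python sketch. Every delay value below is a hypothetical placeholder in arbitrary units, chosen only to illustrate that the cycle time is set by the longest stage; none of the values come from the source.

```python
# Hypothetical component delays (arbitrary units, not taken from the source).
GPR_ACCESS = 1.0
ADDRESS_PROPAGATION = 1.0   # GPR 106 to memory 136
MEMORY_READ_LATENCY = 3.0   # reading memory 136
DATA_PROPAGATION = 1.0      # memory 136 to ALU out register 138
ALU_PROCESSING = 2.0

# Path length of each cycle, summed from the components listed in the text.
stage_delay = {
    "decode/memory access (GPR) 514": (
        GPR_ACCESS + ADDRESS_PROPAGATION + MEMORY_READ_LATENCY + DATA_PROPAGATION
    ),
    "decode/execute 522": GPR_ACCESS + ALU_PROCESSING,
}

# The cycle time must accommodate the longest pipeline path.
cycle_time = max(stage_delay.values())

# Idle time per cycle in each stage once the clock is set by the longest path.
slack = {name: cycle_time - delay for name, delay in stage_delay.items()}

print("cycle time:", cycle_time)
print("slack:", slack)
```

With these placeholder values, the decode/execute cycle 522 sits idle for half of every clock period, which illustrates why the unbalanced partition lengthens the cycle time for all instructions.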
The processing time disparity between the decode/memory access (GPR) cycle 514 and the decode/execute cycle 522 is greater if the processing unit 104 and the memory unit 134 are located on separate chips or circuit boards. The processing unit 104 and the memory unit 134 are likely to be located on separate chips or circuit boards if the processing unit 104 contains multiple functional units.
An uneven pipeline creates additional problems for very long instruction word (VLIW) machines. A VLIW machine contains multiple functional units. The functional units execute multiple operations in parallel and in lock step according to very long instruction words. Since all pipeline interlocks and parallelism extraction are managed by a compiler, an uneven or long pipeline presented to the compiler would significantly complicate the compiler's task.
Therefore, a memory access scheme is required which achieves a uniform pipeline without increasing cycle time.