1. Field of the Invention
The present invention relates to a digital signal processor (DSP), and more specifically to an architecture that avoids problems resulting from memory latency.
2. Discussion of the Related Art
FIG. 1 schematically and partially shows a conventional DSP architecture. The DSP includes four processing units operating in parallel. Two of these units are memory access units (MEMU) 10. An arithmetic and logic unit (ALU) 12 and a branch unit (BRU) 14 are further provided. Each of the MEMU units is associated with a memory 16 via an independent bus.
Branch unit BRU receives from an instruction memory, not shown, a compound instruction INST which can include four elementary instructions to be provided to each of the units in parallel. Unit BRU retrieves the instruction meant for it and distributes in parallel the three remaining instructions I1, I2, and I3 to the ALU and MEMU units.
Each of the ALU and MEMU units generally includes an instruction queue 18, in the form of a FIFO, in which the instructions wait before they are processed by the corresponding unit.
A DSP of the type of FIG. 1 is optimized to perform vectorial operations of the type X[i] OP Y[j], where i and j vary, generally in a loop, and where OP designates any operation to be performed by arithmetic unit 12. Indeed, operands X[i] and Y[j] can be fetched together via the two buses of memory 16 and processed, in theory, in the same cycle by ALU 12.
In practice, difficulties arise due to the structure of currently used memories, generally SRAMs. Although a memory access can be performed at each cycle, the reading of data from a conventional SRAM generally has a latency of two cycles. Indeed, upon execution of a read instruction, the address is presented to the memory. An additional cycle is required to provide the memory with a read access signal, and a last cycle is required for the memory to present the data over its data bus.
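The read timing just described can be sketched with a toy model (a hypothetical illustration only, not the actual SRAM circuit): a read presents its address at one cycle, and the corresponding data only appears on the data bus two cycles later.

```python
class LatentSRAM:
    """Toy model of the two-cycle read latency described above.
    A read presents its address at cycle t; the data appears on the
    data bus at cycle t + 2. Names are hypothetical illustrations."""

    LATENCY = 2  # cycles between address presentation and data availability

    def __init__(self, contents):
        self.contents = list(contents)
        self.pending = {}  # cycle at which data becomes available -> value

    def read(self, addr, cycle):
        # The address is presented now; the data arrives LATENCY cycles later.
        self.pending[cycle + self.LATENCY] = self.contents[addr]

    def data_bus(self, cycle):
        # Returns the datum present on the data bus at this cycle, if any.
        return self.pending.pop(cycle, None)
```

In this sketch, a read issued at cycle 1 yields nothing on the bus at cycle 2 and only delivers its datum at cycle 3, which is the load-use delay the loop examples below run into.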
To illustrate the resulting difficulties, a common loop, whose function is to increment by a constant successive values stored in the memory, will be considered hereafter as an example. This loop may be written directly as:
LD: R1=[i]
OP: R1=R1+R2
ST: [i]=R1
BR: test i, i++, loop    (1)
This loop, for clarity, uses a single MEMU unit. It consists of loading (LD) into a register R1 the value stored in the memory at address i, incrementing (OP) the content of register R1 by the value contained in a register R2, storing (ST) at address i the new content of register R1, and finally incrementing and testing address i to resume the loop (BR). The loop will be left when branch unit BRU detects that address i has reached a predetermined value. In a DSP, there are generally no BR-type instructions. The loops are programmed in advance by setting registers provided for this purpose in unit BRU, which performs the tests, increments, and branches independently.
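For illustration, loop (1) is functionally equivalent to the following plain-language sketch (the function and parameter names are hypothetical, introduced only for this example):

```python
def increment_loop(memory, constant, start, end):
    """Functional sketch of loop (1): add a constant (register R2)
    to each value stored at successive memory addresses."""
    i = start
    while i < end:
        r1 = memory[i]       # LD: R1 = [i]
        r1 = r1 + constant   # OP: R1 = R1 + R2
        memory[i] = r1       # ST: [i] = R1
        i += 1               # BR: test i, i++, loop
    return memory
```

The point of the cycle-by-cycle discussion that follows is that, although this sketch reads naturally, the dependency of OP on LD and of ST on OP interacts badly with the two-cycle read latency.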
Register R1 is a work register of the ALU, while address i is stored in a register of branch unit BRU. Operations LD and ST are operations to be performed by one of units MEMU, operation OP is to be performed by unit ALU, and operation BR is to be performed by unit BRU. Operations LD and OP will be provided in parallel to units MEMU and ALU in a same compound instruction, while operations ST and BR will be provided in parallel to units MEMU and BRU in a second compound instruction.
In fact, some compound instructions include fields which are provided to several units at a time. For example, a load instruction LD, meant for a unit MEMU, also includes a field meant for unit ALU to prepare one of its registers (R1) to receive the data which will be presented by the memory. Similarly, a store instruction ST includes a field meant for unit ALU to select a register, the content of which is presented over the memory bus. Thus, as shown in FIG. 1, a field f of each of instructions I2 and I3 provided to units MEMU is provided to the instruction queue 18 of unit ALU in parallel with a normal instruction I1, and unit ALU is able to perform in one cycle a normal instruction and an operation indicated by a field f.
The following table illustrates, for several iterations of the loop, the operations performed by one memory access unit MEMU and by arithmetic unit ALU. The branch instructions BR raise no difficulty and, for clarity, are not shown in the table.
Each row in the table corresponds to an instruction cycle and each operation marked in the table is assigned with a number corresponding to the loop iteration.
At the first cycle, units MEMU and ALU receive the first instructions LD and OP (LD1, OP1). Unit MEMU immediately executes instruction LD1, which starts the read cycle of the value stored at address i in the memory. Instruction LD1 is deleted from the instruction queue of unit MEMU. Instruction OP1, which needs the value fetched by instruction LD1, cannot be executed yet. This instruction OP1 waits in the instruction queue of unit ALU.
At the second cycle, unit MEMU receives the first instruction ST (ST1). Instruction ST1, which needs the result of operation OP1, cannot be executed yet and waits in the instruction queue of unit MEMU. Instruction OP1 still waits in the queue of unit ALU, since the memory has not yet sent back the operand that it requires.
At the third cycle, units MEMU and ALU receive instructions LD2 and OP2. These instructions are put in the queues after the still unexecuted instructions ST1 and OP1. The memory finally sends back the operand required by instruction OP1. This instruction OP1 is then executed and deleted from the instruction queue.
At cycle 4, unit MEMU receives instruction ST2. Instruction ST2 is put to wait in the queue of unit MEMU after instruction LD2. Since instruction OP1 was executed at the previous cycle, its result is available. Instruction ST1 can thus be executed and deleted from the queue. Although instruction OP2 is alone in the queue of unit ALU, this instruction cannot be executed yet since it requires an operand which will be fetched by the execution of instruction LD2.
At cycle 5, units MEMU and ALU receive instructions LD3 and OP3. Instruction LD2 is executed and deleted from the queue.
Instruction OP2 must still wait in the queue, since it requires an operand which will be sent back two cycles later by the memory in response to instruction LD2.
From the fifth cycle on, the execution of the instructions of the second iteration of the loop proceeds as for the first iteration starting at cycle 1.
As shown by the table, although the processor is capable of performing one memory access at each cycle, it performs a memory access in only two cycles out of four; that is, the loop executes with only 50% efficiency.
Further, upon each new iteration of the loop, the instruction queue of unit MEMU fills with additional instructions and ends up overflowing. To avoid the overflow, the provision of instructions must be regularly stopped to enable the queues to empty. This considerably decreases the efficiency.
In fact, the loop programming in its straightforward form is not at all optimal due to the memory latency.
To improve the efficiency, taking account of the memory latency, a so-called loop unrolling technique is often used. This technique consists of programming a macroloop, each iteration of which corresponds to several iterations of the normal loop. Thus, the preceding loop (1) is written, for example, as:
LDa: R1=[i]
OPa: R1=R1+R2
LDb: R3=[i+1]
OPb: R3=R3+R2
STa: [i]=R1
STb: [i+1]=R3
BR: test i, i+=2, loop    (2)
In this loop, the value contained at address i is loaded (LDa) into register R1, the content of register R1 is incremented (OPa) by the content of register R2, the value contained at address i+1 is loaded (LDb) into a register R3, the content of register R3 is incremented (OPb) by the value contained in register R2, the content of register R1 is stored (STa) at address i, the content of register R3 is stored (STb) at address i+1, and variable i is incremented by 2 to restart the loop.
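As a sketch, the unrolled loop (2) computes the same result as loop (1), but handles two elements per iteration (names are hypothetical; an even number of elements is assumed for simplicity):

```python
def increment_loop_unrolled(memory, constant, start, end):
    """Functional sketch of unrolled loop (2): two loads, two adds and
    two stores per iteration, so the second load can be issued while the
    first one is still in flight. Assumes (end - start) is even."""
    i = start
    while i < end:
        r1 = memory[i]           # LDa: R1 = [i]
        r1 = r1 + constant       # OPa: R1 = R1 + R2
        r3 = memory[i + 1]       # LDb: R3 = [i+1]
        r3 = r3 + constant       # OPb: R3 = R3 + R2
        memory[i] = r1           # STa: [i]   = R1
        memory[i + 1] = r3       # STb: [i+1] = R3
        i += 2                   # BR: test i, i += 2, loop
    return memory
```

The functional result is unchanged; what unrolling buys, as the cycle trace below shows, is that independent work is available to overlap with the memory latency.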
This loop is programmed in four compound instructions. The first is formed of instructions LDa and OPa, the second of instructions LDb and OPb, the third of instruction STa alone, and the fourth of instructions STb and BR. The following table illustrates the sequence of operations for several loop iterations.
At the first cycle, units MEMU and ALU receive the first instructions LDa and OPa (LDa1, OPa1). Instruction LDa1 is immediately executed and deleted from the queue.
At the second cycle, units MEMU and ALU receive instructions LDb1 and OPb1. Instruction LDb1 is immediately executed and deleted from the queue. Instructions OPa1 and OPb1 remain in the queue of unit ALU waiting for the corresponding operands that the memory has to send back in response to instructions LDa1 and LDb1.
At the third cycle, unit MEMU receives instruction STa1. The memory sends back the operand asked for by instruction LDa1 and required by instruction OPa1. Instruction OPa1 can thus be executed and deleted from the queue.
At the fourth cycle, unit MEMU receives instruction STb1. The memory sends back the operand asked for by instruction LDb1 and required by instruction OPb1. Instruction OPb1 can thus be executed.
At the fifth cycle, units MEMU and ALU receive instructions LDa2 and OPa2. Instruction STa1 can be executed since the value that it requires has been calculated by instruction OPa1 two cycles before.
At the sixth cycle, units MEMU and ALU receive instructions LDb2 and OPb2. Instruction STb1 is executed since the value that it requires has been calculated by instruction OPb1 two cycles before.
At the seventh cycle, unit MEMU receives instruction STa2. A new iteration of the loop is started by the execution of instruction LDa2.
As this table shows, the processor performs four memory accesses every six cycles, which amounts to a 66% efficiency and a 33% gain with respect to the preceding solution.
The queue of unit MEMU nevertheless still fills up progressively, requiring the supply of instructions to be stopped regularly. It now gains two instructions every six cycles instead of two instructions every four cycles, as was the case for the previous solution.
The loop unrolling technique, although it substantially improves the efficiency, is not an optimal solution for superscalar processors; in fact, it works much better on scalar processors.
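The two efficiency figures above are consistent with a simple back-of-the-envelope model (an illustration introduced here, not taken from the source): with one MEMU, each macro-iteration issues two memory accesses per original iteration and then stalls for a number of cycles equal to the read latency while waiting for the last load's data.

```python
def loop_efficiency(unroll, latency=2):
    """Fraction of cycles in which the memory bus is used, under the
    simplifying assumptions stated above: 2 * unroll accesses per
    macro-iteration, plus `latency` stall cycles per macro-iteration."""
    accesses = 2 * unroll        # one load and one store per original iteration
    cycles = accesses + latency  # steady-state length of one macro-iteration
    return accesses / cycles
```

This reproduces the figures in the text: the straightforward loop (unroll factor 1) reaches 2/4 = 50%, and the unrolled loop (factor 2) reaches 4/6 ≈ 66%. The model also suggests why unrolling alone never reaches 100%: the latency term never disappears.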
An object of the present invention is to provide a superscalar processor architecture having a maximum efficiency for the execution of loops including memory access instructions.
This object as well as others is achieved by means of a processor including at least one memory access unit for presenting a read or write address over an address bus of a memory in response to the execution of a read or write instruction; and an arithmetic and logic unit operating in parallel with the memory access unit and arranged at least to present data on the data bus of the memory while the memory access unit presents a write address. The processor includes a write address queue in which is stored each write address provided by the memory access unit waiting for the availability of the data to be written.
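A minimal sketch of the decoupled store path just described, under assumed queue semantics (all names are hypothetical): the memory access unit queues the write address immediately, and the store completes to memory only once the arithmetic unit delivers the corresponding datum.

```python
from collections import deque

class StoreUnit:
    """Sketch of a write address queue paired with a store data queue.
    A store is issued to memory as soon as an address and a datum are
    both available, so the MEMU never stalls waiting for the ALU."""

    def __init__(self, memory):
        self.memory = memory
        self.write_addresses = deque()  # filled by the memory access unit
        self.store_data = deque()       # filled by the arithmetic and logic unit

    def push_address(self, addr):
        # MEMU side: the write address waits here for its datum.
        self.write_addresses.append(addr)
        self._try_store()

    def push_data(self, value):
        # ALU side: the computed datum waits here for a write address.
        self.store_data.append(value)
        self._try_store()

    def _try_store(self):
        # Complete every store whose address and datum are both present.
        while self.write_addresses and self.store_data:
            self.memory[self.write_addresses.popleft()] = self.store_data.popleft()
```

In this sketch a store instruction retires from the MEMU's point of view as soon as its address is queued, which is the property that lets the MEMU keep issuing one memory access per cycle.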
According to an embodiment of the present invention, the arithmetic and logic unit includes two independent instruction queues intended for receiving instructions waiting for execution, a first of the instruction queues being intended for receiving logic and arithmetic instructions, and the second instruction queue being intended for receiving instruction fields provided to the memory access unit to identify registers of the arithmetic and logic unit which are involved in read or write operations.
According to an embodiment of the present invention, the arithmetic and logic unit includes a store data queue in which each datum to be written in the memory waits for the presence of a write address in the write address queue.
According to an embodiment of the present invention, the arithmetic and logic unit includes a load data queue in which is written each datum from the memory for the arithmetic and logic unit, waiting for the arithmetic and logic unit to be available.
According to an embodiment of the present invention, the processor includes a branch unit for receiving instructions and distributing them in parallel between itself, the memory access unit and the arithmetic and logic unit.
According to an embodiment of the present invention, each of the units includes a store data queue in which each datum to be written in the memory waits for the presence of a write address in the write address queue.
According to an embodiment of the present invention, each of the units includes a load data queue in which is written each datum from the memory for the unit, waiting for the unit to be available.
The foregoing objects, features and advantages of the present invention will be discussed in detail in the following non-limiting description of specific embodiments in connection with the accompanying drawings.