This application claims priority from European patent application number 00108699.0, filed Apr. 20, 2000, which is hereby incorporated herein by reference in its entirety.
The present invention relates to improvement of storage devices in computer systems and in particular, it relates to an improved method and system for efficiently accessing multi-port cell array circuitry.
In modern computer processor architecture development, an increasing portion of processor work continues to be parallelized. With parallelization, an increasing number of processing sub-units must be enabled to access one and the same storage location in order to compute as quickly as possible. Thus, such a storage location requires multiple read/write accessibility.
An example is out-of-order processing. Writing data into arrays of such storage locations in parallel from multiple sources, or reading data from arrays in parallel to multiple targets then requires multi-port cells.
The area and performance of such an array are mainly determined by the number of ports per cell and not by the data size to be stored. More precisely, the area consumption of such an array is nearly proportional to the square of the number of ports implemented.
As one storage cell needs m read ports in order to be readable concurrently by m different reading targets, and n write ports for n write sources to write into the cell, and as each port comprises a pair of a respective data line and select line orthogonal to each other, the area consumption increases remarkably with increasing m or n. For example, when a given array with m=n=1, i.e., two ports per cell, has an area consumption of X, and the array is to be replaced by a multiple access array with m=n=4, i.e., 8 ports per cell, then the resulting area consumption is about (8×8)/(2×2)=16 times higher, i.e., about 16X. Thus, increasing parallelization requires a large additional area consumption on any processor chip.
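The scaling argument above can be checked with a short sketch. It assumes, as stated in the text, that area grows roughly with the square of the total number of ports per cell (m read ports plus n write ports). The function name relative_area and its parameters are hypothetical and serve illustration only.

```python
def relative_area(m_read: int, n_write: int, base_ports: int = 2) -> float:
    """Approximate cell-array area relative to a cell with base_ports ports.

    Assumption: area is proportional to (number of ports) squared.
    """
    ports = m_read + n_write
    return (ports * ports) / (base_ports * base_ports)


print(relative_area(1, 1))  # 1.0  (the two-port reference case, area X)
print(relative_area(4, 4))  # 16.0 (eight ports, about 16 times X)
```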
Although the present invention has a broad field of application as improving or optimizing storage strategies is a very general purpose in computer technology, it will be described and discussed with prior art technology in a special field of application, namely in context of utilizing a so-called instruction window buffer, further abbreviated as IWB, which is usually present in most modern computer systems in order to enable a parallel program processing of instructions by a plurality of processing units. Such processors are referred to herein as out-of-order processors.
In many modern out-of-order processors such a buffer is used to hold all the instructions and/or register contents until the calculated results can be committed and removed from the buffer. When results were calculated speculatively beyond the outcome of a branch instruction, they can be rejected once the branch prediction turns out to be wrong, simply by clearing these entries from the buffer and overwriting them with new, correct instructions. This is one prerequisite for out-of-order processing. One main parameter influencing the performance of the processors is the buffer size: a big buffer can contain many more instructions and results and therefore allows more out-of-order processing. One design objective therefore is to have a big buffer. This, however, conflicts with other design requirements such as cycle time, buffer area, etc. When, for example, the buffer is dimensioned too large, the effort required to manage such a large plurality of storage locations decreases the performance of the buffer. Furthermore, increased buffer size implies an increased signal propagation delay. Thus, generally, any improved storage method has to find a good compromise between the parameters buffer size, storage management and therewith storage access speed.
The present invention primarily covers the buffer size and the associated signal propagation delay.
A prior art instruction window buffer as it is disclosed in U.S. Pat. No. 5,923,900, “Circular Buffer With N Sequential Real And Virtual Entry Positions For Selectively Inhibiting N Adjacent Entry Positions Including The Virtual Entry Position”, which is hereby incorporated herein by reference in its entirety, is operated according to the following write/read schemes:
With reference to FIG. 1 (prior art), in order to write a package of instructions as depicted in the upper portion of the figure, for example a package of 4 unresolved instructions uip(0:3), into an array in one cycle during the dispatch process a cell is needed with as many write ports as the maximum package size, i.e., a number of k1=4 in this case.
A write decode block 22 translates the write address in (0:5), via control line 16, into input pointers wsel0 . . . wsel3 (0:3) selecting a block of four entries to be written, namely the array entries i, i+1, i+2, i+3. This is depicted schematically in FIG. 1. The first instruction uip0 is written into cell(i) by activating wsel0 on input port di0, the next instruction uip1 is written into cell(i+1) by activating wsel1 on input port di1, and so on, see the filled circles.
This scheme guarantees that the data is written consecutively into the array. As buffer memories in general are often used in a wrap-around way of operation some special care is required to cover this case, too.
The wrap-around case is handled by the write decoder 22 as well. If, for example, the window buffer has a total size of 64 entries and a block of four subsequent entries is to be written starting at entry 62, then wsel(0:3) point to entries (62,63,0,1).
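The address translation described above, including the wrap-around case, can be illustrated with a minimal sketch. It models only the mapping from a start address to the selected entries, not the circuitry of the write decode block 22. The names write_decode, BUFFER_SIZE and PACKAGE_SIZE are hypothetical.

```python
from typing import List

BUFFER_SIZE = 64   # total number of entries, as in the example above
PACKAGE_SIZE = 4   # maximum package size k1


def write_decode(start: int, size: int = BUFFER_SIZE,
                 k: int = PACKAGE_SIZE) -> List[int]:
    """Return the entry selected by each write select line, in order.

    Element j is the entry written through input port j; the modulo
    operation models the wrap-around at the end of the buffer.
    """
    return [(start + j) % size for j in range(k)]


print(write_decode(10))  # [10, 11, 12, 13]
print(write_decode(62))  # [62, 63, 0, 1]
```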
The read case is similar, as can be seen from FIG. 2, which depicts the prior art issue filters if0 to if3 controlling an array of 4-read-port cells by read select lines rsel0(0 . . . 63), rsel1(0 . . . 63), rsel2(0 . . . 63), rsel3(0 . . . 63). The data is read to several data output ports, i.e., Do(0:3), not explicitly depicted. As many read ports are needed as execution units exist, i.e., instruction execution units (ieu) ieu(0:3), in order to get full parallelism and provide data for all execution units every cycle for the issue process. A routing network can connect each output port of the buffer with each execution unit. An arbitration logic is provided for connecting a particular port with the desired execution unit.
In particular, the instructions ready for execution are identified by valid bits depicted in the upper line of FIG. 2 which are passed to the four different issue filters if(0:3). Filter if0 selects the oldest of all instructions 0 . . . 63 ready for execution, activates rsel0 and thereby sends the data to the execution units. Filter if1 ignores the entry detected by if0 and selects the second oldest, activates rsel1 and sends it to the execution units, and so on.
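The selection performed by the issue filters can be sketched as follows. This is a simplified behavioral model, not the disclosed circuit: it assumes that a lower entry number means an older instruction and ignores wrap-around of the circular buffer. The function name issue_select is hypothetical.

```python
from typing import List, Optional


def issue_select(valid: List[bool], num_filters: int = 4) -> List[Optional[int]]:
    """Return the entry chosen by each issue filter, or None if none is left.

    Assumption: ascending entry number corresponds to descending age,
    so the first valid entry is treated as the oldest.
    """
    ready = [i for i, v in enumerate(valid) if v]
    return [ready[f] if f < len(ready) else None for f in range(num_filters)]


valid = [False] * 64
for entry in (5, 9, 10, 40, 41):
    valid[entry] = True

# The first filter picks entry 5, the second entry 9, and so on.
print(issue_select(valid))  # [5, 9, 10, 40]
```

Since the selected entries may lie anywhere in the buffer, this model also shows why, in the prior art scheme, every cell must be reachable by every read select line.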
Since any entry of the 64 total entries of the buffer can be first, second, third or fourth selected, any entry and therefore any cell needs 4 read ports. This results in an extremely high area consumption and an associated large signal propagation delay.
It is thus an objective of the present invention to decrease area consumption and thus increase the efficiency of storage area utilization.
This objective of the invention is achieved by the features stated in the enclosed independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims.
A considerable amount of area can be saved according to the present invention by reducing the number of write ports to the number k1 of concurrently intended write accesses and the number of output ports to the number k2 of concurrently intended read accesses to the array. This remarkable reduction of ports, and thus an extraordinary associated area saving, can be achieved when the intended ‘natural’ operation of the array can be expected to be performed in particular groups of concurrent accesses. Of course, k1 and k2 can be different, but they can be equal as well, in an inventive implementation of the buffer access circuitry.
The present invention is thus usefully applicable in hardware circuits comprising multiport arrays and multiport registers.
The array accesses are to be performed with concurrent accesses from at most k1, or k2 particular groups, respectively.
A group is defined by a plurality of array locations for which it is ensured that only one read or write access will be necessary at a time. The membership to a group is exclusive: one array entry cannot be a member of multiple groups if the proposed area reduction is to be achieved.
For example, in a buffer comprising n=64 entries numbered from 0 to 63, a first group of entries may comprise the entries 0,4,8 . . . 60, the second group may comprise entries 1,5, . . . 61, the third group 2,6, . . . 62, and the fourth group entries 3,7, . . . 63.
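The grouping in this example corresponds to taking the entry number modulo the number of groups. A minimal sketch follows; the function names group_of and group_members are hypothetical, and the modulo rule is inferred from the example rather than stated as the only possible grouping.

```python
from typing import List

NUM_ENTRIES = 64
NUM_GROUPS = 4


def group_of(entry: int, k: int = NUM_GROUPS) -> int:
    """Group index of an entry under the interleaved grouping of the example."""
    return entry % k


def group_members(group: int, size: int = NUM_ENTRIES,
                  k: int = NUM_GROUPS) -> List[int]:
    """All entries belonging to one group, in ascending order."""
    return list(range(group, size, k))


print(group_members(0)[:3], group_members(0)[-1])  # [0, 4, 8] 60
print(group_members(3)[:2], group_members(3)[-1])  # [3, 7] 63
```

Because each entry has exactly one remainder modulo 4, membership is exclusive, as required above.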
Now, given the knowledge that during operation of the buffer only ‘bundles’ of entries which directly follow one another are written or read at a time, for example a multiple write to entries 23, 24, 25 and 26 at time t=0, or a multiple read of entries 44, 45, 46, 47 at t=1, only one write port and only one read port is needed per entry group as explained above. This is because, according to the present invention, for the reading scheme the read results are aligned to the respective read requesters according to a simple re-wiring scheme, whereas for the writing scheme the data to be written is aligned prior to the array access according to the same or a similar scheme.
Thus, the present invention is based on exploiting the knowledge that in many cases such groups can be identified with some operation analysis, or that these groups are present per se, or that, if they are not present, a restructuring into such groups can optionally be created by involving additional logic, even taking into account some disadvantages which may be caused by the additional logic.
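That a bundle of directly consecutive entries touches each group at most once can be verified with a short sketch. It assumes the interleaved grouping of the 64-entry example (group index equal to entry number modulo 4) and a buffer size that is a multiple of the number of groups. The function name groups_touched is hypothetical.

```python
from typing import List


def groups_touched(start: int, bundle: int = 4, k: int = 4,
                   size: int = 64) -> List[int]:
    """Group index of each entry in a bundle of consecutive entries."""
    return [((start + j) % size) % k for j in range(bundle)]


for start in (23, 44, 62):
    groups = groups_touched(start)
    # No group index repeats, so one port per group is sufficient.
    assert len(set(groups)) == len(groups)
    print(start, groups)
# 23 [3, 0, 1, 2]
# 44 [0, 1, 2, 3]
# 62 [2, 3, 0, 1]
```

The order in which the groups appear depends on the start address, which is why the data has to be aligned before a write and after a read.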
The inventive alignment unit basically comprises a control signal input, a number of k input lines and a number of respective k output lines. Inside, logic is implemented which switches any of the k input lines to any of the k output lines, controlled by the respective control signal. The restriction is, however, that as soon as one input line is associated with a particular output line, the rest of the input/output line associations are consequently determined as well. Thus, the selection of one input/output association determines all remaining associations, which leads to the desired alignment.
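One behavior consistent with this restriction is a cyclic rotation, in which fixing the output line of a single input fixes all others. The following sketch models the alignment unit that way. The rotation itself, the function name align, and the use of the start address modulo k as control signal are assumptions made for illustration, not details taken from the disclosure.

```python
from typing import List, Sequence


def align(inputs: Sequence, shift: int) -> List:
    """Switch input line i to output line (i + shift) % k.

    shift plays the role of the control signal; k is the number of lines.
    """
    k = len(inputs)
    outputs = [None] * k
    for i, value in enumerate(inputs):
        outputs[(i + shift) % k] = value
    return outputs


package = ["uip0", "uip1", "uip2", "uip3"]
start = 23  # first entry of the bundle 23, 24, 25, 26

# Write side: place each instruction on the port of the group it belongs to.
on_group_ports = align(package, start % 4)
print(on_group_ports)  # ['uip1', 'uip2', 'uip3', 'uip0']

# Read side: the opposite rotation restores the requester order.
print(align(on_group_ports, -start % 4))  # ['uip0', 'uip1', 'uip2', 'uip3']
```

In this model uip0, destined for entry 23, appears on the port of group 3, and the remaining three associations follow without further control information.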
Thus, the present invention proposes a new scheme to minimize the number of ports per cell without losing the flexibility to write and read the array in parallel on several, i.e., k1/k2 addresses.
In its general scope, with k independent groups and n different requesters being defined, the present invention keeps a number of n ports for an array macro, but it reduces the required number of ports for a cell in the array from n to the smallest integer which is greater than or equal to n/k.
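The port count per cell stated above is the ceiling of n/k. A small sketch with a few sample values follows; the function name ports_per_cell is hypothetical, and the sample values other than n=k=4 are chosen for illustration only.

```python
def ports_per_cell(n_requesters: int, k_groups: int) -> int:
    """Smallest integer greater than or equal to n/k, using integer arithmetic."""
    return -(-n_requesters // k_groups)


print(ports_per_cell(4, 4))  # 1 (four requesters, four groups)
print(ports_per_cell(8, 4))  # 2
print(ports_per_cell(5, 4))  # 2 (rounded up)
print(ports_per_cell(4, 1))  # 4 (a single group gives no reduction)
```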