Highly parallel computer architectures, such as N-way superscalar, VLIW, and dataflow processors, perform many load operations and/or storage operations simultaneously in every cycle and so require cache memories with very high bandwidth. The VLIW architecture, taken as exemplary, exploits fine-grained parallelism in order to perform many load operations and/or storage operations simultaneously per cycle.
Traditional parallelism in computing systems involves multiple processors. The processors are connected by a communications network and are controlled so that more than one processor is active at the same time. To use a traditional parallel processor on a particular application, a programmer or a sophisticated compiler must break the problem into pieces and set up appropriate communications and controls. In general, this approach is most easily applied to problems which divide naturally into large pieces which have little need to communicate with each other. Many applications, however, cannot be structured this way. As such, traditional parallel processing has not been very effective except for "scientific" numerical methods and other highly structured problems.
Fine-grained parallelism is an extension of instruction-level parallelism in several respects. A compiler discovers machine instructions within a program which may be executed at the same time. These separate instructions are put together into a compound instruction, the "very long instruction word" (VLIW). Each part of a VLIW instruction controls a separate part of the hardware in the computer, such as an arithmetic logic unit ("ALU"), a path to main storage, or a path to a register. In one VLIW machine cycle, these separate resources within a VLIW computer can all be used independently, so several basic machine instructions can be executed simultaneously. This confers the advantage that a task can be completed in fewer machine cycles than is possible on a traditional uniprocessor.
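The packing of independent machine instructions into a compound instruction can be illustrated with a minimal sketch. The sketch is not the compiler of any particular VLIW machine: the `Op` record, the unit labels ("mem", "alu", "fpu"), the register names, and the greedy in-order packing policy are all assumptions chosen for illustration. An operation joins the current word only if its hardware unit is still free and it neither reads nor overwrites a result produced in the same word.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Op:
    """One basic machine instruction (hypothetical representation)."""
    name: str
    unit: str                      # hardware resource the op needs, e.g. "alu"
    dest: Optional[str]            # register written, or None
    srcs: Tuple[str, ...] = ()     # registers read


def conflicts(op: Op, prev: Op) -> bool:
    """True if op cannot share one VLIW word with prev."""
    same_unit = op.unit == prev.unit                  # each unit has one slot
    raw = prev.dest is not None and prev.dest in op.srcs   # reads prev's result
    waw = prev.dest is not None and prev.dest == op.dest   # same destination
    return same_unit or raw or waw


def pack_bundles(ops):
    """Greedily pack ops, in program order, into VLIW words (lists of ops)."""
    bundles, current = [], []
    for op in ops:
        if any(conflicts(op, prev) for prev in current):
            bundles.append(current)   # close the current word
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles


program = [
    Op("load r1", "mem", "r1", ("r9",)),
    Op("add r2", "alu", "r2", ("r3", "r4")),
    Op("fadd f1", "fpu", "f1", ("f2", "f3")),
    Op("add r5", "alu", "r5", ("r1", "r2")),   # needs r1 and r2 from above
]
bundles = pack_bundles(program)
for i, word in enumerate(bundles):
    print(i, [op.name for op in word])
```

Under these assumptions the first three operations share one word and the fourth starts a second word, so four basic instructions complete in two machine cycles rather than four.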
The VLIW technique improves performance on a uniprocessor. As such, it can be viewed as exploiting intra-computer parallelism, which is another term for fine-grained parallelism. In contrast, traditional, or coarse-grained, parallelism can be viewed as inter-computer parallelism. Of course, it is possible to exploit intra-computer and inter-computer parallelism by connecting several VLIW computers as one operating unit.
FIG. 1 is a block diagram depiction of a prior art VLIW architecture, including the associated memory hierarchy. A register file 110 is connected to the functional units 102-108. More particularly, each of the load/storage unit 102, the floating point addition unit 104, the integer ALU 106, and the branch unit 108 has two load lines delivering data from the register file 110 and one storage line for storing data from the functional unit to the register file 110. The register file 110 is referred to as the L0 memory level.
FIG. 1 also includes the L1 memory level, or cache, 112, which is bi-directionally connected to the register file 110 and the L2 memory level 114. The L2 memory level 114 is bi-directionally connected to the L3 memory level 116. The L3 memory level 116 is also bi-directionally connected to the L4 memory level 118. The general principles of the memory hierarchy depicted in FIG. 1 are well known.
Historically, two factors have affected the speed at which a computer operates. The first factor is the speed at which the processor, e.g., its slowest functional unit, can operate upon data that is presently available. The second factor is the speed at which data needed by a functional unit can be obtained. Historically, it has been more difficult to improve the second factor than the first.
The VLIW concept recognizes that, for a given memory access time, the effective operating speed can be improved if more pieces of data are processed at the same time. For example, eight functional units, each consuming two pieces of data, will operate upon sixteen pieces of data in one cycle. In contrast, a computer that can only use one of its functional units in any one cycle will require eight cycles to operate upon sixteen pieces of data, given a rate of two data pieces per cycle. In the example, the VLIW computer is eight times faster than the traditional computer despite having the same memory access time.
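The arithmetic of this example can be checked with a short sketch. The function name `cycles_needed` and its parameters are hypothetical; the numbers (sixteen pieces of data, two pieces per functional unit per cycle, eight units versus one) come from the example above. The sketch assumes every active unit is fully supplied with data in each cycle.

```python
import math


def cycles_needed(data_items: int, items_per_unit: int, active_units: int) -> int:
    """Cycles to consume data_items when each active unit takes
    items_per_unit pieces of data per cycle (idealized, no stalls)."""
    per_cycle = items_per_unit * active_units
    return math.ceil(data_items / per_cycle)


vliw_cycles = cycles_needed(16, 2, 8)     # all eight units busy each cycle
scalar_cycles = cycles_needed(16, 2, 1)   # one unit usable per cycle
speedup = scalar_cycles / vliw_cycles

print(vliw_cycles, scalar_cycles, speedup)   # prints: 1 8 8.0
```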
Yet the VLIW computer can only achieve the improvement if there is a cache that can supply sixteen pieces of data at the same time to the eight functional units. A true eight-, or more, ported cache array has been thought to be impractical from the perspectives of wireability, performance, power density, and noise tolerance at high speeds, e.g., access times of less than one nanosecond. A true eight-, or more, ported cache array is extremely complex. It has substantial performance problems because, e.g., N ports tend to place heavy loads upon the current sourcing ability and capacitance of the cell output device. In terms of laying out the circuitry, any one output port must be able to drive any one of the pipes/busses.
In contrast, forming the cache into separately-functioning parts, or modules, i.e., interleaving, requires that each interleaf drive only one pipe line. By spreading out the cache via interleaving, the wireability improves, thereby improving performance. Each cell can be built smaller because only one or two output ports need to be driven.
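How simultaneous accesses spread across interleaves can be sketched as follows. The mapping from address to interleaf is not specified above, so the sketch assumes a common low-order scheme in which consecutive cache lines fall in consecutive interleaves; `NUM_BANKS`, `LINE_BYTES`, `bank_of`, and `cycles_to_serve` are hypothetical names and values chosen for illustration only.

```python
from collections import Counter

NUM_BANKS = 8     # assumed number of interleaves
LINE_BYTES = 32   # assumed cache line size in bytes


def bank_of(addr: int) -> int:
    """Interleaf holding addr, assuming low-order line interleaving."""
    return (addr // LINE_BYTES) % NUM_BANKS


def cycles_to_serve(addrs, ports_per_bank: int = 1) -> int:
    """Cycles needed to serve all accesses when each interleaf can
    serve ports_per_bank accesses per cycle."""
    per_bank = Counter(bank_of(a) for a in addrs)
    if not per_bank:
        return 0
    # The busiest interleaf determines the total (ceiling division).
    return max(-(-n // ports_per_bank) for n in per_bank.values())


spread = [i * LINE_BYTES for i in range(8)]               # one access per interleaf
clash = [i * LINE_BYTES * NUM_BANKS for i in range(8)]    # all in interleaf 0

print(cycles_to_serve(spread))   # prints: 1
print(cycles_to_serve(clash))    # prints: 8
```

Under these assumptions, eight accesses that fall in eight different interleaves are served together, while eight accesses that all fall in one single-ported interleaf are serialized. This is why the placement of data among the interleaves matters to an interleaved cache.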
It is an object of the present invention to provide a cache memory that can accommodate multiple simultaneous memory accesses, e.g., load operations to a VLIW computer.
It is an object of the present invention to provide a device that functions equivalently to a true 8-ported-and-interleaved cache memory but one that is not truly 8-ported, i.e., a pseudo-8-ported and interleaved cache memory that can accommodate 8 simultaneous memory accesses.
It is an object of the present invention to provide a pseudo-16-, or greater, ported and interleaved cache memory that can accommodate 16, or more, simultaneous memory accesses.
It is an object of the present invention to provide a method and apparatus for generating effective addresses at which to store data in an 8-, 16-, or greater, ported and interleaved cache memory.
It is an object of the present invention to provide a method of compiling a program to optimize use of an 8-, 16-, or greater, ported and interleaved cache memory.