This invention relates to computers which utilize wide instruction words to achieve instruction level parallelism and, more particularly, to methods and apparatus for storing wide instruction words in compressed form and for expanding the compressed instruction words for execution.
One of the approaches to improving microprocessor performance is instruction level parallel processing. Instruction level parallel processing involves execution in parallel of low level machine operations, such as memory loads and stores, integer additions and floating point multiplications. Processors for implementing instruction level parallelism typically include multiple execution units and are controlled by Very Long Instruction Words (VLIWs). Each VLIW specifies the operations that are to be executed in a single cycle and includes multiple operation fields, alternatively referred to as syllables. The source program is typically written in a high level language without attention to operations that can be performed in parallel. The conversion of a source program to machine code which utilizes instruction level parallelism involves scheduling of operations which can be executed in parallel. The scheduling function may be performed by a compiler or by the processor itself. When scheduling is performed by the processor, the processor hardware may become complex. When scheduling is performed by the compiler, the processor simply executes the operations contained in the VLIW. Instruction level parallel processing is described by J. A. Fisher et al. in Science, Vol. 253, Sep. 13, 1991, pp. 1233-1241 and by B. Ramakrishna Rau et al. in the Journal of Supercomputing, Vol. 7, 1993, pp. 9-50.
For maximum utilization of a processor having multiple execution units, each execution unit should perform an operation on every processor cycle. The execution units of the processor may be fully utilized during computation-intensive portions of a program. In this case, all or nearly all of the operation fields, or syllables, of the VLIW are filled. Other portions of the program may not require all of the resources of the processor. In this case, some of the execution units are idle, and one or more operation fields of the VLIW are filled with a no operation (NOP) code.
FIG. 1 illustrates an example of an instruction word 10 containing syllables S1-S6 which, in turn, contain operation codes for functional units F1-F6 respectively. In the example illustrated, functional units F2 and F4 are not needed to execute instruction word 10, and the corresponding syllables S2 and S4 therefore contain NOP codes.
The number of NOPs in a program may be significant. Storing instruction words with significant numbers of NOPs in memory is wasteful of memory space. To avoid inefficient use of memory, techniques for storing wide instruction words in compressed format have been proposed.
In one conventional approach, compressed instructions are stored with a mask word. The operation syllables of the compressed instruction are stored in consecutive memory locations, or words. The mask word encodes where the operation syllables are inserted in the expanded instruction. The remaining syllables of the expanded instruction are filled with NOP codes. Since the mask word is normally only a few bits wide, two or more mask words can be grouped in the same memory word. This approach is illustrated in FIG. 2. An instruction word pair is stored in compressed format in memory as a mask word 20 followed in consecutive memory locations by operations W00, W02, W05, W06, and W07 of a first instruction word and operations W12 and W14 of a second instruction word. A mask field 22 in mask word 20 indicates the locations of the operations W00, W02, W05, W06 and W07 in a first line 34 of instruction cache 24, and mask field 26 indicates the positions of operations W12 and W14 in a second line 36 of instruction cache 24.
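The mask-based compression and expansion described above can be sketched in a few lines of Python. This is a minimal illustration only, not the circuit of the cited patents: the function names `compress` and `expand`, the `NOP` placeholder, and the convention that bit i of the mask corresponds to syllable i (least significant bit first) are all assumptions made for the example. The sample word follows the six-syllable instruction of FIG. 1, in which the second and fourth syllables are NOPs.

```python
NOP = "NOP"  # illustrative placeholder for a no-operation syllable


def compress(instruction):
    """Return (mask, ops) for one wide instruction word.

    Bit i of mask is set when syllable i holds a real operation
    (assumed bit order: least significant bit = first syllable).
    ops lists the non-NOP syllables in their original order.
    """
    mask = 0
    ops = []
    for i, syllable in enumerate(instruction):
        if syllable != NOP:
            mask |= 1 << i
            ops.append(syllable)
    return mask, ops


def expand(mask, ops, width):
    """Rebuild the full-width word, filling unmasked positions with NOPs."""
    remaining = iter(ops)
    return [next(remaining) if (mask >> i) & 1 else NOP for i in range(width)]


# Six-syllable word modeled on FIG. 1: the second and fourth syllables are NOPs.
word = ["S1", NOP, "S3", NOP, "S5", "S6"]
mask, ops = compress(word)

# Expansion restores the original word, with NOPs back in place.
assert expand(mask, ops, len(word)) == word
```

In this sketch only four syllables plus a six-bit mask need to be stored for the six-syllable word, at the cost of a variable-length encoding in memory.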
Due to the variable length of the compressed instruction format in memory, it is necessary to record the offset to the next instruction address somewhere in the instruction itself. The offset must also be stored in the instruction cache so that the program counter can be sequenced correctly and coherency can be maintained between the program counter and the main memory code image. The offset to the next instruction address can be stored in mask word 20 as fields 30 and 32 and can be stored in instruction cache 24 as fields 38 and 40. An instruction compression and expansion technique similar to that shown in FIG. 2 and described above is disclosed in U.S. Pat. No. 5,057,837 issued Oct. 15, 1991 to Colwell et al. and U.S. Pat. No. 5,179,680 issued Jan. 12, 1993 to Colwell et al.
The major disadvantage of using the technique shown in FIG. 2 and described above is that consecutive instructions do not correspond to consecutive instruction cache locations, as they are separated by an address difference that depends on the variable length of the instruction. This introduces an artificial alias for instructions that are physically separated by a distance that is smaller than the instruction cache size. For example, in a 1024 line instruction cache, a code section of 1024 instructions will very likely contain aliases to the same cache locations, unless proper padding is performed by the loader. This padding is possible only if empty spaces are left in main memory. In the example of FIG. 2, instruction pair #n occupies a cache hole left by the previous instructions. To achieve this, the assembler is forced to leave empty memory areas to get to the desired address of the cache hole. In the example of FIG. 2, twelve memory words are wasted to avoid a conflicting address for instruction pair #m.
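The aliasing effect can be illustrated with a small Python sketch. It assumes, purely for illustration, that the cache line of an instruction is the memory word address of its first word taken modulo the number of cache lines; the function names `cache_lines` and `count_aliases` and the instruction lengths used are hypothetical and are not taken from FIG. 2.

```python
def cache_lines(lengths, num_lines):
    """Map each compressed instruction to a cache line.

    lengths[i] is the number of memory words occupied by instruction i.
    Assumed indexing: line = start address modulo num_lines.
    """
    lines = []
    addr = 0
    for n in lengths:
        lines.append(addr % num_lines)
        addr += n  # the next instruction starts n words later
    return lines


def count_aliases(lines):
    """Count instructions that land on a line already used by an earlier one."""
    return len(lines) - len(set(lines))


# Eight variable-length instructions in an eight-line cache (made-up lengths).
variable = cache_lines([3, 1, 4, 2, 5, 1, 2, 3], 8)

# Eight fixed-length, one-word instructions in the same cache.
fixed = cache_lines([1] * 8, 8)

print(count_aliases(variable), count_aliases(fixed))
```

With these made-up lengths, several of the eight instructions collide in the eight-line cache even though the number of instructions does not exceed the number of lines, whereas the fixed-length sequence fills the lines one after another without conflict.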
In summary, the technique shown in FIG. 2 and described above has several disadvantages. The instruction cache must have a larger capacity to store the offset to the next instruction address. Program counter sequencing is complicated because the next instruction address must be computed from the stored offset. Also, the variable instruction length introduces artificial aliases in the instruction cache. Finally, if the loader pads instructions in main memory to avoid the problem of artificial aliases, holes are created in main memory.
In addition, the scheme of FIG. 2 requires the allocation of a fixed number of bits for the bit mask, which can lead to high overhead when there are only a few syllables in the instruction which are not NOPs. This scheme also requires hardware for dispersal of the instruction that occupies a large circuit area and is not easily amenable to bit interleaving.
Accordingly, a need remains for the storage of VLIWs in a compacted format wherein at least a portion of the instruction syllables containing NOPs are eliminated and wherein the compacted instructions are stored sequentially in instruction memory.