Computer processors process information faster than the information can be supplied to the them. Much effort has been directed toward reducing latency, i.e., reducing the time that a processor is idle while waiting for the information it needs to proceed.
The technique of multi-level caching of instructions and/or data has been developed to reduce latency. Caches are fast memories of varying speeds, located in close proximity to the processor, that store instructions and/or data with a high probability of being needed by the processor. A level one or L1 cache is physically located closest to the processor, retrieves information the fastest, but also is the most expensive (in terms of the number of memory cells provided per unit of chip surface area), and hence also is the smallest of the caches. The next level of cache, L2, does not retrieve information as fast, is less expensive per unit area, and is larger than the L1 cache. If present, an L1.5 cache is located between the L1 and L2 cache, is smaller, faster and more expensive than the L2 cache, and bigger, slower and less expensive than the L1 cache.
Another technique for reducing latency is to prefetch or load instructions to the L2 instruction cache (Icache) before the processor attempts to retrieve them from the Icache to increase the likelihood that the instructions will be in the Icache when needed. This entails inspecting instructions that will be executed two or three cycles in the future, and determining if a branch to an instruction other than the subsequent serial instruction will take place. If no branch will take place, then the instruction serially subsequent to the inspected instruction will be loaded into the L2 Icache. If branching will take place either because the branch is an unconditional branch or a conditional branch of a loop, then the branched-to instruction is fetched. The essential characteristic of prefetching is the determination of whether a simple branch will occur.
One fast computer architecture is a hybrid architecture having a single CPU with characteristics of both a uniprocessor and a parallel machine. In this approach, a single instruction register and instruction sequence unit execute programs under a single flow of control, but multiple arithmetic/logic units (ALUs) within the CPU can each perform primitive operations simultaneously. Rather than relying upon hardware to determine all the simultaneous operations that can be executed, a compiler formats or groups the instructions before execution to specify the parallel operations. Because the instruction word held in the instruction register must specify multiple independent operations to be performed by the different ALUs, this approach employs a very long instruction word, and is commonly known as very long instruction word (VLIW) architecture.
A central processing unit (CPU) of a computer system utilizing a VLIW architecture includes an instruction register large enough to hold a VLIW, an instruction sequencing unit, a bank of data registers, a set of arithmetic/logic units (ALUs), and instruction decode logic. The instruction register holds machine-level instructions for execution by the CPU. The bit positions in the instruction register correspond to a set of parcels or fields, each parcel corresponding to a different respective one of the ALUs. The operation, if any, performed by each ALU during a machine cycle is specified by the corresponding parcel.
Each parcel of an instruction in the instruction register may contain such information as an operation code (op code), source and destination registers, special registers such as condition code registers, immediate data, storage addresses, etc. In order to reduce the total number of bits required to store an instruction, at least some of the bit positions required to specify such information to the instruction decode logic are implied by the position of the parcel within the instruction word.
A VLIW architecture, can in many applications, achieve greater parallelism and greater speed than multiple independent processors operating in parallel. The theory underlying VLIW is that the typical application program has a single flow of control, but many of the primitive operations within that flow can be performed in parallel. Therefore an automated compiler for a VLIW machine does not have to alter program flow (something which has been almost impossible to automate in parallel processor machines). It only has to determine which primitive operations can be performed in parallel. While even this is a difficult task in practice, it lends itself to automation much more readily than the altering of program flow.
VLIW designs employ a large instruction word for several reasons. First, each of the ALUs requires its own command, which can include an operation code, source and destination designations, etc. Second, there must be a conditional branching mechanism appropriate to the VLIW architecture. Because many simple operations are being performed with each instruction, the effectiveness of a VLIW machine would be limited if only one conditional branch were allowed in a given instruction, as is usually the case in a conventional von Neumann computer. Therefore it is desirable to permit conditional branching to multiple destinations from a single VLIW instruction, a characteristic referred to as N-way branching. Of course, all of the branch conditions and destinations must in some way be specified in the instruction. Third, because a theoretically pure VLIW design employs a large pool of data registers, and other special registers, any of which can be assigned arbitrarily as source or destination for the various operations, the number of bits in the instruction required for identifying each source and destination register is greater than for a conventional von Neumann design employing a smaller number of registers.
With the development of Very Long Instruction Word (VLIW) computers, much greater demands have been placed upon the supporting hardware, such as memory, buses, etc., but especially instruction cache (Icache). A VLIW computer drains an Icache about five times faster than a conventional von Neumann computer because of the parallelism inherent to VLIW computation.
About half of the instructions in a VLIW ultimately contribute no useful work to the processor. As a result, compared to an Icache holding instructions of a von Neumann computer, an Icache holding VLIWs holds about 50% less instructions that are likely to contribute useful work. Yet this situation is tolerable because of the massive parallelism achieved.
As alluded to above, it is a characteristic of VLIW computers that each VLIW has at least one branch target, and usually two or three. It is difficult to predict if the typical VLIW will cause a branch to occur, and if so, what will be the address to which the branch goes. It is a problem that, when a typical VLIW branches, the latency associated with retrieving the branched-to VLIW into the pipeline causes the processor to slow significantly.