The present invention relates generally to the field of computer memory systems and more particularly to computer memory systems for storing and delivering instructions to a central processing unit.
In spite of the many improvements which have been made in computing systems, modern computers differ little from the classical Von Neumann machines. A Von Neumann machine consists of a central processing unit and a memory. The computing power is concentrated in the central processing unit. The memory merely stores the instructions and data used by the central processing unit. The memory consists of a contiguous block of fixed length data words. Each data word is accessed by specifying the location of the desired data word relative to the first data word in the memory. The location of a desired data word is normally referred to as the address of the data word. The central processing unit accesses the data and instructions stored in the memory by computing the address of the desired data and then communicating that address to the memory unit. This division of labor between the memory and the central processing unit is inefficient for a number of reasons.
First, a considerable fraction of the typical central processing unit's time must be devoted to calculating addresses. While the central processing unit is calculating an address, it is not free to perform other tasks, since the hardware needed for address computations is also needed for most of the other instructions carried out by the central processing unit. In addition, the hardware needed to carry out the address calculations significantly increases the physical size of the central processing unit. This is particularly important in VLSI designs in which the central processing unit is constructed on a single chip. As the size of the chip increases, the speed at which the central processing unit can operate, in general, decreases. This decrease in speed results from the longer connecting paths which are needed to connect the various elements on the chip. These long paths contain parasitic capacitances which must be charged and discharged each time the logic level on the conductor in question is changed. The time to charge and discharge these capacitances limits the speed of the central processing unit. Further, the cost of a VLSI circuit is directly related to the physical size of the chip on which it is located. This cost increases rapidly with increasing chip size.
A second source of inefficiency arises from the use of the same memory for storing both instructions and data. The number of bits in each word is the same whether it is used for storing an instruction or a data value. However, the optimum size for a data word will in general be quite different from the optimum size of a storage word for an instruction. As a result, a compromise word length must be used.
The use of the same memory for both data and instructions may also result in a slower data processing system. In general, the time needed to access a specific memory word in a VLSI memory increases with the physical size of the memory. Hence, if the instructions are located in the same large memory as the data on which these instructions operate, the time needed to fetch an instruction will, in general, be longer than it would be if the instructions were stored in a small memory, because the longer signal paths needed to access the additional memory cells introduce larger parasitic capacitances. In addition, since both the data and instructions use the same bus, data and instructions cannot be accessed concurrently, which further reduces processing speed. These problems may be overcome to some degree by the use of small high speed cache memories which act as a buffer between the large slower memory and the central processing unit. In such systems, blocks of data and/or instructions are transferred to the cache memory by a separate processor while the central processing unit is executing other instructions. However, for this strategy to be successful, one must have some way of predicting which blocks of data or instructions will be needed next. This problem has not been adequately solved in prior art systems employing such cache memories.
The processing delays incurred while the central processing unit calculates the address of the next instruction and fetches this instruction may be overcome to some extent by the use of a "pipelined" central processing unit. A typical pipelined central processing unit can process several instructions concurrently. It is divided into stages. The first stage is typically devoted to fetching the next instruction to be processed. This instruction is then sent to the second stage which decodes the instruction. That is, it converts the instruction "opcode" to a series of binary bits which are placed in an internal register. These bits are used to control the various gates in the third stage which is responsible for the execution of the instruction. A final stage which is responsible for storing results back in the memory is also sometimes included in such pipelined processors.
A four stage pipelined processor of the type described above can work on four instructions concurrently. Each instruction requires four memory cycles to complete. On each memory cycle, a new instruction is input to the processor and one of the instructions previously entered into the processor is completed. Hence, the pipelined processor effectively executes one instruction per memory cycle even though the time to complete a single instruction is four memory cycles. This is the result of concurrently performing calculations on four instructions at a time. In any given memory cycle, the next instruction is fetched into the first stage. The instruction which entered the processor on the previous memory cycle will have been passed on to the second stage where it will be decoded during this memory cycle. The instruction which entered the processor two memory cycles earlier will now have been passed on to the execution stage where it will be executed during this memory cycle. Finally, the instruction which entered the processor three memory cycles earlier will now have entered the final stage in which results are stored back in the memory.
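By way of illustration only, the overlap described above may be sketched as a small timing model. The stage names and the functions pipeline_schedule and total_cycles are assumptions introduced for this sketch and do not describe any particular processor. The model assumes that no jump instructions are present, so that instruction i occupies stage s during memory cycle i + s.

```python
# Illustrative model only; stage and function names are assumptions.
STAGES = ["fetch", "decode", "execute", "store"]


def pipeline_schedule(num_instructions, num_stages=len(STAGES)):
    """Map each memory cycle to {stage index: instruction index}, assuming no jumps."""
    schedule = {}
    for instr in range(num_instructions):
        for stage in range(num_stages):
            cycle = instr + stage  # instruction i reaches stage s on cycle i + s
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Memory cycles needed to complete every instruction in an ideal pipeline."""
    if num_instructions == 0:
        return 0
    return num_instructions + num_stages - 1


if __name__ == "__main__":
    for cycle, slots in sorted(pipeline_schedule(6).items()):
        row = ", ".join(f"{STAGES[s]}=I{i}" for s, i in sorted(slots.items()))
        print(f"cycle {cycle}: {row}")
    print(total_cycles(6))  # 9
```

Under these assumptions, six instructions complete in nine memory cycles rather than twenty-four, and once the pipeline is full one instruction completes on each memory cycle.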
This improvement in throughput is only fully realizable when the processor is executing a program which does not contain a large number of "jump" instructions. As used herein, a jump instruction is any instruction whose execution results in an instruction other than the instruction following said jump instruction being executed next. Pipeline processors are based on the assumption that the next instruction to be executed is stored in the memory at a location immediately after that at which the current instruction was stored. A counter is maintained which specifies the address of the next instruction to enter the pipeline. Each time an instruction enters the pipeline, this counter is incremented. Although the next instruction is usually the one following the last instruction in the memory, there are a large number of cases in which the next instruction is located elsewhere in the memory. These cases occur when a jump instruction is encountered. Jump instructions may either be conditional or unconditional. An unconditional jump instruction specifies the location for the next instruction to be executed which is different from the next sequentially stored instruction in the memory. A conditional jump instruction specifies that the next instruction is to be the next sequentially stored instruction in the memory unless a specified condition is met. If the specified condition is met, the next instruction is to be the one located at the address specified in the jump instruction.
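The selection of the next instruction address described above may be sketched as follows. This is a minimal sketch only: the opcode names "OP", "JUMP" and "JUMP_IF", the Instruction record and the function next_address are hypothetical names chosen for illustration, and instructions are assumed to occupy one address each.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instruction:
    opcode: str                   # "OP", "JUMP" or "JUMP_IF" (hypothetical names)
    target: Optional[int] = None  # jump destination, unused for "OP"


def next_address(counter: int, instr: Instruction, condition_met: bool = False) -> int:
    """Return the address of the instruction to be executed after instr."""
    if instr.opcode == "JUMP":
        return instr.target       # unconditional: always go to the target
    if instr.opcode == "JUMP_IF" and condition_met:
        return instr.target       # conditional: go to the target only if the condition holds
    return counter + 1            # otherwise, the next sequentially stored instruction


if __name__ == "__main__":
    print(next_address(5, Instruction("OP")))
    print(next_address(5, Instruction("JUMP", 20)))
    print(next_address(5, Instruction("JUMP_IF", 20), condition_met=False))
    print(next_address(5, Instruction("JUMP_IF", 20), condition_met=True))
```

A simple incrementing counter implements only the last line of next_address, which is why a pipeline that relies on such a counter fetches the wrong instructions whenever a jump is taken.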
In the case of the four stage pipeline processor described above, a jump instruction will not be executed until it is in the execution stage of the processor. By this time, the two instructions stored in memory after the jump instruction in question will have also entered the pipeline. These are the wrong instructions if the jump specified in the jump instruction is executed, since the next instruction to be executed after the jump instruction is the one specified in the jump instruction, not the one following the jump instruction in the memory. Hence, when the jump instruction is executed, the instructions already in the pipeline must be discarded and the pipeline refilled starting with the instructions at the address specified in the jump instruction. As a result, at least two memory cycles will be lost, i.e., the time needed to load and process the two instructions which replace the two instructions which were discarded after the jump instruction was executed.
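The loss of two memory cycles may be illustrated with a cycle-by-cycle model of the four stage pipeline. The sketch below is an assumption-laden simplification: the function run_pipeline and the opcode names are hypothetical, only unconditional jumps are modeled, and jump targets are assumed to lie forward of the jump so that the program terminates.

```python
def run_pipeline(program):
    """Simulate a four stage pipeline (fetch, decode, execute, store).

    program is a list of (opcode, target) tuples, where opcode is "OP" or
    "JUMP" (hypothetical names). Jump targets must point forward.
    Returns (memory cycles used, instructions completed, instructions discarded).
    """
    pc = 0
    pipe = [None, None, None, None]  # addresses held in fetch, decode, execute, store
    cycles = completed = discarded = 0
    while True:
        # Fetch the instruction at the counter, then increment the counter.
        fetched = pc if pc < len(program) else None
        if fetched is not None:
            pc += 1
        pipe = [fetched] + pipe[:3]  # every instruction advances one stage
        if all(slot is None for slot in pipe):
            break
        cycles += 1
        in_execute = pipe[2]
        if in_execute is not None and program[in_execute][0] == "JUMP":
            # The instructions fetched after the jump are the wrong ones.
            discarded += sum(slot is not None for slot in pipe[:2])
            pipe[0] = pipe[1] = None
            pc = program[in_execute][1]  # refill from the jump target
        if pipe[3] is not None:
            completed += 1
    return cycles, completed, discarded


if __name__ == "__main__":
    straight = [("OP", None)] * 6
    with_jump = [("OP", None), ("JUMP", 4), ("OP", None),
                 ("OP", None), ("OP", None), ("OP", None)]
    print(run_pipeline(straight))   # (9, 6, 0)
    print(run_pipeline(with_jump))  # (9, 4, 2)
```

In this model the program containing the jump completes only four instructions (addresses 0, 1, 4 and 5) yet still takes nine memory cycles, two more than the seven cycles that four instructions would need in an ideal pipeline.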
Jump instructions also complicate the use of cache memories. Prior art cache memory systems are transparent to the central processing unit. The cache memory is a fast memory which is inserted into the system between the central processing unit and a slower large capacity memory. When the central processing unit requires a data word, it places the address of the data word in question on a bus which is monitored by the cache. If this word is already stored in the cache memory, it is sent from the cache memory to the central processing unit. If the data word in question is not already in the cache, the central processing unit must wait while it is loaded into the cache. The extent to which such a cache memory can be used to increase processor speed depends upon its ability to predict the next block of data words which will be needed by the central processing unit. Unless the cache memory processor which makes this prediction can recognize jump instructions and determine the address to which the jump will be made, it cannot make an accurate prediction. Hence, it will not have loaded the appropriate data words when a jump instruction is executed by the central processing unit. Thus, to be effective, the jump instruction recognition and decoding logic which is present in the central processing unit must be duplicated in the cache processor.
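The effect of a jump on a cache whose prediction is purely sequential may be sketched as follows. The prediction policy shown (holding the current block and preloading the block that follows it), the block size of four words and the function count_misses are all assumptions made for illustration; prior art cache processors may use other policies.

```python
BLOCK_SIZE = 4  # words per block (assumed value)


def count_misses(address_trace, block_size=BLOCK_SIZE):
    """Count the requests for which the central processing unit would have to wait.

    Models a cache which holds the block most recently used together with the
    next sequential block, which it is assumed to have preloaded in time.
    """
    cached_blocks = set()
    misses = 0
    for address in address_trace:
        block = address // block_size
        if block not in cached_blocks:
            misses += 1  # word not in the cache: the processor waits
        # Sequential prediction: keep this block and preload the following one.
        cached_blocks = {block, block + 1}
    return misses


if __name__ == "__main__":
    sequential = list(range(16))
    with_jump = [0, 1, 2, 40, 41, 42, 43, 44]  # a jump from address 2 to address 40
    print(count_misses(sequential))  # 1
    print(count_misses(with_jump))   # 2
```

In this model the only wait in the sequential trace is the initial load, whereas the jump to address 40 causes an additional wait because the sequential prediction had preloaded the block following address 2 rather than the block containing the jump target.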
Broadly, it is an object of the present invention to provide a memory system which is optimized for the delivery of instructions to a central processing unit.
It is a further object of the present invention to provide an instruction memory system in which the width of the data words used to store the instructions is matched to the instruction set of the central processing unit to which said memory system is connected.
It is a still further object of the present invention to provide an instruction memory system in which jump instructions are executed in the memory system thereby eliminating the need to provide duplicate hardware in the central processing unit for executing such jump instructions.
It is yet another object of the present invention to provide an instruction memory system which substantially reduces the number of situations in which the pipeline of a pipelined processor must be emptied and refilled in response to a jump instruction.
It is yet another object of the present invention to provide an instruction memory system which relieves the central processing unit of the task of calculating the address of the next instruction to be executed by the central processing unit.
These and other objects of the present invention will become apparent from the following detailed description of the present invention and the accompanying drawings.