1. Field of the Invention
The present invention generally relates to stored program digital computers and, more particularly, to a computer capable of splitting a single sequence of instructions into two or more sub-sequences and of processing the sub-sequences in parallel.
2. Description of the Prior Art
Many kinds of parallelism have been exploited in pursuit of greater computational speed in a digital computer. A bit-serial computer is a processor without any parallelism. Such a processor takes eight steps to handle an eight-bit byte and, in general, n steps to handle an n-bit word. Some of the earliest electronic computers were bit-serial machines. The first type of parallelism to be exploited was the processing of more than one bit per step. For example, the S/360 family of processors, manufactured by International Business Machines Corporation (IBM), provided a range of speeds partly by varying the number of bits handled per step. These processors handled eight bits per step at the low end and 64 bits per step at the high end. This kind of parallelism has recently been a key factor in the design of microprocessors.
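The relationship between word width and step count can be illustrated with a short sketch. The function name `steps_per_word` and the rounding-up rule for widths that do not divide evenly are assumptions made for illustration; the text above states only the bit-serial case.

```python
def steps_per_word(word_bits: int, bits_per_step: int) -> int:
    """Steps needed to handle one word when bits_per_step bits are
    processed in each step (idealized model, rounded up)."""
    if word_bits <= 0 or bits_per_step <= 0:
        raise ValueError("widths must be positive")
    # Ceiling division without floating point.
    return -(-word_bits // bits_per_step)


# A bit-serial machine needs n steps for an n-bit word.
print(steps_per_word(8, 1))
# A 64-bit word on a machine that handles eight bits per step.
print(steps_per_word(64, 8))
# The same word on a machine that handles 64 bits per step.
print(steps_per_word(64, 64))
```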
In another kind of parallelism used to increase the speed of a processor, the next instruction to be executed is read from memory (fetched) while a current instruction is being executed. Since the fetching of an instruction must always precede its execution, there is always some speed to be gained by performing these operations in parallel. This kind of parallelism leads directly to a third kind of parallelism, in which the execution of an instruction is divided into several sub-steps, for example, Instruction Decode, Address Generation, Cache Access, and Algorithm Execution. Each of these sub-steps uses separate hardware in the processor so that it may be done in parallel with other parts of preceding and succeeding instructions. A processor organized in this way is commonly called a pipeline processor.
For optimal performance, it is desirable for a pipeline processor to be simultaneously executing a sequence of several instructions, overlapping the instruction decode operations of one with the address generation of another and the algorithm execution operations of yet another. When a sequence of instructions includes a branch instruction which will be taken, the instructions in the sequence following the branch instruction will not be executed. In this instance, the processor must discard any partial results from the execution of these instructions and restart the pipeline at the target address of the taken branch instruction.
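The cost of restarting the pipeline can be estimated with a simple cycle count. The sketch below is an idealized model under stated assumptions, not a description of any particular processor: every sub-step takes one cycle, the default of four stages follows the four sub-steps named above, and a taken branch is assumed to cost `num_stages - 1` discarded cycles unless a different `flush_penalty` is given. The function and parameter names are hypothetical.

```python
def pipeline_cycles(num_instructions, num_stages=4,
                    taken_branches=0, flush_penalty=None):
    """Cycles to run a sequence on an idealized pipeline.

    Assumption: each taken branch discards the partially executed
    instructions behind it, costing flush_penalty cycles.
    """
    if num_instructions == 0:
        return 0
    if flush_penalty is None:
        # Assumed worst case: the branch resolves in the last stage.
        flush_penalty = num_stages - 1
    # Fill the pipeline once, then complete one instruction per cycle.
    ideal = num_stages + (num_instructions - 1)
    return ideal + taken_branches * flush_penalty


print(pipeline_cycles(10))                    # no taken branches
print(pipeline_cycles(10, taken_branches=2))  # two pipeline restarts
```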
To avoid discarding partial results when a taken branch is encountered, some pipeline processors take advantage of an observed property called "persistence of behavior." Persistence of behavior is a term used to describe the tendency of a processor to repeat previously executed instructions and re-access previously accessed data. This repetitive behavior is pervasive and it has been observed in many situations. As an example of how this behavior may occur, consider a large processor running in a multiprocessor environment under the control of an operating system. In this environment, the processor would spend much of its time executing modules from the operating system. For example, very frequent use would be made of the Task Dispatcher and Lock Manager modules. These modules implement tasks in the operating system that would be impractical to individually program for each application. Clearly these modules and similar modules of the operating system are used over and over again.
There are at least two ways in which persistence of behavior is used to improve processor performance. One is the cache memory and the other is the branch history table (BHT). A cache memory stores instructions and data that have been recently accessed by the processor with the expectation that a considerable fraction of those instructions and data will be used again in the near future. When this is the case--and it is very often--the cache can supply the requested instructions and data very quickly.
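A minimal sketch of how a cache benefits from persistence of behavior follows. The class name `TinyCache`, the dictionary used as backing memory, and the least recently used replacement rule are all assumptions chosen for illustration; real caches are organized in lines and sets and are implemented in hardware.

```python
from collections import OrderedDict


class TinyCache:
    """Keeps the most recently accessed addresses (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> value, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, address, backing_store):
        if address in self.lines:
            # Hit: the value is supplied quickly and marked as recent.
            self.lines.move_to_end(address)
            self.hits += 1
            return self.lines[address]
        # Miss: fetch from the slower backing store and remember it.
        self.misses += 1
        value = backing_store[address]
        self.lines[address] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        return value


memory = {addr: addr * 10 for addr in range(8)}
cache = TinyCache(capacity=2)
for addr in [0, 1, 0, 1, 0, 1]:  # repetitive access pattern
    cache.access(addr, memory)
print(cache.hits, cache.misses)
```

With the repetitive access pattern shown, only the first reference to each address misses; every later reference is served from the cache.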
The BHT stores information about branch instructions that the processor has encountered with the expectation that the processor will encounter many of the same branch instructions in the near future and that the outcome of executing the branch instruction will be the same. The stored branch information is used to reduce processor delays resulting from the need to restart a pipeline due to changes in program flow caused by a taken branch instruction. As long as the BHT entries remain valid, a pipeline processor using a BHT may proceed uninterrupted through many branch instructions.
A fourth type of parallelism is to decode two or more successive instructions during each cycle. This type of parallelism assumes that each of the sub-steps, referred to above in the discussion of pipeline processors, may be performed in one machine cycle. This is a common capability of high performance processors. This type of parallelism requires two or more sets of instruction decoding hardware and may require two or more sets of address generation hardware, cache accessing hardware and algorithm execution hardware.
It is noted that each kind of parallelism set forth above is separate and distinct in that it can be used in combination with each other kind. In fact, all four kinds of parallelism are used in many high performance processors; each kind contributing its own performance advantage.
U.S. Pat. No. 3,559,183 to Sussenguth, assigned to the assignee of the present invention, describes a pipeline processor which uses a BHT to choose instructions to be executed. The use of a BHT is based on the observation that most branches, considered individually, are consistently either taken or not taken and, if taken, have a consistent target address. The BHT is a table of taken branches. Each entry in the table includes the address of the taken branch followed by its target address. Because this table is a hardware construct, it has a fixed size, typically from 1024 to 4096 entries. Entries are made only for taken branches as they are encountered. When the BHT is full, making a new entry requires displacing an existing entry. This may be accomplished, for example, by evaluating the entries on a least recently used (LRU) basis as in a cache memory.
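The table organization described above can be sketched in software as follows. This is a behavioral illustration only, not the hardware of the referenced patent. The class and method names are hypothetical, and the choice to delete an entry when its branch is later found not taken is an assumption; the text above states only that entries are made for taken branches.

```python
from collections import OrderedDict


class BranchHistoryTable:
    """Fixed-size table of taken branches with LRU replacement."""

    def __init__(self, capacity=1024):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = OrderedDict()  # branch address -> target address

    def predict(self, branch_address):
        """Return the predicted target, or None to predict not taken."""
        target = self.entries.get(branch_address)
        if target is not None:
            self.entries.move_to_end(branch_address)  # mark as recent
        return target

    def record(self, branch_address, taken, target_address=None):
        """Update the table after a branch has actually executed."""
        if taken:
            self.entries[branch_address] = target_address
            self.entries.move_to_end(branch_address)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # displace LRU entry
        else:
            # Assumption: a stale entry is removed when it proves wrong.
            self.entries.pop(branch_address, None)


bht = BranchHistoryTable(capacity=2)
bht.record(0x100, taken=True, target_address=0x200)
bht.record(0x110, taken=True, target_address=0x300)
print(hex(bht.predict(0x100)))  # known taken branch
bht.record(0x120, taken=True, target_address=0x400)  # table is full
print(bht.predict(0x110))       # displaced as least recently used
```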
U.S. Pat. No. 4,107,773 to Galbreath et al. describes a processor which employs a memory having two independent sections and processing hardware which operates in parallel using one section in conjunction with an arithmetic unit while using the other section in conjunction with data transfer to and from an external memory. The apparatus described in this patent is an example of the third type of parallelism described above.
U.S. Pat. No. 4,295,193 to Pomerene, assigned to the assignee of the present invention, relates to a processor design for simultaneously executing two or more instructions. The instructions to be executed are divided into groups having, at most, n instructions each, for example, during compilation. Each group may have only a predetermined number of data and instruction fetches and, if the group contains a branch, it must be the last instruction. Each instruction in a group uses separate instruction execution hardware. This is an example of the fourth type of parallelism described above.
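The grouping constraints can be illustrated with a short sketch. It enforces only two of the rules mentioned above, namely the limit of n instructions per group and the requirement that a branch end its group; the limits on data and instruction fetches are omitted. The function name, the `is_branch` callback, and the mnemonic strings are hypothetical.

```python
def group_instructions(instructions, max_group_size, is_branch):
    """Split a sequence into groups of at most max_group_size
    instructions, closing a group immediately after any branch."""
    if max_group_size < 1:
        raise ValueError("max_group_size must be at least 1")
    groups, current = [], []
    for instr in instructions:
        current.append(instr)
        # A branch must be last, and a full group cannot grow further.
        if is_branch(instr) or len(current) == max_group_size:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


program = ["LOAD", "ADD", "BR", "STORE", "SUB"]
groups = group_instructions(program, 4, lambda op: op == "BR")
print(groups)
```

In this example the branch closes the first group after three instructions even though the group limit is four.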
U.S. Pat. No. 4,679,141 to Pomerene et al. concerns an optimized BHT for a pipeline processor. The BHT described in this reference includes an active area which contains entries for a small number of branches which the processor may encounter in the near future and a backup area which contains all other entries. Entries are brought into the active area in advance of when they may be encountered by the processor. As entries are removed from the active area they are put into the backup area. The relatively small size of the active area allows it to be designed for speed and to be optimally located within the processor hardware.
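The movement of entries between the two areas might be modeled as in the sketch below. It is a loose illustration under several assumptions: how the processor decides which entries to bring forward is not described above, so the caller supplies the addresses through a hypothetical `stage` method; new entries are assumed to start in the backup area; and the oldest active entry is the one returned to backup when the active area overflows.

```python
from collections import OrderedDict


class TwoAreaBHT:
    """Small active area backed by a larger backup area (illustrative)."""

    def __init__(self, active_capacity):
        if active_capacity < 1:
            raise ValueError("active_capacity must be at least 1")
        self.active_capacity = active_capacity
        self.active = OrderedDict()  # small, fast area
        self.backup = {}             # all other entries

    def add_entry(self, branch_address, target_address):
        # Assumption: a new entry starts in backup unless already active.
        if branch_address in self.active:
            self.active[branch_address] = target_address
        else:
            self.backup[branch_address] = target_address

    def stage(self, branch_addresses):
        """Bring entries into the active area ahead of their use."""
        for addr in branch_addresses:
            if addr in self.backup:
                self.active[addr] = self.backup.pop(addr)
            elif addr in self.active:
                self.active.move_to_end(addr)
            while len(self.active) > self.active_capacity:
                # Entries leaving the active area return to backup.
                old_addr, old_target = self.active.popitem(last=False)
                self.backup[old_addr] = old_target

    def predict(self, branch_address):
        """Only the active area is consulted for a prediction."""
        return self.active.get(branch_address)


table = TwoAreaBHT(active_capacity=2)
table.add_entry(0x10, 0x50)
table.add_entry(0x20, 0x60)
table.add_entry(0x30, 0x70)
table.stage([0x10, 0x20])
print(hex(table.predict(0x10)))
table.stage([0x30])            # active area overflows
print(table.predict(0x10))     # entry has moved back to backup
```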
U.S. Pat. No. 4,766,566 entitled "Performance Enhancement Scheme for a RISC type VLSI Processor" by C. M. Chuang relates to a reduced instruction set computer (RISC) design which uses two execution units to process two instructions in parallel. One execution unit may handle any instruction; the other execution unit may include only a subset of the hardware in the first unit and may therefore be limited to processing only some types of instructions.
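The effect of pairing a full execution unit with a limited one can be approximated by the following sketch. The pairing rule is an assumption made for illustration and is not taken from the referenced patent: two successive instructions issue in the same cycle when at least one of them belongs to the subset the limited unit supports, and data dependencies are ignored. The function name and the mnemonics are hypothetical.

```python
def dual_issue_cycles(instructions, limited_unit_ops):
    """Count issue cycles for an in-order machine with one full unit
    and one unit restricted to limited_unit_ops."""
    cycles = 0
    i = 0
    while i < len(instructions):
        pair = instructions[i:i + 2]
        can_pair = len(pair) == 2 and (
            pair[0] in limited_unit_ops or pair[1] in limited_unit_ops
        )
        # Issue two instructions when the limited unit can take one.
        i += 2 if can_pair else 1
        cycles += 1
    return cycles


simple_ops = {"ADD", "SUB"}
program = ["ADD", "MUL", "MUL", "MUL", "SUB"]
print(dual_issue_cycles(program, simple_ops))
```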