1. Technical Field
The present invention relates in general to an improved data processing system, and in particular to a method and system for efficiently fetching variable-width instructions in a data processing system having multiple instruction prefetch elements. Still more particularly, the present invention relates to a method and system for reducing the program memory space required to implement a data processing system having a number of prefetch elements for fetching variable-width instructions for a central processing unit which may execute one instruction per cycle time, during the execution of multiway branch instructions.
2. Description of the Related Art
Recently, a new data processing system architecture, called a ring of prefetch elements (ROPE), has been disclosed in an article entitled "Getting High Performance With Slow Memory," by Kevin Karplus and Alexandru Nicolau, published in COMPCON, May, 1986, at pages 248-253. The purpose of the architecture is to provide a data processing system capable of sustaining an instruction execution rate of one instruction per cycle time, even during the execution of multiway branch instructions. Such a data processing system may be useful for processing real-time video and multimedia presentations. This ROPE architecture is illustrated as prior art in FIG. 1.
As illustrated, ROPE architecture 20 includes "M" number of prefetch elements 22, 24, 26, and 28, for fetching instructions from an associated memory bank. The determination of the number of prefetch elements (i.e., the number "M") required for a particular application of the ROPE architecture is discussed below in greater detail. Memory banks 30-36 are each associated with one prefetch element 22-28, respectively, and are utilized for storing instructions only. Separate data memory banks may be provided for storing program variables or other temporary or variable data. An advantage of this ROPE architecture is that memory banks 30-36 may be implemented by memory devices which require a relatively long period of time to fetch data when compared to the cycle time of the CPU which receives the instructions. As utilized herein, "cycle time" means the minimum duration between two successive instruction requests by the CPU.
Once prefetch elements 22-28 fetch instructions from memory banks 30-36, such instructions are placed on instruction bus 38 at an appropriate time determined by logic circuits within prefetch elements 22-28 and control signals on control bus 40 and instruction bus 38. Instruction bus 38 is coupled to data path 42, which is part of CPU 44. Control bus 40 receives control information 48, which may include condition codes and the instruction pointer, from CPU 44.
Instructions stored within memory banks 30-36 may be very large (i.e., long). In some implementations, the instruction word may be 512 bits wide, and include a plurality of fields. At least one of such fields contains an instruction to be executed by CPU 44. CPU 44 may also be able to execute instructions represented by instruction words that comprise bits in several fields. Other fields in the instruction word may contain instructions addressed to one of prefetch elements 22-28. Such instructions may command the prefetch element to initiate a prefetch from an associated memory bank or command the prefetch element to place an instruction on instruction bus 38 after execution of a multiway conditional branch.
An advantage to utilizing an architecture that supports a very long instruction word (i.e., a VLIW architecture) is the ability to provide fine-grain parallelism between parallel operations being executed in software. However, VLIW architecture also has disadvantages. One disadvantage of the VLIW architecture is low efficiency memory utilization which may result when fields in the very long instruction word are not utilized.
In a typical implementation of the VLIW architecture, each location in program memory must be configured to store a long instruction word comprised of bits grouped into several fields. However, not every field of every long instruction word will contain instruction information. As the number of program memory locations having fields without instruction information increases, the efficiency of program memory utilization decreases.
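By way of illustration only, the loss of program memory efficiency described above may be sketched as the fraction of instruction word fields that actually carry instruction information. The four-field word layout, the sample program, and the name field_utilization are assumptions introduced for this sketch and do not appear in the cited art.

```python
def field_utilization(program):
    """Fraction of fields in a program that hold instruction information.

    Each long instruction word is modeled as a list of fields; a field
    that carries no instruction information is represented by None.
    """
    total_fields = sum(len(word) for word in program)
    used_fields = sum(1 for word in program for field in word if field is not None)
    return used_fields / total_fields


# Hypothetical two-word program with four fields per word.
program = [
    ["add", "mul", None, None],
    ["load", None, None, None],
]
print(field_utilization(program))  # 0.375
```

In this assumed example, only three of the eight fields stored in program memory contain instruction information.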
In operation, ROPE architecture 20 is capable of supplying data path 42 with one instruction per cycle time, even after the execution of a multiway conditional branch. One instruction per cycle time is placed on instruction bus 38 by a selected prefetch element. Prefetch elements 22-28 are selected to place an instruction on instruction bus 38 by the reception of an "activate" token. Such an activate token is a particular message or bit pattern that signifies permission to place an instruction on instruction bus 38. A prefetch element is not permitted to place an instruction on instruction bus 38 unless that prefetch element holds the activate token. Since only one prefetch element holds the activate token at a given time, only one prefetch element is permitted to place instruction data on instruction bus 38 at any given time.
In a "non-branch" mode of operation, wherein a branch instruction is not currently anticipated within a specified number of cycle times, the activate token is passed from one prefetch element to a next adjacent prefetch element, via control bus 40, continuing in this manner until all of the M prefetch elements have received the activate token, and all prefetch elements have been allowed one cycle time to place a single instruction on instruction bus 38. Once the last (i.e., the Mth) prefetch element has placed an instruction on instruction bus 38, the process continues around the ring of prefetch elements, where the first prefetch element is again permitted to place an instruction on instruction bus 38, so that no cycle time passes without an instruction being placed on instruction bus 38.
For example, if prefetch element 22 holds the activate token, prefetch element 22 places a single instruction from memory bank 30 onto instruction bus 38, which may then be received by data path 42. Thereafter, the activate token is passed from prefetch element 22 to prefetch element 24, and prefetch element 24 is allowed to place an instruction from memory bank 32 on instruction bus 38 during the next cycle time. Once prefetch element 28 has received the activate token and placed an instruction from memory bank 36 on instruction bus 38, the activate token may be passed around the ring to prefetch element 22. In such a manner, the process of executing non-branch instructions may continue indefinitely.
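The circulation of the activate token in the non-branch mode may be sketched, under simplifying assumptions, as a round-robin selection among the M prefetch elements. The function name activate_order and the use of a list index to stand for a position in the ring are hypothetical; the reference numerals are those of FIG. 1.

```python
PREFETCH_ELEMENTS = [22, 24, 26, 28]  # reference numerals of FIG. 1, with M = 4


def activate_order(elements, num_cycles, start=0):
    """Return which prefetch element holds the activate token in each cycle.

    Exactly one element holds the token per cycle time, and the token
    wraps from the last element in the ring back to the first.
    """
    return [elements[(start + cycle) % len(elements)] for cycle in range(num_cycles)]


print(activate_order(PREFETCH_ELEMENTS, 6))  # [22, 24, 26, 28, 22, 24]
```

Because the index wraps modulo M, an instruction is placed on the bus in every cycle time of this sketch, including the cycle in which the token returns to prefetch element 22.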
Because relatively slow memory devices may be utilized to implement memory banks 30-36, prefetch elements 22-28 typically begin the process of fetching an instruction from an associated memory bank several cycle times before that prefetch element is scheduled to receive the activate token and place the fetched instruction on instruction bus 38. Therefore, a "prefetch token" is utilized to initiate a memory access by a selected prefetch element holding such a prefetch token. Prefetch tokens are passed to prefetch elements several cycle times before a prefetch element may receive the activate token. Thus, in non-branch instruction execution, the prefetch element holding the prefetch token precedes the prefetch element holding the activate token, in the ring of prefetch elements, by a number of prefetch elements at least equal to the number of cycle times required to fetch an instruction from memory banks 30-36.
For example, if prefetch element 28 holds the activate token, and is currently placing an instruction on instruction bus 38, the prefetch token is typically located several prefetch elements ahead of prefetch element 28 in the ring of prefetch elements. Thus, a prefetch element, such as prefetch element 26, holding the prefetch token may begin fetching an instruction from memory bank 34 three cycle times before the time the activate token may be received by prefetch element 26. The number of prefetch elements by which the prefetch token precedes the activate token depends upon the speed of memory utilized in memory banks 30-36 and the cycle time of the CPU. Typically, the prefetch token precedes the activate token by a number of prefetch elements equivalent to the number of cycle times required to fetch an instruction from a memory bank.
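The relationship between the two tokens may be sketched as follows, assuming a four-element ring and a memory that requires three cycle times per fetch, as in the example above. The function token_holders and the numbering of cycles from zero are assumptions made for illustration.

```python
PREFETCH_ELEMENTS = [22, 24, 26, 28]  # reference numerals of FIG. 1


def token_holders(cycle, fetch_cycles, elements=PREFETCH_ELEMENTS):
    """Return (activate_holder, prefetch_holder) for a given cycle.

    The prefetch token runs ahead of the activate token by fetch_cycles
    positions in the ring, so each fetch has completed by the time the
    activate token arrives at the same element.
    """
    m = len(elements)
    activate_holder = elements[cycle % m]
    prefetch_holder = elements[(cycle + fetch_cycles) % m]
    return activate_holder, prefetch_holder


print(token_holders(cycle=3, fetch_cycles=3))  # (28, 26)
print(token_holders(cycle=6, fetch_cycles=3))  # (26, 24)
```

In this sketch, prefetch element 26 begins its fetch at cycle 3 and receives the activate token at cycle 6, which is three cycle times later.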
In a "branch" mode of operation, multiple prefetch elements, within the group of M prefetch elements 22-28, must begin to fetch an instruction in anticipation of placing a fetched instruction on instruction bus 38 during the cycle time immediately after the CPU determines which branch in the program will be executed. For example, if a three-way branch is forthcoming, three prefetch elements will receive three different prefetch tokens, each of which instructs a prefetch element to begin fetching an instruction from its associated memory bank. Once the CPU determines which one of the three branches to execute, one of the three prefetch elements will receive the activate token, and that prefetch element will place the next instruction on instruction bus 38, thereby enabling the CPU to continue executing instructions in the selected branch without waiting for instructions to be fetched from memory. The two prefetch elements that were not activated are then made available to receive new prefetch instructions.
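The resolution of a multiway branch among prefetch elements may be sketched as follows. The mapping of branch targets to prefetch elements, the labels "PE0" through "PE2", the target names, and the function resolve_branch are all hypothetical and serve only to illustrate the selection described above.

```python
def resolve_branch(prefetching, taken_target):
    """Select the prefetch element to activate once a branch is resolved.

    prefetching maps each possible branch target to the prefetch element
    that was given a prefetch token for it. Returns the element that
    receives the activate token and the elements that are released.
    """
    activated = prefetching[taken_target]
    released = [element for target, element in prefetching.items()
                if target != taken_target]
    return activated, released


# Hypothetical three-way branch: one prefetch element per possible target.
prefetching = {"target_a": "PE0", "target_b": "PE1", "target_c": "PE2"}
print(resolve_branch(prefetching, "target_b"))  # ('PE1', ['PE0', 'PE2'])
```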
The number of prefetch elements required to implement the ROPE architecture depends upon the number of branches that may be selected during a conditional branch instruction and the number of cycle times required to fetch an instruction from a memory bank. If the data processing system CPU supports B-way conditional branching and the memory access requires C-cycles to fetch an instruction from a memory bank, then at least B*C prefetch elements are required to be able to prefetch instructions necessary to execute one B-way branch while maintaining an execution rate of one instruction per cycle time. For example, if the CPU supports three-way branching (B=3), and the memory requires four cycle times to fetch an instruction (C=4), then the number of prefetch elements required is at least 3*4, or 12.
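The B*C lower bound stated above may be expressed as a short calculation. The function name min_prefetch_elements is an assumption introduced for this sketch.

```python
def min_prefetch_elements(branch_ways, fetch_cycles):
    """Minimum number of prefetch elements needed to cover one B-way branch.

    Each of the B possible paths needs C fetches in flight in order to
    sustain one instruction per cycle time after the branch resolves.
    """
    return branch_ways * fetch_cycles


print(min_prefetch_elements(branch_ways=3, fetch_cycles=4))  # 12
```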
Referring now to FIG. 2, there is depicted a prefetch schedule for performing a series of operations and branch instructions in a data processing system utilizing a plurality of prefetch elements as illustrated in the architecture of FIG. 1. In this figure, a sequence of nine cycle times, cycles A-I, are depicted vertically. During such cycle times, operations 50-72 and branch instructions 74 and 76 may be executed within CPU 44 (see FIG. 1), depending upon which branch is taken at branch instructions 74 and 76. In this example, operations 50-72 are able to complete execution within one cycle time.
Branch instruction 74 illustrates a multiway branch instruction, which, in this example, is a three-way branch instruction. Therefore, after cycle E, the program may execute operation 58, or operation 62, or operation 66, depending upon the outcome of tests performed at branch instruction 74. Multiway branches are made possible by CPUs which may execute a set of prespecified tests during a single cycle time. For example, branch instruction 74 may determine whether the result of operation 56 is less than zero, equal to zero or greater than zero, and then branch accordingly. Thus, operation 58 may be executed if the result of operation 56 is less than zero, or operation 62 may be executed if the result of operation 56 is equal to zero, or operation 66 may be executed if the result of operation 56 is greater than zero.
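The three-way test described for branch instruction 74 may be sketched as a sign comparison. The function three_way_branch is hypothetical; the returned values are the reference numerals of the operations in FIG. 2.

```python
def three_way_branch(result_of_operation_56):
    """Return the reference numeral of the operation executed after branch 74."""
    if result_of_operation_56 < 0:
        return 58  # result is less than zero
    if result_of_operation_56 == 0:
        return 62  # result is equal to zero
    return 66      # result is greater than zero


print([three_way_branch(value) for value in (-5, 0, 7)])  # [58, 62, 66]
```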
During instruction fetches 80-98, prefetch elements 22-28 (see FIG. 1) provide address signals to associated memory banks 30-36, respectively, and receive instruction words during subsequent cycle times. If the speed of memory banks 30-36 is such that four cycle times are required to fetch an instruction, then an instruction prefetch operation, which is conducted by prefetch elements 22-28, must be initiated four cycle times before the instruction is to be placed on instruction bus 38. Therefore, as illustrated in FIG. 2, instruction fetches 80-98 are initiated four cycle times before they are placed on instruction bus 38. As illustrated in cycle B, three instruction fetches 82-86 are initiated in anticipation of branch instruction 74, which is a three-way branch instruction.
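The timing relationship between initiating a fetch and placing the instruction on the bus may be sketched as follows, using the lettered cycles of FIG. 2. The function fetch_start_cycle is an assumption introduced for illustration.

```python
def fetch_start_cycle(issue_cycle, fetch_latency):
    """Return the lettered cycle in which a fetch must be initiated.

    issue_cycle is the cycle (for example "F") in which the instruction
    is to be placed on the instruction bus, and fetch_latency is the
    number of cycle times the memory bank needs to deliver it.
    """
    index = ord(issue_cycle) - ord("A") - fetch_latency
    if index < 0:
        raise ValueError("fetch would have to begin before cycle A")
    return chr(ord("A") + index)


# The operations that follow branch instruction 74 execute in cycle F,
# so their fetches begin in cycle B when the memory needs four cycle times.
print(fetch_start_cycle("F", 4))  # B
```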
It may also be seen in FIG. 2 that several instruction fetches may be in various stages of completion during any given cycle time. For example, during cycle E, instruction fetch 80 is complete, instruction fetches 82-94 are in process, and instruction fetches 96 and 98 have just been initiated. Thus, during cycle E, ten instruction fetches, which are conducted utilizing ten prefetch elements, are in various stages of operation. Those persons skilled in the art will recognize that additional instruction prefetches, which are not illustrated, may be performed during cycle E for operations which follow operations 60, 64, 70, and 72. Additionally, instruction fetches for operations 50-56 are not shown in FIG. 2.
Thus, a person of ordinary skill in the art should appreciate that in order to sustain consecutive B-way branch instructions at a rate of one B-way branch per cycle time in a data processing system utilizing program memory that requires C-cycles to fetch an instruction, the number of prefetch elements required approaches B^(C+1) prefetch elements. Even for a data processing system that permits three-way branch instructions and utilizes memory that requires four cycles to fetch an instruction, approximately 240 prefetch elements would be required.
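Reading the expression above as B raised to the power C+1, the quoted figure of approximately 240 may be checked with a short calculation. The function name is hypothetical, and the value computed is the approximate bound stated in the text rather than an exact count of prefetch elements.

```python
def approx_elements_for_consecutive_branches(branch_ways, fetch_cycles):
    """Approximate bound on prefetch elements when a B-way branch may occur
    in every cycle time and each fetch takes C cycle times: B ** (C + 1)."""
    return branch_ways ** (fetch_cycles + 1)


print(approx_elements_for_consecutive_branches(3, 4))  # 243
```

For comparison, a single three-way branch with the same memory requires only 3*4 = 12 prefetch elements, which shows how quickly the requirement grows when branches follow one another in consecutive cycle times.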
Turning now to FIGS. 3, 4, and 5, there are depicted three data processing architectures known in the prior art. As illustrated in these figures, a data processing system may be modeled as a union of control path and data path. The control path may be modeled as a finite state machine which generates control signals for the data path and reacts to status signals from the data path. Different classes of data processing system architectures may be differentiated by the modeling of the different types of control path structures.
FIG. 3 illustrates a high-level state machine model of a single instruction path/single data path (SISD) data processing system, which is widely utilized in many conventional data processing systems. The basic Moore machine is found in the control path of the model of a classical microprogrammed SISD uniprocessor shown in FIG. 3. In this model, program memory output function 110 depends only on the state variable supplied by program counter 112. Thus, for a given value of program counter 112, a given instruction will execute within various functional units within data path 114.
Next state function 116 is a function of condition codes 118 from data path 114, control path state 120 from program counter 112, and external inputs (not shown). Data path 114 may be comprised of a plurality of functional units which perform a wide variety of operations on multiple data types. Data path 114 is essentially capable of performing all of the operations of a RISC type processor, including loads, stores, and branches. Data path 114 may be able to execute one data operation and one control operation (branch) per cycle.
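The Moore machine model of the SISD control path may be sketched as follows, with the program counter as the state variable, a program memory output function that depends only on that state, and a next state function that also depends on a condition code. The four-instruction program, the single zero flag, and all function names are assumptions made for this sketch.

```python
# Hypothetical program memory; the index is the program counter value.
PROGRAM = ["load", "add", "branch_if_zero", "store"]


def program_memory_output(pc):
    """Output function: depends only on the state (the program counter)."""
    return PROGRAM[pc]


def next_state(pc, zero_flag):
    """Next state function: depends on the state and a condition code."""
    if PROGRAM[pc] == "branch_if_zero" and zero_flag:
        return 0  # branch taken: return to the first instruction
    return (pc + 1) % len(PROGRAM)


def trace(zero_flags):
    """Return the instructions issued, one per cycle, for the given flags."""
    pc, issued = 0, []
    for zero_flag in zero_flags:
        issued.append(program_memory_output(pc))
        pc = next_state(pc, zero_flag)
    return issued


print(trace([False, False, True, False]))
# ['load', 'add', 'branch_if_zero', 'load']
```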
FIG. 4 depicts a high-level state machine model of a very long instruction word (VLIW) data processing system. The VLIW processor has multiple functional units within data paths 124-128, each of which is similar to data path 114 of the SISD processor of FIG. 3. The VLIW model control path portion contains a separate program memory output function 130-134 for each data path 124-128. Condition codes 135 from all data paths 124-128 serve as inputs to next state function 136. In an actual implementation, this implies that condition code and status information from each data path 124-128 feeds back into the instruction sequencer.
FIG. 5 illustrates a high-level state machine model of a variable instruction stream/multiple data stream (XIMD) data processor. The XIMD architecture is an extension to the VLIW architecture (see FIG. 4) that allows for a variable number of instruction streams. The variable number and variable-width of the instruction streams and the low synchronization overhead offered by the XIMD architecture make it feasible to support both fine- and medium-grained machine parallelism. The XIMD architecture provides a sequencer for each data path and adds a distribution network for condition code bits and software-set synchronization bits. The condition code bits result from operations performed on run-time data; software-set synchronization bits are provided directly from the instructions. Each instruction has a field for the sequencer and a field for the data path. The sequencer determines the next instruction address as a function of the sequencer field, the current instruction address, and the software-set synchronization bits and condition code bits of all functional units. These augmentations allow greater capability and flexibility in managing the flow of control than are available for a VLIW.
To synchronize instruction streams in multiple data paths in the XIMD architecture, three forms of synchronization can be effectively implemented: implicit, explicit, and barrier synchronization. Implicit synchronization is possible because the common clock advances each instruction stream at the same rate. As long as instruction streams are kept in lockstep on cache misses and exceptions, the relative delay between operations in different instruction streams remains fixed. The scheduler can implicitly synchronize two dependent operations by creating and maintaining the necessary relative delay.
Explicit synchronization can be used to delay one instruction stream until it receives a signal from another using the sequencer field of an instruction in each instruction stream, the condition code bits, or by communicating through the global register file.
Barrier synchronization is a special case of explicit synchronization. Barrier synchronization is implemented on the XIMD by having each functional unit set its software-set synchronization bit when it reaches the barrier. The functional unit waits by repeatedly executing a conditional branch instruction that jumps to itself until all software-set synchronization bits are set. The synchronization overhead for explicit synchronization on the XIMD is on the order of a few cycle times, depending on the branch delay and the latency of the condition code and software-set synchronization bit distribution network. A typical multiple instruction stream/multiple data stream machine uses explicit synchronization in the form of semaphores, communication through memory, or operating system calls to synchronize processes on different processors. Synchronization by these techniques results in overhead that ranges from ten to hundreds of cycle times, making the granularity of parallelism that can be effectively exploited rather coarse. The XIMD's low synchronization overhead makes the exploitation of fine-grained parallelism and the coordination of instruction streams on a cycle-by-cycle basis feasible.
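Barrier synchronization by means of software-set synchronization bits may be sketched as a cycle-by-cycle simulation. The sketch assumes that a bit becomes visible to all functional units in the same cycle in which it is set, so the branch delay and the latency of the distribution network are ignored; the function barrier_release_cycle and the arrival times are hypothetical.

```python
def barrier_release_cycle(arrival_cycles):
    """Return the cycle in which every functional unit can pass the barrier.

    arrival_cycles[i] is the cycle in which functional unit i reaches the
    barrier and sets its software-set synchronization bit. A unit that has
    arrived keeps branching to itself until all bits are set.
    """
    sync_bits = [False] * len(arrival_cycles)
    cycle = 0
    while True:
        for unit, arrival in enumerate(arrival_cycles):
            if cycle >= arrival:
                sync_bits[unit] = True  # unit has reached the barrier
        if all(sync_bits):
            return cycle  # self-branch falls through for every unit
        cycle += 1


print(barrier_release_cycle([2, 5, 3]))  # 5
```

Under these assumptions, the barrier is released in the cycle in which the slowest functional unit arrives.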
When compared to the VLIW processor of FIG. 4, program memory output functions 142-146, and data paths 148-152, are unchanged. The remaining portion of the control path, program counter 154 and next state function 156, has been duplicated for each data path 148-152. This results in separate program counters 154, 158, and 162, for each data path 148-152. Also, separate next state functions 156, 160, and 166, represent separate address generation and sequencing hardware for each data path 148-152. Thus, in the XIMD model, each of next state functions 156, 160, and 166 is a function of the data path state of each data path 148-152 and each control path state from program counters 154, 158, and 162.
Therefore, the problem remaining in the prior art is to provide a data processing system that has performance capabilities substantially similar to those of a data processing system utilizing ROPE architecture, but that utilizes substantially fewer prefetch elements and utilizes program memory more efficiently by reducing the amount of program memory space that is not utilized to store instruction information.