U.S. Pat. No. 5,051,940, issued Sep. 25, 1991, to S. Vassiliadis et al., entitled "Data Dependency Collapsing Hardware Apparatus," is one of several prior developments in the art related to a SCISM processor, a high speed computer which is enabled by compounding and compounding apparatus. Such a processor provides parallel performance for systems which process instructions and data for programs that could be handled by older architectures, but which can also be handled by newer architectures employing the Scalable Compound Instruction Set Machine (SCISM) architecture, which was introduced in the description of U.S. Pat. No. 5,051,940 and in the above referenced applications.
In high speed computers, it is desirable to reduce the time required to complete, or execute, each instruction in order to improve performance. This is typically done by clocking the processor at the maximum rate that can be sustained by the underlying circuitry, or by reducing the average number of clock cycles needed to complete instruction execution through some form of parallel operation. One such form of parallelism well known in the art is pipelining, wherein instruction execution is subdivided into a number of specifically defined steps related to various areas of logic, or pipeline stages, in the processor. As one instruction completes its activity in a given pipeline stage, it is sent to the next stage, and a subsequent instruction can then make use of the stage vacated by the instruction ahead of it. Thus, several instructions are typically being executed simultaneously in such a computer system, but each instruction is dispatched for the execution process one at a time. More recently, in order to further improve performance, computer designs have been developed wherein multiple instructions may be simultaneously dispatched for execution, provided such instructions do not conflict with each other while being executed. Sufficient hardware must be provided so that the instructions which simultaneously occupy a given stage in the pipeline can execute without interfering with each other. Typically, the instructions are processed through the pipeline together and are completed simultaneously, or at least in conceptual order. This mode of execution has been given the name superscalar execution.
One of the difficulties which typically must be addressed in superscalar processor design is making the decision whether multiple instructions may in fact be simultaneously executed. In most cases, the superscalar designs will not be able to simultaneously execute any and all possible combinations of instructions due to interdependencies between some instructions, and perhaps some limitations of the underlying hardware. Therefore, as instructions reach the point where execution is to begin, a decision must be made whether to permit parallel execution, or default to single instruction execution mode. The decision is usually made at the time instructions enter the pipeline, by logic circuits which decode the instructions to detect whether conflicts actually exist. Depending on the particular instruction set architecture, the decoding process may be relatively complicated and require a large number of logic stages. This can reduce performance either by increasing the cycle time of the processor, or by requiring an additional pipeline stage to perform the aforementioned decoding process.
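The kind of conflict check described above can be illustrated with a minimal sketch. It assumes a hypothetical decoded-instruction record (`Instr`, with one optional destination register and a tuple of source registers) and considers only register interdependencies; hardware limitations, and any ability of the hardware to collapse a dependency, are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction (illustrative, not from the source)."""
    name: str
    dest: Optional[str]            # register written, if any
    srcs: Tuple[str, ...] = ()     # registers read


def can_issue_together(first: Instr, second: Instr) -> bool:
    """Conservative check: may `second` execute in parallel with `first`?"""
    if first.dest is not None:
        # Read-after-write: `second` reads a result `first` has not produced yet.
        if first.dest in second.srcs:
            return False
        # Write-after-write: both instructions would update the same register.
        if first.dest == second.dest:
            return False
    return True


add = Instr("ADD", dest="r1", srcs=("r1", "r2"))
sub = Instr("SUB", dest="r3", srcs=("r1", "r4"))   # reads r1, written by ADD
load = Instr("LOAD", dest="r5", srcs=("r6",))      # independent of ADD

print(can_issue_together(add, sub))    # False
print(can_issue_together(add, load))   # True
```

Even this two-instruction check must first extract the register fields from each instruction, which suggests why the equivalent logic for a complex instruction set can require many logic stages.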
SCISM application Ser. No. 07/519,382 provides a solution for the problem of delay caused by the need to analyze instructions for superscalar execution through the expedient of preprocessing the instruction stream and making a determination of groups of instructions suitable for superscalar execution. These groups of instructions are called compound instructions, and are composed of the original instructions and an associated tag which indicates whether parallel execution is permitted. SCISM application Ser. No. 07/522,291 proposes an Instruction Compounding Unit, or ICU, as a means of performing the instruction compounding analysis required by Scalable Compound Instruction Set Machines (SCISM). Instructions are analyzed by the ICU as they are fetched from memory and placed in a cache. The ICU forms the tag, which is logically stored along with the instructions in the cache. Certain problems arise, however, when the ICU concept is applied to S/370 and related architectures. In particular, portions of cache lines that have not been, or cannot be, analyzed for compounding may result.
U.S. Pat. No. 5,051,940 has provided a solution for this problem to a large extent using what is termed the worst-case compounding algorithm. With this algorithm, the contents of a cache line, be it instructions, data, or instructions mixed with data, may be analyzed for compounding in its entirety without regard to any instruction boundaries within the cache line. Still, the problem of compounding across cache line boundaries, or cross-line compounding, remains. An instruction can only be compounded with a subsequent instruction if the subsequent instruction is available for analysis at the time the compounding process occurs. Instructions situated near the end of a cache line may not be considered for compounding unless the next sequentially addressable cache line is also present, and therefore typically are ineligible for parallel execution, thereby decreasing processor performance.
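The cross-line limitation can be sketched as follows. This is a simplified illustration, not the worst-case compounding algorithm itself: it assumes instruction boundaries within the line are already known, represents instructions as plain strings, and takes a hypothetical predicate `compoundable` supplied by the caller. A tag of 1 means only that an instruction could be compounded with its successor; overlapping pairs are not resolved.

```python
from typing import Callable, List, Optional, Sequence


def tag_line(
    line: Sequence[str],
    next_line: Optional[Sequence[str]],
    compoundable: Callable[[str, str], bool],
) -> List[int]:
    """Return one tag per instruction in `line` (1 = may compound with successor)."""
    tags = []
    for i, instr in enumerate(line):
        if i + 1 < len(line):
            successor = line[i + 1]      # successor lies in the same cache line
        elif next_line:
            successor = next_line[0]     # successor lies in the next line, if present
        else:
            successor = None             # next line absent: nothing to analyze against
        ok = successor is not None and compoundable(instr, successor)
        tags.append(1 if ok else 0)
    return tags


def always_ok(a: str, b: str) -> bool:
    """Stand-in predicate that accepts every pair."""
    return True


print(tag_line(["A", "B", "C"], None, always_ok))    # [1, 1, 0]
print(tag_line(["A", "B", "C"], ["D"], always_ok))   # [1, 1, 1]
```

In the first call the last instruction receives a tag of 0 solely because the next sequential line is unavailable, even though the predicate would have accepted the pair.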
The degree to which performance is compromised depends on a number of circumstances, such as cache line size and the frequency of execution of particular sequences of instructions. Larger cache line sizes reduce the percentage of instructions which reside adjacent to cache line boundaries, but there is usually an optimum upper bound on cache line size that, if exceeded, will decrease performance due to excessive storage accesses for unneeded data. Frequency of instruction execution is typically not correlated with cache line boundaries, and it is perfectly possible for a performance-critical loop in the instruction stream to sit astride a cache line boundary. This effect can contribute to unpredictable and unsatisfactory performance.
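The dependence on cache line size can be made concrete with a rough estimate. The sketch below assumes that exactly one instruction per cache line (the last one) loses its compounding opportunity, and that instructions average 4 bytes; both figures are illustrative assumptions, not values taken from the referenced applications.

```python
def boundary_fraction(line_bytes: int, avg_instr_bytes: float = 4.0) -> float:
    """Estimate the fraction of instructions that end a cache line.

    Assumes one boundary-adjacent instruction per line and a uniform
    average instruction length (both are simplifying assumptions).
    """
    instrs_per_line = line_bytes / avg_instr_bytes
    return 1.0 / instrs_per_line


for size in (32, 64, 128, 256):
    print(size, boundary_fraction(size))
# 32 0.125
# 64 0.0625
# 128 0.03125
# 256 0.015625
```

Under these assumptions, doubling the line size halves the affected fraction. The estimate is a static count only; it says nothing about how often the affected instructions execute, which is the concern raised above for loops that straddle a boundary.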
In application Ser. No. 07/522,291, the inventors suggest cache line prefetching as a means of facilitating cross-line compounding. However, cache line prefetching creates other problems, two of which are set out here.
1. Room must be made in the cache for the prefetched line, possibly causing a soon-to-be-needed line to be removed from the cache in favor of the prefetched line, which may, in fact, never be used, resulting in decreased processor performance.
2. Depending on the processor busing structure, prefetching may require occupation of the processor data bus while the line is being prefetched. Consequently, the processor's execution units may be blocked from using the bus while the fetch is in progress. Any such blockage results in decreased performance.
It is desirable to provide a means for allowing compounding of instructions across cache line boundaries without the requirement to prefetch cache lines.