1. Field of the Invention
The present invention relates to shared memory multiprocessors and, more specifically, to a single-chip macro-schedule multiprocessor architecture, providing program-controlled cooperation of the processors with explicit use of instruction-level parallelism.
2. Description of the Prior Art
The rapid growth in the number of transistors per chip raises the question of how to obtain a correspondingly higher performance. One alternative is to build larger on-chip memories, but this approach is efficient only up to a certain point, after which adding more cache provides only a minor performance improvement. Thus, a preferred alternative at this point is to exploit more parallelism. There are generally two approaches: instruction-level parallelism (ILP) and thread-level parallelism (TLP).
Use of instruction-level parallelism (ILP) involves parallel execution of instruction groups, which increases performance. There are dynamic (superscalar) and static (Very Long Instruction Word, VLIW, and Explicitly Parallel Instruction Computing, EPIC) approaches to ILP use. With the dynamic approach, parallel instruction groups are generated by hardware at run time; with the static approach, they are generated at compile time. An example of the dynamic approach is the Intel Pentium IV microprocessor (see “Pentium 4 (Partially) Previewed”, Peter N. Glaskowsky, Microprocessor Report, Aug. 28, 2000-01). An example of the static approach is the Intel Itanium microprocessor (see “Merced Shows Innovative Design”, Linley Gwennap, Microprocessor Report, volume 13, number 13, Oct. 6, 1999).
In the dynamic approach, a large dynamic hardware window examines the instructions being executed (in the Pentium IV the window covers over 100 instructions), and all possible operation collisions are resolved within it. In the static approach, the compiler forms instruction groups for parallel execution and schedules their optimal execution with regard to each instruction's execution time and possible inter-instruction collisions. In this case, the instruction groups include only independent instructions. This approach simplifies the processor hardware. The size of a parallel execution instruction group for modern superscalar architecture microprocessors generally reaches 4-6 instructions, with future increases up to 8 instructions (see the IBM Power 4 microprocessor, “IBM's Power 4 Unveiling Continues”, Microprocessor Report Nov. 20, 2000-3). For static architecture microprocessors it generally ranges from 6-12 instructions (see IA-64, Itanium, McKinley) to over 20 instructions (see Keith Diefendorff, “The Russians Are Coming”, Microprocessor Report, pp. 1, 6-11, vol. 13, number 2, Feb. 15, 1999).
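The static grouping described above can be illustrated with a minimal sketch. It is not taken from any of the cited compilers: the function name `form_groups`, the dependency map, and the greedy placement rule are assumptions chosen for illustration, and instruction execution times are ignored for brevity.

```python
from typing import Dict, List, Set


def form_groups(deps: Dict[str, Set[str]], order: List[str], width: int) -> List[List[str]]:
    """Pack instructions into groups of at most `width` independent instructions.

    deps[i] is the set of instructions whose results instruction i needs.
    `order` is the original program order, assumed to list every producer
    before its consumers. Latencies are ignored (illustrative assumption).
    """
    group_of: Dict[str, int] = {}   # instruction -> index of the group holding it
    groups: List[List[str]] = []
    for instr in order:
        # Earliest legal group: strictly after every producer of this instruction.
        earliest = max((group_of[d] + 1 for d in deps.get(instr, set())), default=0)
        g = earliest
        # Skip groups that already hold `width` instructions.
        while g < len(groups) and len(groups[g]) >= width:
            g += 1
        if g == len(groups):
            groups.append([])
        groups[g].append(instr)
        group_of[instr] = g
    return groups


if __name__ == "__main__":
    deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": set()}
    print(form_groups(deps, ["a", "b", "c", "d", "e"], width=2))
```

Because an instruction is always placed in a later group than every instruction it depends on, each group contains only mutually independent instructions, which is the property the static approach relies on.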
A further increase in the size of the parallel execution instruction group leads to physically large monolithic cores and complex control mechanisms, which limit increases in clock frequency. The number of access ports to register files and internal caches grows. The hardware for resolving inter-instruction dependencies in superscalar microprocessors becomes more complicated. In a static architecture microprocessor, the probability of collisions that are not accounted for during compilation grows, which results in violations of the compile-time schedule and causes additional delays at run time. Moreover, design and verification become too complicated and time-consuming.
Thread-level parallelism (TLP) is a promising method of further performance increases for dynamic and static architectures. Use of thread-level parallelism (TLP) involves parallel execution of many program threads in a multiprocessor environment. Threads are weakly coupled or fully independent fragments of one program, which allows them to be executed in parallel on different processors with small overheads for control and synchronization, performed by the operating system and by means of semaphores. However, not all applications can be parallelized in such a way. A major difficulty is posed by parallelization of integer applications, which have data dependencies and short parallel branches, because synchronization using semaphores is very costly for them.
Static architectures have a potential for performance growth in a multiprocessor system due to a more aggressive use of ILP and application of the static scheduling method to parallel execution on many processors. Sources of ILP include truly independent computations within a program (separate expressions, procedures, loop iterations, etc.), as well as compiler optimizations aimed at speeding up the computations through parallel execution of possible alternatives (the so-called speculative and predicated computations). This may increase the utilization of ILP in programs by up to 63%. (See Gary Tyson et al., “Quantifying Instruction Level Parallelism Limits on an EPIC Architecture”, ISPASS-2000, Austin, Tex., 2000.)
The compiler for a static macro-schedule architecture performs global scheduling of the program execution, taking into account the available data and control dependencies. In this case the number of instructions in a group intended for parallel execution (a super-wide instruction) is equal to the total number of instructions in all instruction groups (wide instructions) in all processors of the multiprocessor system. That is, the compiler makes a schedule for synchronous execution of the super-wide instructions in all processors of the system. A sequence of wide instructions to be executed in one processor forms a wide instruction stream, or simply a stream. Thus, the schedule for the whole program execution is divided into a number of streams in accordance with the available number of processors.
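The relationship between super-wide instructions, wide instructions, and streams can be sketched as a simple data layout. The type aliases and the functions `split_into_streams` and `super_wide_size` below are hypothetical names introduced only for illustration; the operation mnemonics are placeholders.

```python
from typing import List

WideInstruction = List[str]                   # operations one processor issues in one step
SuperWideInstruction = List[WideInstruction]  # one wide instruction per processor


def split_into_streams(schedule: List[SuperWideInstruction],
                       num_processors: int) -> List[List[WideInstruction]]:
    """Divide a global schedule into one stream of wide instructions per processor."""
    streams: List[List[WideInstruction]] = [[] for _ in range(num_processors)]
    for step, super_wide in enumerate(schedule):
        if len(super_wide) != num_processors:
            raise ValueError(f"step {step}: expected {num_processors} wide instructions")
        for processor, wide in enumerate(super_wide):
            streams[processor].append(wide)
    return streams


def super_wide_size(super_wide: SuperWideInstruction) -> int:
    """Total instruction count across all wide instructions of one super-wide instruction."""
    return sum(len(wide) for wide in super_wide)


if __name__ == "__main__":
    schedule = [
        [["ld", "add"], ["mul"]],        # step 0: processor 0, processor 1
        [["st"], ["sub", "shl"]],        # step 1
    ]
    print(split_into_streams(schedule, num_processors=2))
    print(super_wide_size(schedule[0]))
```

In this layout a stream is simply the column of the global schedule that belongs to one processor, and the size of a super-wide instruction is the sum of the sizes of its wide instructions.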
While making a schedule for parallel operation of all processors in a multiprocessor system, the compiler forms streams for each processor so as to minimize data and control dependencies between different streams. This shortens the delays caused by the need to access the context of another stream executed in another processor. The streams can be executed independently of each other until an explicitly specified synchronization instruction appears in the instruction sequence. During the program run the static schedule can be violated by collisions that arise in different processors and cannot be accounted for at the compilation stage. Examples of such collisions are cache misses, data-dependent divide and multiply operations, etc. For this reason it is necessary to have synchronization means, i.e., means of maintaining the specified order of execution of separate fragments in different streams so that data and control dependencies are properly resolved. The efficiency of the macro-schedule multiprocessor system depends largely on the efficiency with which interstream context access and synchronization are implemented.
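The need for explicit synchronization can be illustrated with a software analogy. The sketch below uses operating-system threads and a `threading.Event` purely as a stand-in for a synchronization signal between streams; it does not represent the hardware mechanism, and the function name `run_two_streams` and the values used are assumptions for illustration.

```python
import threading
import time


def run_two_streams(producer_delay: float = 0.01) -> int:
    """Run two streams; stream 1 needs a value that stream 0 produces."""
    shared = {}
    ready = threading.Event()   # stand-in for a synchronization signal (assumption)
    result = []

    def stream0() -> None:
        time.sleep(producer_delay)   # models an unpredictable delay, e.g. a cache miss
        shared["x"] = 21             # value that the other stream depends on
        ready.set()                  # explicit synchronization point: signal

    def stream1() -> None:
        local = 2                    # independent work proceeds without waiting
        ready.wait()                 # explicit synchronization point: wait
        result.append(shared["x"] * local)

    threads = [threading.Thread(target=stream1), threading.Thread(target=stream0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(run_two_streams())
```

The consumer stream obtains the correct value regardless of how long the producer is delayed, which is the property the synchronization means must preserve when run-time collisions disturb the static schedule.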
A single-chip multiprocessor is generally best suited for static macro-schedule execution. A single-processor chip has a limited number of external connections because of package constraints. A single-processor chip typically has only a system interface for access to main memory, other processors and I/O. In contrast, a single-chip multiprocessor may include, besides the system interface, very fast and wide interprocessor connections for data exchange, for internal cache coherence support, and for synchronization of the streams executed in parallel.
A single-chip multiprocessor may have virtual processor numbering, which allows independent programs to be executed simultaneously, provided that sufficient processor resources are available. Further performance increases may be attained in a multi-chip system comprising single-chip multiprocessors, in which interchip access and synchronization may be handled in a traditional way using a semaphore method, etc.
Static macro-schedule architecture efficiently uses TLP, since in this case the threads may be considered as streams with weak data and control dependencies.
The ExpLicit Basic Resource Utilization Scheduling (ELBRUS) microprocessor architecture (see Keith Diefendorff, “The Russians Are Coming”, Microprocessor Report, pp. 1, 6-11, vol. 13, number 2, Feb. 15, 1999) is well suited for a single-chip multiprocessor using a static macro-schedule, because the ELBRUS architecture is oriented to the execution of a statically scheduled, clock cycle-precise program with explicit parallelism.
An ELBRUS microprocessor wide instruction may contain over 20 operations (simple instructions such as load, store, add, multiply, shift, logic and others). An ELBRUS microprocessor additionally has speculative and predicated modes of operation, which increase its ability to use ILP efficiently. A scoreboarding feature allows automatic correction of the static schedule when dynamic collisions arise during the program run.
An object of the present invention is therefore a method of synchronization and control of parallel execution of streams of a macro-scheduled program without invoking the operating system, based on the static macro-scheduling of the program. Another object of the present invention is to provide a single-chip multiprocessor with interprocessor connections for fast exchange of register data, acceleration of cache coherency support and synchronization of the execution of parallel streams. A further object of the present invention is an ExpLicit Basic Resource Utilization Scheduling (ELBRUS) microprocessor with means for interprocessor synchronization and interprocessor exchange of data and addresses through the above-mentioned interprocessor connections.