A high-speed computer needs fast access to data in memory. The largest and fastest of such computers are known as supercomputers. One method of speeding up a computer is by "pipelining," wherein the computer's digital logic between an input and an output is divided into several serially connected successive stages. Data are fed into the computer's input stage before data previously input are completely processed through the computer's output stage. There are typically many intermediate stages between the input stage and the output stage. Each stage performs a portion of the overall function desired, adding to the functions performed by previous stages. Thus, at any given time, multiple pieces of data are at different stages of processing along the pipeline between the input and output stages. Preferably, each successive system clock cycle propagates the data one stage further in the pipeline.
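The overlap described above can be illustrated with a minimal simulation sketch. It is not drawn from any particular machine: the function name run_pipeline, the three example stage functions, and the use of None to mark an empty stage are all assumptions made for illustration.

```python
def run_pipeline(stages, inputs):
    """Simulate a linear pipeline in which every clock advances each item one stage.

    stages: list of one-argument functions, one per pipeline stage (hypothetical).
    inputs: items fed in, one per clock. None is reserved as the "empty" marker,
            so input items and stage results must not be None.
    Returns (outputs, clocks_used).
    """
    n = len(stages)
    pending = list(inputs)
    total = len(pending)
    slots = [None] * n  # slots[i]: result produced by stage i on the latest clock
    outputs = []
    clocks = 0
    while len(outputs) < total:
        clocks += 1
        new_slots = [None] * n
        # A new item enters stage 0 while older items are still in flight.
        new_slots[0] = stages[0](pending.pop(0)) if pending else None
        # Every later stage works on what the previous stage produced last clock.
        for i in range(1, n):
            prev = slots[i - 1]
            new_slots[i] = stages[i](prev) if prev is not None else None
        slots = new_slots
        if slots[-1] is not None:
            outputs.append(slots[-1])
    return outputs, clocks


# Three illustrative stages: add 1, double, subtract 3.
example_stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results, clocks_used = run_pipeline(example_stages, [1, 2, 3, 4])
print(results, clocks_used)
```

In this sketch, four items pass through three stages in six clocks (number of items plus number of stages minus one), whereas processing each item to completion before admitting the next would take twelve stage times.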
As a result of pipelining, the system clock can operate at a faster rate than the system clocks of non-pipelined machines. In some of today's computers, the system clock has a cycle time as short as two nanoseconds ("ns"), allowing up to 500 million operations per second through a single functional unit. Parallel functional units within each processor, and parallel processors within a single system, allow even greater throughput. Achieving high-performance throughputs is only possible, however, if data are fed into each pipeline at close to the system clock rate.
As processor speeds have increased, the size of memory in a typical computer has also increased drastically. In addition, error-correction circuitry is now placed in the memory path to increase reliability. Memory-access speeds have improved over time, but the increased size of memory and the complexity of error-correction circuitry have meant that memory-access time has remained approximately constant. For example, a typical supercomputer system clock period may have improved from roughly 8 ns to 4 ns to 2 ns over three generations. Over the same time period, memory-access times may have remained at approximately 96 ns. These times mean that the 8-ns processor accesses memory in 12 clocks, the 4-ns processor in 24 clocks, and the 2-ns processor in 48 clocks. As a result, a computer which randomly accessed data throughout memory would see almost no overall data-processing-speed improvement even if the system clock rate were increased dramatically.
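The clock counts quoted above follow from dividing the memory-access time by the clock period. The short sketch below reproduces that arithmetic; the function names are hypothetical, and rounding up to a whole number of clocks is an assumption that does not affect these particular figures.

```python
import math


def access_latency_clocks(access_ns, clock_ns):
    """Number of whole clock periods spanned by one memory access."""
    return math.ceil(access_ns / clock_ns)


def peak_ops_per_second(clock_ns):
    """Upper bound of one operation per clock through a single functional unit."""
    return 1e9 / clock_ns


MEMORY_ACCESS_NS = 96  # approximate access time quoted in the text

for clock_ns in (8, 4, 2):
    clocks = access_latency_clocks(MEMORY_ACCESS_NS, clock_ns)
    print(f"{clock_ns}-ns clock: {clocks} clocks per memory access, "
          f"peak {peak_ops_per_second(clock_ns):.0f} operations per second")
```

The absolute access time stays at 96 ns in every row; only the number of clocks spent waiting grows, which is why randomly located accesses gain little from a faster clock.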
One solution has been to organize data into vectors, each comprising a plurality of data elements, wherein, during processing, each element of a vector has similar operations performed on it. Computer designers schedule various portions of the memory to simultaneously fetch various elements of a vector, and these fetched elements are fed into one or more parallel pipelines on successive clock cycles. Within a processor, the vector is held in a vector register comprised of a plurality of vector register elements. Each successive vector-register element holds a successive element of the vector. A "vector-load" operation transfers a vector from memory into a vector register. For example, a vector in memory may be held as a vector image wherein successive elements of the vector are held in successive locations in memory. A vector-load operation moves the elements which comprise a vector into pipelines which couple memory to the vector registers. Overlapped with these vector-load operations, there could be two other pipelines taking data from two other vector registers to feed a vector processor, with the resultant vector fed through a pipeline into a third vector register. Examples of such designs are described in U.S. Pat. No. 4,661,900 issued Apr. 28, 1987 to Chen et al. and U.S. Pat. No. 5,349,667 issued Sep. 20, 1994 to Cray et al., which are hereby incorporated by reference. For example, in a well-tuned system using 2-ns pipeline clocks, the throughput can approach 500 million operations per second for a single vector processor, even with relatively slow memory-access times.
On the other hand, a scalar processor operating in such a system on somewhat randomly located data must deal with a 48-clock pipelined-access time, and must often wait for the results from one operation before determining which data to request next.
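The contrast between streamed vector accesses and dependent scalar accesses can be made concrete with a rough cycle-count sketch. It assumes an idealized memory that delivers one element per clock once a stream has started, with no bank conflicts; the 64-element vector length and both function names are illustrative assumptions, not figures from the text.

```python
def streamed_clocks(n_elements, latency_clocks):
    """Clocks to receive n elements from one pipelined request.

    The first element arrives after the full access latency; each further
    element arrives one clock later (idealized, no bank conflicts).
    """
    if n_elements <= 0:
        return 0
    return latency_clocks + (n_elements - 1)


def dependent_clocks(n_accesses, latency_clocks):
    """Clocks for n accesses when each address depends on the previous result,
    so that no two accesses can overlap."""
    return n_accesses * latency_clocks


LATENCY_CLOCKS = 48   # 96-ns access time at a 2-ns clock, as in the text
VECTOR_LENGTH = 64    # hypothetical vector length

print(streamed_clocks(VECTOR_LENGTH, LATENCY_CLOCKS))   # vector load
print(dependent_clocks(VECTOR_LENGTH, LATENCY_CLOCKS))  # chained scalar loads
```

Under these assumptions the streamed load pays the 48-clock latency once, while the chain of dependent scalar loads pays it on every access.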
In very-high-speed vector processors, such as the Cray Y-MP C90 manufactured by Cray Research Inc., the assignee of the present invention, a computer system contains a number of central processing units ("CPUs"), each of which may have more than one vector processor and more than one scalar processor. The computer system also contains a number of common memories which store the programs and data used by the CPUs. Vector data are often streamed or pipelined into a CPU from the memories, and so a long access time may be compensated for by receiving many elements on successive cycles as the result of a single request. In contrast, scalar data read by one of the CPUs from one of the common memories may take an inordinate amount of time to access.
Many computers use virtual or logical addresses to simplify the generation of programs. In such a computer, there are typically a plurality of programs simultaneously loaded into memory, and the computer time-multiplexes among these programs. In some computers, programs must be loaded into memory in their entirety before they can run, and must remain there while they run. If programs must be placed into just one or two segments in physical memory, the memory-manager program must find or provide a contiguous region of physical memory for each such large segment. If no region large enough can be found, programs in physical memory must be swapped out to mass storage (such as disk storage) in order to make room. In order to provide finer granularity and ease the task of memory management, a program may be subdivided into smaller pieces.
One conventional method of doing this is to provide a large number of page-sized divisions of a logical-address space, and map these onto equal-sized page frames in a physical-address space, which is typically a time-consuming process.
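A minimal sketch of such a page mapping follows. The 4096-byte page size, the dictionary used as a page table, and the function name translate are all assumptions made for illustration; actual machines implement the lookup in hardware and with different table structures.

```python
PAGE_SIZE = 4096  # hypothetical page size in bytes


def translate(logical_addr, page_table, page_size=PAGE_SIZE):
    """Map a logical address to a physical address through a page table.

    page_table maps logical page numbers to physical page-frame numbers.
    A KeyError is raised if the logical page has no mapping.
    """
    page, offset = divmod(logical_addr, page_size)
    frame = page_table[page]
    return frame * page_size + offset


# Logical pages 0 and 1 placed in non-contiguous physical frames 5 and 2.
example_table = {0: 5, 1: 2}
print(translate(10, example_table))    # offset 10 within frame 5
print(translate(4100, example_table))  # offset 4 within frame 2
```

Because each page is mapped independently, the program's pages need not occupy contiguous physical memory, at the cost of one table lookup per translation.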
In some computers, only relatively small portions of a program are brought into memory at any one time. As the program runs, the computer detects attempted accesses to pages not in memory (such accesses are called "page faults"), interrupts the program in order to load the needed page, and later resumes execution of the program. Such page-fault interrupts are time-consuming, and handling them reliably is quite complex.
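The detect, load, and resume sequence can be sketched in software as follows. This is a simplified model, not a description of any particular machine: the class names, the four-word page size, the dictionary standing in for mass storage, and the absence of any page-eviction policy are all assumptions.

```python
class PageFault(Exception):
    """Raised when an access touches a page that is not resident in memory."""

    def __init__(self, page):
        super().__init__(f"page {page} not resident")
        self.page = page


class DemandPagedMemory:
    def __init__(self, backing_store, page_size=4):
        self.backing_store = backing_store  # page number -> list of words
        self.page_size = page_size
        self.resident = {}                  # pages currently in memory
        self.fault_count = 0

    def _read_resident(self, addr):
        page, offset = divmod(addr, self.page_size)
        if page not in self.resident:
            raise PageFault(page)           # detection of the missing page
        return self.resident[page][offset]

    def read(self, addr):
        try:
            return self._read_resident(addr)
        except PageFault as fault:
            # "Interrupt": load the needed page from the backing store ...
            self.fault_count += 1
            self.resident[fault.page] = list(self.backing_store[fault.page])
            # ... then resume by retrying the access that faulted.
            return self._read_resident(addr)


store = {0: [10, 11, 12, 13], 1: [20, 21, 22, 23]}
memory = DemandPagedMemory(store)
print(memory.read(5), memory.fault_count)  # first touch of page 1 faults
print(memory.read(6), memory.fault_count)  # page 1 is now resident
```

Even in this small model the faulting access must be retried after the page is loaded, which hints at why reliable handling becomes complex when an interrupted operation has partially completed.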