Many scientific data processing tasks involve extensive arithmetic manipulation of ordered arrays of data. Commonly, this type of manipulation or "vector" processing involves performing the same operation repetitively on each successive element of a set of data. Most computers are organized with one central processing unit (CPU) which can communicate with a memory and with input-output (I/O). To perform an arithmetic function, each of the operands must be successively brought to the CPU from memory, the function must be performed, and the result returned to memory. However, the CPU can usually process instructions and data faster than they can be fetched from the memory unit. This inherent memory latency results in the CPU sitting idle much of the time waiting for instructions or data to be retrieved from memory. Machines utilizing this type of organization, i.e., "scalar" machines, have therefore been found too slow and hardware inefficient for practical use in large scale vector processing tasks.
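The distinction can be sketched in a short illustration; the function names and the use of Python are illustrative assumptions, not part of any machine described here.

```python
# Illustrative sketch only (hypothetical helper names): a scalar
# machine processes an ordered array one element at a time, paying
# the memory round-trip on every step, while a vector machine
# applies one operation across the whole array at once.

def scalar_add(a, b):
    # Scalar style: fetch each operand pair, add, store the result,
    # then move to the next element -- the CPU idles on each fetch.
    result = []
    for i in range(len(a)):
        result.append(a[i] + b[i])
    return result

def vector_add(a, b):
    # Vector style: conceptually one instruction over the whole
    # ordered array (modeled here with a single bulk expression).
    return [x + y for x, y in zip(a, b)]

print(scalar_add([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
print(vector_add([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
```

Both functions compute the same result; the point of the sketch is that the scalar form exposes a memory round-trip per element, while the vector form expresses the operation over the array as a unit.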
In order to increase processing speed and hardware efficiency when dealing with ordered arrays of data, "vector" machines have been developed. A vector machine is one which deals with ordered arrays of data by virtue of its hardware organization, thus attaining a higher speed of operation than scalar machines. One such vector machine is disclosed in U.S. Pat. No. 4,128,880, issued Dec. 5, 1978 to Cray, which patent is incorporated herein by reference.
The vector processing machine of the Cray patent is a single processor machine having three vector functional units specifically designed for performing vector operations. The Cray patent also provides a set of eight vector registers. Because vector operations can be performed using data directly from the vector registers, a substantial reduction in memory access requirements (and thus delay due to the inherent memory latency) is achieved where repeated computation on the same data is required.
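The register reuse described above can be sketched as follows; the class and function names are hypothetical illustrations, not details from the Cray patent.

```python
# Hypothetical sketch of vector-register reuse: one block load from
# memory into a register, then repeated computation on the register
# contents with no further memory access.
class VectorRegister:
    def __init__(self):
        self.data = []

    def load(self, memory, start, length):
        # single block fetch from shared memory into the register
        self.data = list(memory[start:start + length])

def scaled_sum(vx, vy, a):
    # repeated element-wise work reads the registers, not memory
    return [a * x + y for x, y in zip(vx.data, vy.data)]

memory = list(range(16))
vx, vy = VectorRegister(), VectorRegister()
vx.load(memory, 0, 4)   # [0, 1, 2, 3]
vy.load(memory, 4, 4)   # [4, 5, 6, 7]
print(scaled_sum(vx, vy, 2))  # [4, 7, 10, 13]
```

Once `vx` and `vy` are loaded, `scaled_sum` can be called repeatedly with different scale factors without touching memory again, which is the saving the paragraph above describes.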
The Cray patent also employs prefetching of instructions and data as a means of hiding inherent memory latencies. This technique, known as "pipelining", involves the prefetching of program instructions and writing them into one end of an instruction "pipe" of length n while previous instructions are being executed. The corresponding data necessary for execution of that instruction is also fetched from memory and written into one end of a separate data pipeline or "chain". Thus, by the time an instruction reaches the read end of the pipe, the data necessary for execution which had to be retrieved from memory is immediately available for processing from the read end of the data chain. By pipelining instructions and chaining the data, then, most of the execution time can be overlapped with the memory fetch time. As a result, processor idle time is greatly reduced, and processing speed and efficiency in large scale vector processing tasks is greatly increased.
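A back-of-the-envelope model of the overlap (the cycle counts and function names are illustrative assumptions): without pipelining, each instruction pays the full fetch latency; with a full pipe, later fetches proceed while earlier instructions execute, so only the first fetch is exposed.

```python
# Hedged cycle-count sketch: assumes a fixed fetch latency, one
# execute cycle per instruction, and a pipe that can sustain one
# fetch per cycle once filled.
def cycles_unpipelined(n_instr, fetch=4, execute=1):
    # every instruction waits for its own memory fetch to complete
    return n_instr * (fetch + execute)

def cycles_pipelined(n_instr, fetch=4, execute=1):
    # only the first fetch is exposed; later fetches overlap with
    # the execution of earlier instructions
    return fetch + n_instr * execute

print(cycles_unpipelined(100))  # 500
print(cycles_pipelined(100))    # 104
```

Under these assumed latencies, pipelining cuts 100 instructions from 500 cycles to 104, which is the sense in which "most of the execution time can be overlapped with the memory fetch time."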
Computer processing speed and efficiency in both scalar and vector machines can be further increased through the use of multiprocessing techniques. Multiprocessing involves the use of two or more processors sharing system resources, such as the main memory. Independent tasks of different jobs or related tasks of a single job may be run on the multiple processors. Each processor obeys its own set of instructions, and the processors execute their instructions simultaneously ("in parallel"). By increasing the number of processors and operating them in parallel, more work can be done in a shorter period of time.
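The idea can be sketched with a small parallel reduction; the chunking scheme and the use of a thread pool are illustrative assumptions, since real multiprocessors run hardware processors against a shared main memory.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: split one job into independent tasks, one per "processor",
# and execute the tasks simultaneously over shared data.
def partial_sum(chunk):
    return sum(chunk)

def parallel_sum(data, n_processors=4):
    # divide the ordered array into one chunk per processor
    size = (len(data) + n_processors - 1) // n_processors
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_processors) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(100))))  # 4950
```

Each worker obeys its own instruction stream over its own chunk, and the partial results are combined at the end; adding workers shortens the wall-clock time for the same total work.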
An example of a two-processor multiprocessing vector machine is disclosed in U.S. Pat. No. 4,636,942, issued Jan. 13, 1987 to Chen et al., which patent is incorporated herein by reference. Another aspect of the two-processor machine of the Chen '942 patent is disclosed in U.S. Pat. No. 4,661,900, issued Apr. 28, 1987 to Chen et al., which patent is incorporated herein by reference. A four-processor multiprocessing vector machine is disclosed in U.S. Pat. No. 4,745,545, issued May 17, 1988 to Schiffleger, and in U.S. Pat. No. 4,754,398, issued Jun. 28, 1988 to Pribnow, both of which are incorporated herein by reference. All of the above named patents are assigned to Cray Research, Inc., the assignee of the present invention.
Another multiprocessing vector machine from Cray Research, Inc., the assignee of the present invention, is the Y-MP vector supercomputer. A detailed description of the Y-MP architecture can be found in the commonly assigned U.S. Pat. No. 5,142,638, issued Aug. 25, 1992, entitled "MEMORY ACCESS CONFLICT RESOLUTION SYSTEM FOR THE Y-MP," which patent is incorporated herein by reference. In the Y-MP design, each vector processor has a single pipeline for executing instructions. Each processor accesses common memory in a completely connected topology, which leads to unavoidable collisions between processors attempting to access the same areas of memory. The Y-MP uses a collision avoidance system to minimize the collisions and clear up conflicts as quickly as possible. The conflict resolution system deactivates the processors involved and shuts down the vectors while the conflict is being resolved.
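The conflict-resolution behavior described above can be caricatured as follows; the per-cycle request model and all names here are assumptions for illustration, not the Y-MP's actual mechanism.

```python
# Hedged sketch of memory-conflict resolution: in each cycle, at most
# one processor is granted access to a given memory section; the other
# processors involved in the conflict are held (deactivated) until it
# clears.
def resolve_conflicts(requests):
    """requests: list of (processor_id, section) pairs for one cycle."""
    granted, held = [], []
    busy_sections = set()
    for proc, section in requests:
        if section in busy_sections:
            held.append(proc)          # processor waits out the conflict
        else:
            busy_sections.add(section)
            granted.append(proc)
    return granted, held

# Processors 0 and 1 collide on section 3; processor 2 is unaffected.
print(resolve_conflicts([(0, 3), (1, 3), (2, 5)]))  # ([0, 2], [1])
```

Held processors would reissue their requests on a later cycle; the point of the sketch is only that a conflict stalls the colliding processors, not the whole machine.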
Although the above-mentioned multiprocessor vector supercomputing machines greatly increase processing speed and efficiency on large scale vector processing tasks, these machines cannot be run continuously without the occurrence of hardware failures. Therefore, periodic preventive maintenance ("PM") time is scheduled, during which the operating system is shut down and diagnostics are run on the machine in an effort to prevent hardware failures before they occur. PM time, while reducing the number of hardware failures which occur during run time of the operating system and associated user tasks, also decreases the amount of time the operating system can be up and running. Today many user processing tasks are so large that the user simply cannot afford any PM time. For these users, the system simply cannot be allowed to go down. As a result, machines must be run until they break (i.e., a hardware failure occurs), and then the operating system down time must be reduced to a minimum. Those skilled in the art have therefore recognized the need for a computing system which minimizes shut down time such that the run time of the operating system can be maximized.
Another problem facing multiprocessing system designers is how to recover from a failure of a section of shared memory. If one of the processors of a multiprocessing system fails, that processor can simply be shut down and the rest of the processors in the system can continue running user tasks with only a relatively small reduction in performance. However, problems are encountered when a processor is removed from the operating system. For example, any I/O attached to that processor is also unavailable to the operating system, and the entire multiprocessor system must be brought down and reconfigured, with the I/O reassigned to the remaining processors, such that all I/O is available to the operating system.
In addition, if a portion of the shared memory fails, shutting down a processor will not solve the problem because the remaining processors will continue to see the same memory errors. Previously, the only way to recover from a shared memory failure was to shut down the operating system and associated user tasks, and run diagnostics to isolate the failure. Once the failure was located, the machine was turned off completely, and the defective module replaced.
Although the above described method can effectively locate and repair failed hardware, it significantly reduces the time that the operating system is up and running. As stated previously herein, this is an undesirable result for many users whose processing tasks require almost continuous operation of their multiprocessor systems. Those skilled in the art have therefore recognized the need for a multiprocessor computing system which requires a minimum amount of PM time and which can recover effectively from hardware failures on both the processor side and the shared memory side, all the while maximizing the time that the operating system is up and running.