The invention relates generally to multiprocessor computers, and more specifically to a multiprocessor system with the capability of shutting down and replacing hardware in a partition while the remainder of the computer system remains operational.
This application hereby incorporates by reference pending patent application Ser. No. 08/971,184, filed on Nov. 17, 1997.
Multiprocessor computer systems have long been valued for the high performance they offer by utilizing multiple processors that are not individually capable of the same high level of performance as the multiprocessor system. In such multiprocessor systems tasks are divided among more than one processor, such that each processor does a portion of the computation of the system. Therefore, more than one task can be carried out at a time with each task or thread running on a separate processor, or a single task can be broken up into pieces that can be assigned to each processor. Multiprocessor systems incorporate many methods of dividing tasks among their processors, but all benefit from the ability to do computations on more than one processor simultaneously.
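As a minimal sketch of the second approach described above, in which a single task is broken into pieces that are assigned to each processor, the following Python fragment splits a list of numbers into near-equal chunks and combines the partial results. The function names and the sum-of-squares workload are hypothetical, chosen only for illustration, and worker threads stand in here for the separate processors of a real multiprocessor system.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_workers):
    """Split items into n_workers contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def sum_of_squares(chunk):
    """The portion of the computation carried out by one worker."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Divide one task among workers, then combine the partial results."""
    chunks = split_into_chunks(items, n_workers)
    # Threads are an assumption standing in for physical processors.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)
```

Other division strategies, such as running independent tasks or threads on separate processors, would replace the chunking step but keep the same overall pattern of distributing work and collecting results.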
Traditionally, multiprocessor systems were large mainframes or supercomputers with several processors mounted in the same physical unit. Modern multiprocessor systems include arrays of interconnected computers or workstations that divide large tasks among themselves in much the same way as the processors of traditional mainframe systems, and achieve similarly impressive results. Many multiprocessor computer systems have a combination of these attributes, such as a group of multiprocessor systems that are interconnected.
With multiple processors and multiple computational processes within a multiprocessor system, a mechanism must exist for allowing processors to share access to data and share the results of their computations. Centralized memory systems use a single central bank of memory that all processors can access, such that all processors can access the central memory at roughly the same speed. Other systems have distributed or independent memory for individual processors or groups of processors and provide faster access to memory that is local to each processor or group of processors, but access to data from other processors takes somewhat longer than in centralized memory systems.
The memory, whether centralized or distributed, can further be of the shared address or multiple address type. Shared address memory systems allow multiple processors to access the same memory, whether distributed or centralized, and to communicate with other processors via data stored in the shared memory. Multiple address memory incorporates separate memory for each processor or group of processors, and does not allow other processors to access this local memory. Such multiple address or local memory systems must rely on messages to share data between processors. Cache memory can be utilized in any of these memory configurations to provide faster access to data each processor is likely to need and to reduce requests on the system bus for the same commonly used data from multiple processors.
Cache in a multiple address system simply caches data from the local memory, but cache in a shared address system typically caches memory from any of the shared memory locations, whether local or remote from the processor requesting the data. The cache associated with each processor or group of processors in a distributed shared memory system likely maintains copies of data from memory local to a number of other processor nodes. Information about each block of memory is kept in a directory, which keeps track of data such as which caches have copies of the block, whether a cached copy is dirty, and other related data. The directory is used to maintain cache coherency, or to ensure that the system can determine whether the data in each cache is valid. The directory is also used to keep track of which caches hold data that is to be written, and facilitates granting exclusive write access to one processor or I/O device. After write access has been granted and a memory location is updated, the cached copy is marked as dirty.
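The bookkeeping role of such a directory can be illustrated with a short Python sketch. This is a generic, simplified directory model and not the specific mechanism of the invention: the class names, the `owner` field, and the write-back behavior on a later read are all assumptions made for illustration, and the block is marked dirty at the moment write access is granted rather than when the memory location is actually updated.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Per-block state: which caches hold a copy and whether one is dirty."""
    sharers: set = field(default_factory=set)
    owner: object = None  # cache holding exclusive write access (assumed field)
    dirty: bool = False


class Directory:
    """Hypothetical directory mapping block addresses to their entries."""

    def __init__(self):
        self.entries = {}

    def _entry(self, block):
        return self.entries.setdefault(block, DirectoryEntry())

    def read(self, block, cache_id):
        """Record that cache_id now holds a copy of the block."""
        entry = self._entry(block)
        if entry.dirty and entry.owner != cache_id:
            # Assumed behavior: the owner writes the block back to memory,
            # so the block is clean again and no cache has exclusive access.
            entry.dirty = False
            entry.owner = None
        entry.sharers.add(cache_id)

    def grant_write(self, block, cache_id):
        """Grant exclusive write access; return the caches to invalidate."""
        entry = self._entry(block)
        invalidated = entry.sharers - {cache_id}
        entry.sharers = {cache_id}
        entry.owner = cache_id
        entry.dirty = True  # simplification: marked dirty at grant time
        return invalidated
```

For example, if caches "A" and "B" have both read block 0x40 and "A" is then granted write access, `grant_write` returns `{"B"}` as the set of copies that are no longer valid, and the entry records "A" as the only holder of a dirty copy.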
As described above, processors in parallel processing or multiprocessor systems must communicate with the other processors to share computational data in order to effectively share tasks or work on the same task, and so must be configured to communicate with the number of processors or processor groups present in the multiprocessor system. Installation of a new processor or removal of an old processor requires reconfiguration of the multiprocessor system so that tasks and information are not communicated to a processor that has been removed and so that a new processor is utilized by the system. Furthermore, changing the number of processors in a system has traditionally required shutting down the entire system and restarting the system after reconfiguration so that the hardware can reset and the software can reboot across the system, and to ensure that the system does not encounter errors due to other software or hardware management problems. Also, removal of cache and management of the remaining cache, along with the loss or insertion of shared memory, have posed additional obstacles to inserting or removing processors in a running system.
What is desired is an architecture and method providing the ability to remove or insert hardware while the remainder of the system remains operational.
A distributed shared memory multiprocessor computer system is provided, which has a number of processors and is divided into partitions. Each partition has within it one or more of the processors, and may also have memory or cache and other related hardware. Although the partitions work together and communicate with one another to share computational load, each partition is independently operable and executes an independent copy of the operating system. The partitions comprise additional features as described herein, to enable removal of a partition from the operating computer system, and to enable insertion of hardware into the operating computer system.
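The organization summarized above can be pictured with a small Python model. It is only an illustrative sketch: the `Partition` and `PartitionedSystem` classes and their methods are hypothetical names, and the model captures just the bookkeeping idea that one partition can be taken out of service, or a new one brought in, while the remaining partitions stay online. It does not represent the hardware features that make this possible.

```python
class Partition:
    """One independently operable partition with its own processors."""

    def __init__(self, pid, processors):
        self.pid = pid
        self.processors = list(processors)
        self.online = True


class PartitionedSystem:
    """Hypothetical registry of the partitions that make up the system."""

    def __init__(self):
        self.partitions = {}

    def insert(self, partition):
        """Bring a new partition into the running system."""
        partition.online = True
        self.partitions[partition.pid] = partition

    def remove(self, pid):
        """Take one partition out of service; the others are untouched."""
        partition = self.partitions.pop(pid)
        partition.online = False
        return partition

    def active_ids(self):
        """Identifiers of partitions currently online, in sorted order."""
        return sorted(p.pid for p in self.partitions.values() if p.online)

    def peers_of(self, pid):
        """Partitions that `pid` may still communicate with."""
        return [other for other in self.active_ids() if other != pid]
```

In this model, removing partition 1 from a system holding partitions 0, 1, and 2 leaves 0 and 2 active, so each remaining partition can update its view of its peers without the system as a whole being restarted.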