This invention relates in general to multi-processor computer systems, and in particular to a service processor that supports on-line reconfiguration of the multi-processor computer.
Prior computer platforms have been symmetric multiprocessor (SMP) arrangements in which multiple CPUs run a single copy of the operating system (OS). The OS provides time sharing services to allow multiple applications to run. However, this arrangement permits the applications to interfere with each other. For example, if the system is running an accounting application, the accounting application can allocate all the memory in the system, as well as use all the processors that the OS can allocate. Then, when some other application, for example a manufacturing application, needs processors or memory, it would not be able to allocate any for its needs, and therefore would freeze. Thus, the manufacturing application would be frozen or otherwise impacted by the accounting application. This arrangement also leaves the system vulnerable to failures, since any problem with one application could corrupt the resources for all applications.
A known solution to this problem is to separate the computer system into partitions or protected domains. These partitions are hardware separations that place resources into separate functional blocks. Resources in one block do not have direct access to resources in another block. This prevents one application from using the entire system resources, and also contains faults and errors within the block in which they occur. An example of such a system is the Sun Microsystems UE10K.
This solution presents its own problem, namely reconfiguration. The partitions, once defined, are static in nature, and cannot be readily changed. Thus, typical changes such as component replacement or system upgrades cannot be easily accomplished. Moreover, resources cannot be readily moved from one partition to another to satisfy workload balancing.
These and other objects, features and technical advantages are achieved by a system and method, which satisfy two aspects of reconfiguring a multiple partition computer system:
1. Reconfiguring the installed hardware, possibly while the various partitions continue normal operations. This aspect includes adding and removing cells and I/O from partitions which may or may not continue to run.
2. Changing the association between cells, I/O and partitions. The partitions may be able to stay running, or may have to be shut down (as for example in the case where a partition has all its cells moved to a different partition).
Several terms necessary to understand the concepts underlying the present invention are defined in this paragraph. A complex is a grouping of one or more cabinets containing cell boards and I/O, each of which can be assigned to a partition. Partitions are groupings of cell boards, with each partition comprising at least one cell. Each partition runs its own copy of system firmware and the OS. Each cell board can comprise one or more system CPUs together with system memory. Each cell board can optionally have I/O connected to it. Each partition must have at least enough I/O attached to its cell board(s) to be able to boot the OS. I/O (Input/Output subsystem) comprises an I/O backplane into which I/O controllers (e.g. PCI cards) can be installed, and the I/O controllers themselves. Cell boards in each cabinet are plugged into a backplane which connects them to the fabric. The fabric is a set of ASICs which allow the cells in a partition to communicate with one another, potentially across cabinet boundaries.
Cell boards are connected to I/O controllers in such a way that software or firmware running on a partition can operate the I/O controllers to transfer data between system memory and external disks, networks, and other I/O devices. One particular type of I/O controller is special: the Core I/O controller, which provides the console interface for the partition. Every partition must have at least one Core I/O controller installed. A complex has at least one service processor.
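The terms defined above can be pictured as a small data model. The following is a minimal sketch only, not the structure used by the described system: the class names, the fields (such as `cpus` and `memory_gb`), and the `problems` helper are hypothetical and chosen purely for illustration. It encodes two of the stated rules, namely that a partition comprises at least one cell and must have at least one Core I/O controller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IOController:
    name: str
    is_core_io: bool = False  # a Core I/O controller provides the partition console


@dataclass
class Cell:
    cell_id: int
    cpus: int
    memory_gb: int
    # I/O is optional for an individual cell board
    io_controllers: List[IOController] = field(default_factory=list)


@dataclass
class Partition:
    partition_id: int
    cells: List[Cell] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the rule violations found for this partition (hypothetical helper)."""
        issues = []
        if not self.cells:
            issues.append("partition has no cells")
        has_core_io = any(
            ctrl.is_core_io for cell in self.cells for ctrl in cell.io_controllers
        )
        if not has_core_io:
            issues.append("partition has no Core I/O controller")
        return issues
```

For example, a partition built from one cell carrying a Core I/O controller reports no problems, while an empty partition reports both violations.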
Thus, with a multiple partition system, multiple copies of the OS are running independently of each other, each in a partition that has its own cell boards with processors and memory and connected I/O. This provides isolation between different applications. Consequently, a fatal error in one partition would not affect the other partitions. Partitioning also permits server consolidation by eliminating older servers and migrating their applications onto their own partitions.
A hierarchy of components comprising a configuration tool running under control of the OS, the OS itself, system firmware, and a service processor is used to manage the partitions. The user interacts with the configuration tool to instruct the system to reconfigure itself. The configuration tool interacts with the OS, which interacts with system firmware, which interacts with the service processor to effect the desired reconfiguration.
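The layering described above, in which each component talks only to the layer directly beneath it, can be sketched as a simple delegation chain. This is an illustrative sketch under stated assumptions: the class and method names (`reconfigure`, `forward`, `apply`) and the string-valued request are hypothetical, and the actual interfaces between the layers are not specified here.

```python
class ServiceProcessor:
    """Bottom layer: the component that finally effects the change."""

    def __init__(self):
        self.applied = []  # record of requests that reached this layer

    def apply(self, request):
        self.applied.append(request)
        return "applied: " + request


class SystemFirmware:
    def __init__(self, service_processor):
        self.service_processor = service_processor

    def forward(self, request):
        # Firmware interacts only with the service processor.
        return self.service_processor.apply(request)


class OperatingSystem:
    def __init__(self, firmware):
        self.firmware = firmware

    def forward(self, request):
        # The OS interacts only with system firmware.
        return self.firmware.forward(request)


class ConfigurationTool:
    def __init__(self, operating_system):
        self.operating_system = operating_system

    def reconfigure(self, request):
        # The user-facing tool interacts only with the OS.
        return self.operating_system.forward(request)
```

A user request entered at the `ConfigurationTool` thus passes through the OS and system firmware before the service processor records and applies it.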
A network of micro-controllers connected to the service processor, via a communications link, provides the service processor with information on each of the different cells, as well as a pathway to receive requests for configuration changes, and a pathway to command changes in the different cells or I/O. The combination of the service processor and the micro-controllers supports dynamic changes to the partition configuration. For example, cell boards can be removed from a partition while the partition is on-line, and additional cell boards can be added to expand the system. Also, resources can be moved between the partitions. For example, a cell board can be reassigned from one partition to another partition while the system is on-line. The service processor can control the power for the cell boards and I/O subsystems which comprise the partitions.
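Reassigning a cell board from one partition to another can be illustrated as an update to a cell-to-partition mapping. The sketch below is an assumption-laden illustration, not the described mechanism: the function name `reassign_cell` and the dictionary representation are hypothetical. It also reports any partition left without cells, since such a partition would have to be shut down.

```python
def reassign_cell(assignment, cell_id, target_partition):
    """Return (new_assignment, emptied_partitions) after moving one cell.

    `assignment` maps cell id -> partition id (hypothetical representation).
    The input mapping is not modified.
    """
    if cell_id not in assignment:
        raise KeyError("unknown cell: %r" % (cell_id,))

    source_partition = assignment[cell_id]
    updated = dict(assignment)
    updated[cell_id] = target_partition

    # A partition that has lost its last cell can no longer keep running.
    emptied = set()
    if source_partition != target_partition and source_partition not in updated.values():
        emptied.add(source_partition)
    return updated, emptied
```

Moving the only cell of a partition returns that partition in the emptied set, whereas moving one of several cells leaves both partitions populated.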
Another aspect of the inventive system is the complex profile. The complex profile is a data structure containing information about the system configuration, such as which cells are assigned to particular partitions. The complex profile is initially created by the service processor, is modified by the configuration tool under control of the user, and is then downloaded to the service processor. Duplicate copies of the complex profile are stored throughout the system, e.g. in the different cells. Thus, if the service processor should fail, the system can still operate using the copies of the complex profile dispersed through the system.
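The idea of keeping duplicate copies of the complex profile in the cells can be sketched as follows. This is a minimal illustration under several assumptions: the profile is modeled as a JSON-serializable dictionary, each cell's storage is a plain dictionary, and copies are compared by a SHA-256 fingerprint. None of these choices, nor the function names, are taken from the described system.

```python
import copy
import hashlib
import json


def fingerprint(profile):
    """Stable digest of a profile, used only to compare copies (assumed scheme)."""
    encoded = json.dumps(profile, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def distribute(profile, cell_stores):
    """Store an independent copy of the profile in every cell."""
    for store in cell_stores.values():
        store["profile"] = copy.deepcopy(profile)


def all_cells_agree(cell_stores):
    """True only if every cell holds a copy and all copies are identical."""
    if not cell_stores or any("profile" not in s for s in cell_stores.values()):
        return False
    digests = {fingerprint(s["profile"]) for s in cell_stores.values()}
    return len(digests) == 1


def recover_profile(cell_stores):
    """Rebuild the profile from the cells, e.g. if the service processor is lost."""
    if not all_cells_agree(cell_stores):
        return None
    any_store = next(iter(cell_stores.values()))
    return copy.deepcopy(any_store["profile"])
```

In this sketch, once `distribute` has run, `recover_profile` returns the same profile from the cell copies alone, and it returns `None` if the copies have diverged.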
Therefore it is a technical advantage of the present invention to allow online replacement of components, including core I/O, cell boards, I/O backplane, service processor board, and cabinets.
It is another technical advantage of the present invention to allow multiple redundant components, e.g. core I/O boards, in a partition, so that if one component fails the operating system can switch to an alternate.
It is a further technical advantage of the present invention to allow the system to power-up and boot even if the service processor is not installed or is not functional.
It is a further technical advantage of the present invention to allow partitions to power-up and boot even if not all of the cells in a partition are installed or functional.
It is a still further technical advantage of the present invention to allow initial creation of the partitions according to a predetermined arrangement, thereby permitting the OS to run.
It is a still further technical advantage of the present invention to allow distribution of the partition definitions or complex profile to all cells, and to ensure that all cells have the same partition definition.
It is a still further technical advantage of the present invention to have an interface to applications which allow the user to reconfigure the system.
It is a still further technical advantage of the present invention to allow communication of configuration information between partitions, even though the partitions are not able to communicate directly.
It is a still further technical advantage of the present invention to allow configuration tools to run on any one of the partitions, thereby saving the costs associated with providing a separate system to do reconfiguration.
It is a still further technical advantage of the present invention to provide an interface between the configuration tool and the operating system, and between the operating system and system firmware, and between system firmware and the service processor, such that each of the subsystems can cooperatively participate in system reconfiguration, while remaining online and fully functional.
It is a still further technical advantage of the present invention to allow the user to update any piece of firmware in the system to bring all modules to a compatible level of firmware, thereby allowing reconfiguration operations to involve modules that initially have incompatible firmware installed on them.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.