Computer systems, particularly large ones such as telecommunication switching exchanges, telecommunication servers, and various large network servers, typically comprise multiple computers, or computer units, linked to each other and running in parallel. FIG. 1 illustrates a typical prior art computer system 100 comprising computers 101-106 linked to each other with message bus 110.
These large computer systems often require high reliability. For example, downtime associated with a telecommunication switching exchange needs to be minimized in order to provide acceptable quality of service. A common way to implement high reliability by way of fault tolerance is to replicate at least some of the computers, computer processes and other elements of the computer system. For example, a computer unit may be provided with a spare unit so that the computer unit and its spare unit constitute a pair, wherein the spare unit is provided with input messages identical to the input messages provided to the paired computer unit, and wherein the spare unit executes computations identical to the computations executed by the paired computer unit. However, output messages sent by the spare unit are discarded, so that the spare unit effectively makes no contribution to the operation of the computer system. Such a computer system is also known as a redundant computer system.
In the art, various functional modes are defined for indicating the contributory role of functions of a computer to the overall performance of a computer system at a given moment. In the art, such terms as “operational mode”, “operational state” or “working state” are sometimes used instead of the term “functional mode”.
Depending on whether a first computer unit is actively contributing to the operation of the computer system or whether the first computer unit has been assigned to function as a spare unit of a second computer unit, the first computer unit is considered to be in a functional mode of “working” or in a functional mode of “spare”, respectively. That is, a computer in the functional mode of “working” is actively contributing to the operation of the computer system, e.g. in a direct fashion by controlling e.g. a hardware element (other than a hardware element in the computer itself), or in an indirect fashion wherein output (e.g. messages, file modifications) produced by the computer is forwarded e.g. to another computer in the computer system, to a hard drive in the computer system (outside the computer itself), or outside the computer system. In other words, the computer in the functional mode of “working” is connected to the rest of the computer system, and it is functioning normally, and it is contributing to the overall performance of the computer system.
On the other hand, a computer in the functional mode of “spare” receives input messages identical to the input messages sent to its paired computer, and the computer in the functional mode of “spare” executes computations identical to the computations executed by its paired computer, but output messages sent by the computer in the functional mode of “spare” are discarded, resulting therefore in the computer in the functional mode of “spare” effectively having no contribution to the operation of the computer system. In other words, the computer in the functional mode of “spare” is connected to the rest of the computer system, and it is functioning normally, but it is not contributing to the overall performance of the computer system.
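The pairing of a working unit with its spare can be illustrated with a minimal sketch. The class name `ComputerUnit`, the function `deliver`, and the running-sum "computation" are hypothetical and chosen only for illustration; the sketch merely assumes that both units of a pair receive the same input and that only the output of the unit in the functional mode of “working” is forwarded.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ComputerUnit:
    """Hypothetical computer unit; 'mode' is "working" or "spare"."""
    name: str
    mode: str
    state: int = 0

    def handle(self, message: int) -> int:
        # Stand-in computation: both units of a pair execute the same steps.
        self.state += message
        return self.state


def deliver(pair: List[ComputerUnit], message: int) -> List[Tuple[str, int]]:
    """Send the same input to every unit of the pair and forward only
    the output of the unit in the functional mode of "working"."""
    forwarded = []
    for unit in pair:
        output = unit.handle(message)
        if unit.mode == "working":
            forwarded.append((unit.name, output))
        # Output of a unit in the mode "spare" is discarded here.
    return forwarded


working = ComputerUnit("unit-A", "working")
spare = ComputerUnit("unit-B", "spare")
result = deliver([working, spare], 5)
```

In this sketch the spare unit ends up with the same internal state as the working unit, which is what would allow it to take over, while contributing nothing to the forwarded output.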
As illustrated in FIG. 1, a large computer system, and particularly a redundant one, typically comprises a recovery system 120. The recovery system 120 provides various supervisory and diagnostic functions to facilitate recovering a failed computer. In addition, the recovery system controls, typically in a centralized manner, the functional modes of the computers of the computer system. In the art, such terms as “high availability system” or “supervision state management system” are sometimes used instead of the term “recovery system”. Even though FIG. 1 depicts a single recovery system 120, a recovery system is typically implemented by distributing it at least partly in the computer system. For example, a computer in the computer system may comprise low-level supervision software that supervises the status of the computer (for example, whether the operating system and application processes are functioning normally in the computer) and reports the status periodically to the main recovery system.
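The distributed supervision described above can be sketched as follows. The names `supervise`, `MainRecoverySystem`, `receive` and `failed_computers` are hypothetical, and the status checks are placeholder callables; the sketch assumes only that each computer evaluates its own status and reports it to a main recovery system.

```python
from typing import Callable, Dict, List, Tuple


def supervise(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Low-level supervision on one computer: run each status check once."""
    return {name: check() for name, check in checks.items()}


class MainRecoverySystem:
    """Hypothetical central collector of status reports."""

    def __init__(self) -> None:
        self.reports: List[Tuple[str, Dict[str, bool]]] = []

    def receive(self, computer: str, status: Dict[str, bool]) -> None:
        self.reports.append((computer, status))

    def failed_computers(self) -> List[str]:
        # A computer counts as failed if any reported check is False.
        return [c for c, status in self.reports if not all(status.values())]


main = MainRecoverySystem()
main.receive("computer-101", supervise({
    "operating_system": lambda: True,
    "application_processes": lambda: True,
}))
main.receive("computer-105", supervise({
    "operating_system": lambda: True,
    "application_processes": lambda: False,  # simulated failure
}))
```

In a real system the reports would be sent periodically over the message bus; the sketch performs a single report per computer to keep the example short.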
If a critical enough failure is detected while supervising the computer, the computer may need to be reset and restarted in a functional mode of “testing” in which various diagnostic tests are performed on the computer e.g. by diagnostic software of the recovery system in order to locate the cause of the failure. In other words, the computer in the functional mode of “testing” is not functioning normally and it is not contributing to the overall performance of the computer system, but it is still connected to the rest of the computer system.
If the cause of the failure is located, the computer is separated from the rest of the computer system to allow restoring or replacing it. In other words, the computer is now in a functional mode of “separated”. Another reason for a computer being in the functional mode of “separated” is the computer being not yet installed in the computer system. Therefore, the computer in the functional mode of “separated” is not functioning normally and it is not contributing to the overall performance of the computer system, and it is not connected to the rest of the computer system.
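The four functional modes described above differ in three properties: whether the computer is connected to the rest of the computer system, whether it is functioning normally, and whether it is contributing to the overall performance. A minimal sketch of this classification follows; the identifiers `FunctionalMode`, `ModeProperties` and `PROPERTIES` are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import NamedTuple


class FunctionalMode(Enum):
    WORKING = "working"
    SPARE = "spare"
    TESTING = "testing"
    SEPARATED = "separated"


class ModeProperties(NamedTuple):
    connected: bool             # connected to the rest of the computer system
    functioning_normally: bool
    contributing: bool          # contributes to overall system performance


# Properties of each mode as stated in the description above.
PROPERTIES = {
    FunctionalMode.WORKING:   ModeProperties(True,  True,  True),
    FunctionalMode.SPARE:     ModeProperties(True,  True,  False),
    FunctionalMode.TESTING:   ModeProperties(True,  False, False),
    FunctionalMode.SEPARATED: ModeProperties(False, False, False),
}

contributing_modes = [m.value for m in FunctionalMode if PROPERTIES[m].contributing]
```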
The above-described changeover from one functional mode to another is typically performed under the supervision of the recovery system. For example, in response to the above-described detection of the failure in the computer in the functional mode of “working”, it is the recovery system that steps in and launches the changeover from the functional mode of “working” to the functional mode of “testing” by restarting the computer. Typically, program modules for the operating system of the computer in question were loaded, under the supervision of the recovery system, into the memory of the computer when the computer was initially powered up. Therefore, to speed up the change of functional mode, the operating system is typically not reloaded when the computer is restarted.
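The changeover from “working” to “testing” can be sketched as below. The classes `Computer` and `RecoverySystem`, the method names, and the `os_load_count` counter are hypothetical; the sketch assumes only that the operating system is loaded into memory once at initial power-up and that a restart triggered by the recovery system reuses it rather than reloading it.

```python
from dataclasses import dataclass


@dataclass
class Computer:
    name: str
    mode: str = "working"
    os_loaded: bool = False
    os_load_count: int = 0  # illustration only: counts operating system loads

    def power_up(self) -> None:
        # Operating system modules are loaded into memory at initial power-up.
        self.os_loaded = True
        self.os_load_count += 1

    def restart(self, new_mode: str) -> None:
        # A restart reuses the operating system already in memory;
        # it is loaded again only if it was never loaded.
        if not self.os_loaded:
            self.power_up()
        self.mode = new_mode


class RecoverySystem:
    def on_failure(self, computer: Computer, critical: bool) -> str:
        # A sufficiently critical failure in a working computer triggers
        # a restart into the functional mode of "testing".
        if critical and computer.mode == "working":
            computer.restart("testing")
        return computer.mode


unit = Computer("computer-101")
unit.power_up()
new_mode = RecoverySystem().on_failure(unit, critical=True)
```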
The above-described large computer systems have typically evolved over a long period of time. As a result, many such computer systems traditionally utilize proprietary operating systems as opposed to commercially available operating systems, such as Linux and UNIX. For example, a family of DX 200 telecommunication switching exchanges developed by the present assignee utilizes a proprietary operating system known as DMX.
Yet, a recent trend is that of implementing new services on servers utilizing commercially available operating systems, such as Linux and UNIX. Furthermore, many standardizing forums are defining interfaces for applications and solutions for common system functions for these commercially available operating systems. The amount of protocol-related and interface-related software for commercially available operating systems is increasing rapidly.
Since the existing large legacy computer systems typically have a huge amount of software, it is not realistically possible to redesign this existing software or replace it with new software. This leaves the option of trying to find solutions that allow running proprietary legacy software in parallel with new commercially available software. This is illustrated in FIG. 1 with computer 105 running commercially available software on the Linux operating system and with computers 101, 102, 103, 104, and 106 running proprietary legacy software on the DMX operating system.
Yet, there are situations in which both proprietary legacy software and new commercially available software need to be run on the same computer. For example, the above-described supervisory and diagnostic software is typically only available as proprietary software, whereas application software run on the computer may only be available as commercial software. Such a computer therefore needs to have both a corresponding proprietary operating system and a corresponding commercially available operating system installed, so that the application software can run on the commercially available operating system and the supervisory and diagnostic software on the proprietary operating system.
As a result, multiple operating systems need to co-exist on a single computer. Prior art includes some solutions for this. For example, a dual-boot feature is known in which e.g. two operating systems are installed on a single computer, and during start-up a user is allowed to manually select which operating system to start. However, this solution has the drawback of requiring manual input from a user thus making it cumbersome to use. Furthermore, this solution has the drawback of loading the selected operating system from disk or other mass-storage media thus slowing down the loading of the selected operating system.
Prior art further includes splitting a computer into two virtual machines, which use different operating systems. An example of such a solution is a nanokernel product called OSware by Jaluna corporation (see http://www.jaluna.com). However, this solution has the drawback of complexity: the nanokernel shares the computer and its resources between two virtual machines, switching between the partitions of each operating system takes time, and communication between the two virtual machines requires extra modifications.
Therefore, the object of the present invention is to alleviate the problems described above and to introduce a solution that allows starting an operating system out of multiple operating systems automatically, i.e. without requiring input from a human user.