1. Technical Field
The present invention relates generally to networks of computer systems, and more specifically, to a distributed operating system over a network of computer systems.
2. Related Art
An operating system (OS) is system software responsible for the control and management of computer resources. A typical OS enables communication between application software and the hardware of a computer. The OS allows applications to access the hardware and basic system operations of a computer, such as disk access, memory management, task scheduling, and user interfacing. An OS is also responsible for providing network connectivity.
Computer networking provides a mechanism for sharing files and peripheral devices among several interconnected computers. Ideally, a computer network should allow all computers and applications to have access to all the resources of the network, optimizing the collective resources. To achieve this result, distributed operating systems have been developed. A typical distributed OS, however, suffers from a variety of limitations. First, a distributed OS may be structured as a multi-layered system: one layer for the local environment, and a separate layer for the network environment. This results in two different operating systems having to be learned by developers and users. In addition, because the interfaces with the local and network layers are significantly different, an application program may be written to operate on one layer or the other, but cannot be written to operate on both. That is, network versions of application programs may not run on individual computers and stand-alone versions may not run on networks.
Additionally, network software handles client computers and servers as different machines. If a user wishes to have a central computer provide files to a number of remote computers, then the central computer must be designated as a server, and the remote computers as clients. This may limit the flexibility of the network, because server and client computers are given different abilities by the operating system. For example, it may not be possible for two computers to share files with one another because one must be designated as the server, and the other as the client. Generally, the server may not access files stored on the client.
Computer network systems have been designed and optimized to handle a specified set of resources and configurations. For example, a mainframe computer system may comprise a mainframe computer with a large memory storage area and a set of printers. Smaller terminals or computers may access this mainframe as clients in a manner specific to the network and software. Such a computer system may not have the flexibility to exploit communication developments such as the Internet.
Message passing distributed operating systems have been developed to overcome these problems. An exemplary message passing operating system is described in U.S. Pat. No. 6,697,876 to van der Veen, et al. (“van der Veen et al.”), the disclosure of which is herein incorporated by reference. van der Veen et al. describes a distributed operating system with a single level architecture that may be applied to a flexible network environment, including an internet communication link, and to a stand-alone computer. This is done by use of a message passing operating system, and by sending off-node messages to network managers that are capable of directing and receiving the off-node messages.
In addition, interprocess communication (IPC) in these systems should be reliable. Unfortunately, some prior distributed operating systems suffer transmission performance limitations dictated by their inability to (1) reliably handle transient communication failures and rapid node reboots, (2) provide a transmission protocol that adapts to link reliability, and (3) allow transmissions to occur over an arbitrary combination of media. Because nodes may often be connected through third party communication networks, such as the Internet, it may be impossible to guarantee the integrity of physical communication lines between nodes. Transient communication failures can lock client processes, wasting resources and hampering the overall performance of the system.
Therefore, a need exists for a reliable method for managing communications between nodes of a distributed message passing operating system that may improve the reliability of processing during transient communication failures and rapid node reboots, improve the performance of data transmission through a protocol that adapts to link reliability, and/or abstract media selection to allow various policies to be implemented over arbitrary combinations of communication links.