1. Field of the Invention
The invention relates to computer systems in critical environments. More particularly, the invention relates to redundant computer clusters and to facilitating switchovers therein by shifting storage access between computer units, for example server nodes.
2. Description of the Related Art
Reliability is an important factor in communication networks and, in general, in any other critical system environment. It is important that a continuous and uninterrupted service experience is provided to end-users despite the fact that there may be failures in computer hardware and software. It is important that interruptions in transactions are minimized. Examples of transactions include data communication sessions and database transactions. Further, it must be possible to perform management actions in computer systems without affecting the end-user experience. For example, it must be possible to activate, deactivate, add, remove and replace subsystems as transparently and as quickly as possible. In critical environments, subsystems comprising hardware and/or software are replicated so that there are backup subsystems ready to replace subsystems that become faulty. Such subsystems are often hot-swappable. Subsystems may be replicated so that there is one backup subsystem for each active subsystem, or so that there is one backup subsystem for a group of subsystems. By a subsystem is meant in this case a set comprising at least one hardware unit and/or a set comprising at least one software component. A hardware unit can be, for example, a processor unit, an interface card or a communication link. A software component can be, for example, a group of processes or a group of threads executing in a processor unit. A subsystem may also comprise both software and hardware. For example, a communication link subsystem may comprise a line interface card and a set of processes executing in an associated processor unit. Typically, there are a number of similar line interface cards, each of which comprises a subsystem including the line interface card hardware and software executing in the processor unit to which the line interface card is associated. Typically, the backup subsystem, that is, the replica in the case of a software process, executes in a different computer unit than its active pair process.
There is a consortium called the Service Availability Forum (SA Forum), which is developing two layers of standard carrier-grade interfaces. A system is said to be carrier grade if it has the ability to provide uninterrupted service without loss of service continuity and delivery. The SA Forum specifications comprise an application interface and a platform interface. The application interface provides access to a standard set of tools that application software may use in order to distribute its processing over multiple computing elements. The tools respond to failures of those elements without loss of service continuity and delivery to any user. The tools are provided through management middleware that conforms to the application interface specification. The platform interface is used to access the operating system level. Its purpose is to hide operating system level differences across different platforms. In the SA Forum specification concepts there are Service Groups (SGs), each of which comprises at least one Service Unit (SU). In turn, each SU comprises at least one component. A component may be a software process or a thread. A component may also have hardware units associated with it. In other words, an SU is a subsystem, which can be either an active subsystem or a redundant subsystem acting as a replacement for the active subsystem. An SU is replicated in the sense that in an SG there is at least one SU in active state and at least one SU in standby state. The SU in standby state acts as a backup replica of the SU in active state. If the active SU fails or is to be taken down for maintenance, the replica SU becomes active and takes over the tasks of the SU that failed or was taken down. The concepts from the SA Forum specifications are used herein for illustrative purposes. They may be replaced by other equivalent concepts. The invention and its embodiments are thus not limited to systems and implementations that are explicitly SA Forum specification compliant.
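The SG/SU state model described above can be sketched as a small state machine. The class and method names below are hypothetical illustrations, not part of any SA Forum API.

```python
# Illustrative sketch of the SG/SU state model described above.
# Class and method names are hypothetical, not taken from any SA Forum API.

class ServiceUnit:
    def __init__(self, name, state="standby"):
        self.name = name
        self.state = state  # e.g. "active", "standby", "not present"

class ServiceGroup:
    """An SG comprises at least one active SU and at least one standby SU."""
    def __init__(self, units):
        self.units = units

    def switchover(self):
        """Promote a standby SU when the active SU fails or is taken down."""
        active = next(u for u in self.units if u.state == "active")
        standby = next(u for u in self.units if u.state == "standby")
        active.state = "not present"   # failed or taken down for maintenance
        standby.state = "active"       # the replica takes over the tasks
        return standby

sg = ServiceGroup([ServiceUnit("SU-1", "active"), ServiceUnit("SU-2", "standby")])
new_active = sg.switchover()
print(new_active.name)  # SU-2 is now the active SU
```

A switchover at computer unit level, as in FIG. 1, would simply apply the same promotion to every SG that has its active SU in the failed unit.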
Reference is now made to FIG. 1, which illustrates the aforementioned SA Forum specification concepts. In FIG. 1 there is a redundant two-unit computer cluster having computer units 110 and 112. The computer units are connected using a communication channel 104. Communication channel 104 may be, for example, an Ethernet segment or a PCI bus. There are three SGs, namely SGs 140-144. In each SG there are two SUs: in SG 140 there are SUs 120 and 130, in SG 142 SUs 122 and 132, and in SG 144 SUs 124 and 134. SUs 120, 132 and 124 are in active state and SUs 130, 122 and 134 are in standby state. For each active SU, there is a spare SU in standby state. For instance, if there is a switchover in SG 142 due to some failure or management action in SU 132, SU 122 becomes active and takes over the tasks of SU 132. The state of SU 132 becomes “standby”, “not present” or any other state that reflects the situation in SU 132. If a failure occurs at the computer unit level and computer unit 110 fails, SUs 130-134 in computer unit 112 must take the place of their peer SUs 120-124 in the failed computer unit 110.
In redundant computer clusters, for example in active-standby redundancy, redundant applications will usually access a given shared data storage resource only via one unit, i.e. node, at a time because of software limitations. By a data storage resource is meant in this context, for example, a File System (FS), a Software RAID (Redundant Array of Independent Disks) or logical volumes of Logical Volume Management (LVM). By data storage access establishment is meant in this context, for example, File System (FS) mounting, Software RAID startup or logical volume deployment of Logical Volume Management (LVM). It should be noted that, for example, when a Software RAID is started up in a unit, this involves only the establishment, at operating system level, of readiness to read from or write to the Software RAID. The file systems have usually been created earlier, so it is not a question of Software RAID set-up. Read-write access to a data storage resource can be established from only one unit at a time in order to avoid, for example, a file system crash or any incoherent state of the data storage resource. By read-write access to a data storage resource is meant an access that allows the entity that established the access to modify data in the data storage resource. If a unit has established read-write access to a given data storage resource, usually no other unit may establish even read access to the data storage resource. This is particularly the case in file system read-write mounting.
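The access rule described above, in which a single read-write owner excludes all other access while plain read access may be shared, can be sketched as follows. The `AccessManager` class and unit names are hypothetical illustrations, not an existing API.

```python
# Hypothetical sketch of the access rule described above: a data storage
# resource may have one read-write owner, which excludes all other access;
# read-only access may be shared when no read-write owner exists.

class AccessManager:
    def __init__(self):
        self.rw_owner = None      # unit holding read-write access, if any
        self.readers = set()      # units holding read-only access

    def acquire_rw(self, unit):
        if self.rw_owner is not None or self.readers:
            return False          # resource already accessed elsewhere
        self.rw_owner = unit
        return True

    def acquire_ro(self, unit):
        if self.rw_owner is not None:
            return False          # a read-write owner excludes even read access
        self.readers.add(unit)
        return True

    def release(self, unit):
        if self.rw_owner == unit:
            self.rw_owner = None
        self.readers.discard(unit)

mgr = AccessManager()
assert mgr.acquire_rw("unit-110")      # first unit gets read-write access
assert not mgr.acquire_ro("unit-112")  # no other unit may even read
mgr.release("unit-110")
assert mgr.acquire_ro("unit-112")      # read access possible once released
```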
In the case of read access establishment, only reading of the data storage resource is allowed for the entity that performed the access establishment.
From the user's point of view, a Software RAID behaves like any block device, such as a partition on a single disk. In other words, it is a virtual device. File systems may be created onto a Software RAID as onto any other block device; in other words, it may be formatted. Examples of file systems are ext2 and Reiserfs, familiar from the Linux operating system. In Linux, the mounting of a given file system comprises attaching the directory structure contained therein to the directory structure of the computer performing the mounting. The file system is mounted at a specified mount point, which is a certain subdirectory within the directory structure. During the mounting, file system directory structures retrieved from the storage volume may be cached, at least partly, by the operating system in the computer's volatile memory. Some other file system information may also be retrieved from the storage volume and cached during the mounting, for example disk space allocation information. The mounting of file systems is essentially similar in any present operating system, such as Microsoft Windows. The differences pertain mostly to the mechanisms by which files on the mounted file systems are identified. For instance, instead of attaching them to a single directory tree, in Windows mounted file systems are identified using drive letters such as A, D, E, F and so on. Usually the letter C denotes the local hard disk drive.
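In Linux, the mounting step described above is typically performed with the `mount` command. The helper below merely builds the command line for illustration; the device and mount point names are hypothetical examples, and the commands are not executed.

```python
# Builds (without executing) the Linux mount/umount command lines for the
# mounting step described above. The device and mount point names are
# hypothetical examples.

def mount_cmd(device, mount_point, fs_type="ext2", read_only=False):
    cmd = ["mount", "-t", fs_type]
    if read_only:
        cmd += ["-o", "ro"]       # read mount: no modification allowed
    return cmd + [device, mount_point]

def umount_cmd(mount_point):
    return ["umount", mount_point]

# Attach the file system on /dev/sdb1 at the mount point /mnt/v1:
print(" ".join(mount_cmd("/dev/sdb1", "/mnt/v1")))
# mount -t ext2 /dev/sdb1 /mnt/v1
```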
By mounting is meant herein that the file system to be mounted is made ready for general file access operating system services, such as open, read, write and close, in the system that performed the mounting. The file access operating system services are such that they operate in terms of individual identifiable files instead of bulk secondary storage.
It is possible for multiple units to access a given file system if they merely read-only mount it. In practice, an active unit or active software entity, that is, an active subsystem, will be the one accessing the file system and owning its read-write mount. Similarly, in the case of a Software RAID, an active unit or active software entity will be the one establishing and owning read-write access to the Software RAID. In SA Forum terminology this means that the active SU owns the read-write access to the data storage resource, that is, it owns, for example, the file system mount, the Software RAID access or the LVM access. If the active entity, i.e. the active SU, fails, or if the operator has to switch the active-standby roles, for example due to software upgrades or any other management actions, the data storage resource access has to be shifted safely from the old SU to the new SU, that is, usually from a first unit to a second unit.
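The safe shift described above amounts to a strict ordering constraint: the old unit must release its read-write access before the new unit establishes it. A minimal sketch follows; the release and establish callbacks are hypothetical placeholders for, e.g., file system unmounting and mounting or Software RAID stop and start.

```python
# Sketch of the ordering constraint described above: the old unit must
# release its read-write access before the new unit establishes it, so that
# no two units ever hold read-write access simultaneously. The callbacks
# are hypothetical placeholders for e.g. umount/mount or RAID stop/start.

log = []

def release_access(unit):
    log.append(("release", unit))    # e.g. unmount file system on old unit

def establish_access(unit):
    log.append(("establish", unit))  # e.g. read-write mount on new unit

def shift_storage_access(old_unit, new_unit):
    release_access(old_unit)         # old SU gives up read-write access first
    establish_access(new_unit)       # only then may the new SU take it over

shift_storage_access("unit-110", "unit-112")
assert log.index(("release", "unit-110")) < log.index(("establish", "unit-112"))
```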
Reference is now made to FIG. 2, which illustrates the policy discussed above. In FIG. 2 there is a redundant two-unit computer cluster having computer units 110 and 112. The computer units are connected using a communication channel 104, which is a local area network (Ethernet). The computer units are connected to a disk storage unit 200 using a fiber channel 202, which provides high-bandwidth access. The disk storage unit has volumes 250, 252 and 254. The volumes have been assigned volume labels V1, V2 and V3, respectively. In this case a volume is an abstraction that may in practice be a hard disk drive, a group of hard disk drives or a partition within a hard disk drive comprising a specified number of cylinders from that hard disk drive. A volume may also be a RAID logical volume. The concept of a volume represents a block of storage that appears logically contiguous and can be accessed using standard mechanisms. A file system may be created onto a volume. The file system may be, for example, a Linux ext2 or Reiserfs file system. Other examples of file systems are NTFS and FAT32 from the Microsoft Windows operating system. The file system comprises the directory, file and access data structures and their storage formats on the volume. File systems 260, 262 and 264 have been created onto volumes 250, 252 and 254, respectively. During the file system creation step, the file system data structures are allocated and created on the volume. In the case of FIG. 2 the file systems 260, 262 and 264 are Linux ext2 file systems. Computer units 110 and 112 operate under operating systems 220 and 222, respectively. Operating system 220 has read-write mounted file system 260 and read mounted file system 264. This is illustrated in FIG. 2 using the directions of the arrows between the operating systems and the file systems. Operating system 222, in turn, has read-write mounted file system 262 and read mounted file system 264.
This reflects the principle that if a single unit read-write mounts a given file system, no other unit may mount it. If a given volume is only read mounted by each mounting unit, several units may mount it. If an active SU executing in computer unit 110 should move to standby state and a passive SU executing in computer unit 112 should become active, a problem arises if that SU needs read-write access to file system 260. When the backup SU executing in computer unit 112 enters active state, file system 260 remains unmounted on computer unit 112 and the SU has no possibility to read from or write to file system 260. A problem of a solution such as the one illustrated in FIG. 2 is that the file system mounting occurs at native operating system level, e.g. at Linux level. If there are switchovers at SG level, where a standby SU must take the place of an active SU, the operating system may not be affected or informed. Therefore, such SG level switchovers are transparent at the operating system level.
In order to overcome the problem mentioned above, some prior art solutions can be applied. One such solution is to use file systems 260 and 262 from computer units 110 and 112 via the Network File System (NFS). With NFS it is possible for both computer units to access both file systems in read-write mode simultaneously. However, only separate files within the file system are simultaneously accessible. Whenever a user opens a given file for writing, it becomes read-only accessible to other simultaneous users.
Reference is now made to FIG. 3, which illustrates the use of a network file system such as the NFS. In FIG. 3 there is a redundant two-unit computer cluster having computer units 110 and 112. The computer units are connected using a communication channel 104. Communication channel 104 may be, for example, an Ethernet segment or a PCI bus. Computer units 110 and 112 are connected to a file server 300 running the NFS. File server 300 is connected to a disk storage unit 200 using a fiber channel. Disk storage unit 200 has file systems 260 and 262 as in FIG. 2. File server 300 runs the NFS, which enables remote clients such as computer units 110 and 112 to establish read-write access to file systems actually mounted only on file server 300. NFS mounting imitates local mounting in the remote clients. It is thus possible for computer unit 110 to perform read-write NFS mounts to both file systems 260 and 262. There are now read-write NFS mounts 320 and 322 from computer unit 110 to file systems 260 and 262. Similarly, there are read-write NFS mounts 324 and 326 from computer unit 112 to file systems 260 and 262.
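NFS mounts such as 320-326 described above would be established on each client with ordinary mount commands naming the server's exported path. The sketch below only builds the command lines for illustration; the server name "fileserver" and the export and mount point paths are hypothetical.

```python
# Builds (without executing) NFS mount command lines like those that would
# create the read-write NFS mounts described above. The server name
# "fileserver" and the export/mount point paths are hypothetical examples.

def nfs_mount_cmd(server, export_path, mount_point):
    # NFS mounting imitates local mounting in the remote client
    return ["mount", "-t", "nfs", f"{server}:{export_path}", mount_point]

cmds = [nfs_mount_cmd("fileserver", "/export/fs260", "/mnt/fs260"),
        nfs_mount_cmd("fileserver", "/export/fs262", "/mnt/fs262")]
for c in cmds:
    print(" ".join(c))
```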
The drawback of the prior art NFS mount based solution such as the one illustrated in FIG. 3 is poor performance. The use of file server 300 and the NFS slows down access to file systems 260 and 262 significantly compared to the case where computer units 110 and 112 interface the disk storage unit 200 directly and are able to move large sequences of disk blocks to and from disk storage unit 200 without another computer unit and its intervening network file system software. Additionally, the disk access has to be shifted safely from the old SU to the new SU so that there is no overlapping moment when the units access the same logical storage entity, e.g. a file system, simultaneously in read-write access mode. In this way file system consistency can be retained. Yet another drawback of the prior art NFS mount based solution such as the one illustrated in FIG. 3 is that file server 300 becomes a single point of failure in the system. If file server 300 is replicated, the same problems arise as in FIG. 2, because the replicas of file server 300 would need simultaneous read-write mounting of file systems 260 and 262. Therefore, the situation is not essentially improved.