Telecommunication systems process telephone services. A call processing server may be used, for example, to process calls for call center operations. Reliability of such telecommunication systems is often critical. One way of achieving reliability and fault tolerance in telecommunication systems is by duplication of call processing servers, which can be expensive. For reasons of cost, call processing servers often include simplex servers (i.e., single processor/single disk servers).
When access is lost to a disk drive (or other secondary data storage device) in simplex servers, it can lead to server failure and complete loss of telecommunication access. The server failure causes the server to stop call processing and may lead to an entire call center being shut down until the server is repaired and placed back into service. Loss of call processing services at inopportune times can have dire consequences on a call processing center. During high call volume periods, a loss of service can result in irritated customers and lost sales opportunities.
A processor may lose access to a disk drive as a result of any number of conditions that affect the chain of functional units involved in accessing a disk. For example, the disk itself may become faulty, its controlling disk file controller may fail, the channel connecting the disk memory subsystem to the processor may fail, or the direct memory access controller or disk driver that is attempting the access may fail. The failure of access may be either total failure or just an unacceptably high error rate.
U.S. Pat. No. 4,608,688 issued to Hansen et al. discloses protecting against a loss of access to both duplicated system-essential disk drives in a duplex processing system in which processes are swapped between a main memory and a pair of duplicated disks. After one of the duplicated (redundant) disks fails, the system identifies processes designated as essential to the operation of the system that are not resident in the main memory and swaps these processes into the main memory from the second duplicated disk. The system then locks all essential processes into the main memory so that they will not be swapped out of memory. In the event that the second duplicated disk also fails, the essential processes remain accessible to the processing system even upon the loss of access to both of the disks. The processing system kills non-essential processes and continues processing using only the essential processes. Where both of the duplicated disks of the system disclosed in the '688 patent fail simultaneously or near simultaneously, however, the essential processes are not locked into memory and the system would fail to operate.
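The sequence described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the implementation of the '688 patent: the class names, the process names, and the rule that a non-resident essential process is lost once no disk remains are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    essential: bool
    resident: bool = False  # currently held in main memory
    locked: bool = False    # exempt from being swapped out
    alive: bool = True


@dataclass
class DuplexSystem:
    processes: list
    disks_up: int = 2

    def lose_disks(self, count=1):
        """Model loss of access to `count` of the duplicated disks."""
        self.disks_up = max(0, self.disks_up - count)
        if self.disks_up > 0:
            # A disk survives: swap in missing essential processes, then lock.
            for p in self.processes:
                if p.essential and p.alive:
                    p.resident = True
                    p.locked = True
        else:
            # No disk left: non-essential processes are killed, and an
            # essential process that never reached memory cannot be loaded.
            for p in self.processes:
                if not p.essential or not p.resident:
                    p.alive = False

    def operational(self):
        return all(p.alive for p in self.processes if p.essential)


def make_system():
    # Hypothetical process set for illustration only.
    return DuplexSystem([
        Process("call_router", essential=True, resident=True),
        Process("trunk_signaling", essential=True, resident=False),
        Process("report_generator", essential=False, resident=True),
    ])


sequential = make_system()
sequential.lose_disks(1)   # first disk fails: essentials swapped in and locked
sequential.lose_disks(1)   # second disk fails later

simultaneous = make_system()
simultaneous.lose_disks(2)  # both disks fail before any locking can occur

print(sequential.operational())    # True
print(simultaneous.operational())  # False
```

In the sequential case the essential processes survive because they were locked while one disk was still readable. In the simultaneous case the non-resident essential process can never be swapped in.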
The architecture of Hansen et al. fails to address a number of problems.
For example, it can fail to lock into memory processes that are critical or essential to providing continued call processing functionality. As will be appreciated, the difference between critical and noncritical processes is that the server is capable of continuing to provide specified features/services without access to noncritical processes but is incapable of providing the specified features/services without access to critical processes. Telecommunication software includes not only the primary call processing software, such as Avaya Inc.'s Communication Manager™, but also various software products from multiple vendors providing some of the same or a number of additional telephony features. Although the vendor of the primary call processing software can designate the processes critical to the continued provision of call processing capabilities in the event of a disk failure, determining the criticality of processes in software of other vendors can be difficult at best. Such software often is not configured to identify which of its component processes are critical. In the event of a disk-related failure, the operating system, using first-party locking techniques (in which the application or component process seeking to be locked requests the operating system to place it in a lock state), can typically lock into memory each of the critical processes in the primary call processing software but is typically unable (without alteration of source code) to determine the set of critical processes for software of other vendors.
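As an illustration of first-party locking, the following sketch shows a process asking the operating system to lock its own pages into memory through the POSIX mlockall() call, reached here via ctypes. It is offered as an assumption about how such a request might look on a Linux host, not as the mechanism used by any product named above. The function name is hypothetical, and the flag values are those used by Linux on x86 and ARM and may differ elsewhere.

```python
import ctypes
import os

# Flag values as defined by Linux on x86 and ARM (assumption: other
# platforms may use different values).
MCL_CURRENT = 1  # lock pages that are mapped now
MCL_FUTURE = 2   # also lock pages mapped later


def lock_self_into_memory():
    """Ask the OS to keep the calling process's pages in main memory.

    Returns True on success and False if the call is unavailable or is
    refused (for example, when the locked-memory limit is too low).
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        mlockall = libc.mlockall
    except (OSError, TypeError, AttributeError):
        return False  # no usable mlockall on this platform
    if mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        print("mlockall refused:", os.strerror(ctypes.get_errno()))
        return False
    return True
```

Because the request must originate inside the process being locked, software from another vendor would need such a call added to its own source code, which is the limitation noted above.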
It can also fail to effect locking in a simplex server configuration. In Hansen et al., locking does not occur until loss of at least one of the duplicated secondary data storage devices. Upon loss of access to one of the devices, processes designated as essential to the system's operation and not resident in the main memory are swapped into the main memory from the other, still accessible, secondary data storage device. All essential processes are then locked into the main memory to prevent their removal therefrom. In contrast, a simplex server configuration commonly has only one secondary data storage device; it is not duplicated. Waiting until the device is inaccessible to effect critical process swapping and locking into main memory could prevent the system from remaining operational, because no accessible device remains from which to swap in the critical processes.
It can further fail to prevent a process from becoming “hung” in the event of loss of secondary data storage disk access. As will be appreciated, a process makes read and write requests to an Integrated Drive Electronics (IDE) driver for secondary disk access, and the IDE driver provides the request to the secondary disk. If the secondary disk is inaccessible, the IDE driver will futilely wait for an acknowledgment from the disk that the request is completed and thereby fail to respond to the process making the read or write request. As a result, the requesting process will be hung indefinitely and cannot be terminated.
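The hang described above can be modeled in a few lines. The sketch below is a user-space simulation, not driver code: the class and function names are hypothetical, and the timeout is an assumption introduced here so that the example terminates. In the scenario described above the wait has no such bound.

```python
import threading


class DiskRequest:
    """One read or write request awaiting acknowledgment from the disk."""

    def __init__(self):
        self._done = threading.Event()

    def acknowledge(self):
        self._done.set()

    def wait(self, timeout_s=None):
        """Block until acknowledged; timeout_s=None waits indefinitely."""
        return self._done.wait(timeout_s)


def submit(request, disk_accessible):
    """Hypothetical driver entry point: only an accessible disk replies."""
    if disk_accessible:
        threading.Timer(0.01, request.acknowledge).start()


def read_block(disk_accessible, timeout_s=0.2):
    request = DiskRequest()
    submit(request, disk_accessible)
    # With timeout_s=None and an inaccessible disk, this call would never
    # return, which mirrors the hung process described above.
    if not request.wait(timeout_s):
        raise TimeoutError("disk did not acknowledge the request")
    return "ok"
```

Calling read_block(True) returns once the simulated disk acknowledges, whereas read_block(False) raises TimeoutError after the bounded wait instead of blocking forever.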