The majority of Internet outages are directly attributable to software upgrade issues and software quality in general. Mitigation of network downtime is a constant battle for service providers. In pursuit of "five 9's availability", or 99.999% network up time, service providers must minimize network outages due to equipment (i.e., hardware) failures and all too common software failures. Service providers incur downtime not only due to failures but also for upgrades that deploy new or improved software or hardware, or software or hardware fixes or patches needed to deal with current network problems. A network outage can also occur after an upgrade has been installed if the upgrade itself includes undetected problems (i.e., bugs) or if the upgrade causes other software or hardware to have problems. Data merging, data conversion and untested compatibilities contribute to downtime. Upgrades often result in data loss due to incompatibilities with data file formats. Downtime may occur unexpectedly days after an upgrade due to lurking software or hardware incompatibilities. Often, the upgrade of one process results in the failure of another process. This is often referred to as regression. Sometimes one change can cause several other components to fail; this is often called the "ripple" effect. To avoid compatibility problems, multiple versions (upgraded and not upgraded versions) of the same software are not executed at the same time.
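To make the "five 9's" target concrete, the following minimal sketch converts an availability percentage into a yearly downtime budget. The function name and the choice of an average 365.25-day year are assumptions made for illustration only; under them, 99.999% availability leaves roughly 5.26 minutes of downtime per year.

```python
# Illustrative arithmetic only; the names below are hypothetical.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year of 365.25 days


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows {budget:.2f} minutes of downtime per year")
```

Because the whole yearly budget is only a few minutes, a single reboot for an upgrade can consume most or all of it.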
Most computer systems are based on inflexible, monolithic software architectures that consist of one massive program or a single image. Though the program includes many subprograms or applications, when the program is linked, all the subprograms are resolved into one image. Monolithic software architectures are chosen because writing subprograms is simplified: the locations of all other subprograms are known, so straightforward function calls between subprograms can be used. Unfortunately, the data and code within the image are static and cannot be changed without changing the entire image. Such a change is termed an upgrade and requires creating a new monolithic image that includes the changes and then rebooting the computer to cause it to use the new image. Thus, upgrading, patching or modifying the program requires that the entire computer system be shut down and rebooted. Shutting down a network router or switch immediately affects the network up time or "availability". To minimize the number of reboots required for software upgrades and, consequently, the amount of network down time, new software releases to customers are often limited to a few times a year at best. In some cases, only a single release per year is feasible. New software releases are also limited to a few times a year by the amount of testing required to release a new monolithic software program. As the size and complexity of the program grow, the amount of time required to test it and the size of the regression matrix used to test the software also grow. Forcing more releases each year may negatively affect software quality, as not all bugs may be detected.
If the software is not fully tested and a bug goes undetected, or if a bug is not discovered even after extensive testing, and the network device is rebooted with the new software, more network down time may be experienced. The device may crash due to the bug, or it may cause other devices on the network to have problems, so that it and the other devices must be brought down again for repair or for another upgrade to fix the bug. In addition, after each software release, the size of the monolithic image increases, leading to a longer reboot time. Moreover, a monolithic image requires contiguous memory space, and thus the computer system's finite memory resources will limit the size of the image.
Unfortunately, limiting the number of software releases also delays the release of new hardware. New hardware modules, usually ready to ship between "major" software releases, cannot be shipped more than a few times a year since the release of the hardware must be coordinated with the release of new software designed to upgrade the monolithic software architecture to run the new hardware.
An additional and perhaps less obvious issue faced by customers is encountered when customers need to scale and enhance their networks. Typically, new and faster hardware is added to increase bandwidth or add computing power to an existing network. Under a monolithic software model, since customers are often unwilling to run different software revisions in each network element, customers are forced to upgrade the entire network. This may require shutting down and rebooting each network device.
"Dynamic loading" is one method used to address some of the problems encountered with upgrading monolithic software. The core or kernel software is loaded on power-up, but the dynamic loading architecture allows each application to be loaded only when requested. In some situations, instances of these software applications may be upgraded without having to upgrade the kernel and without having to reboot the system ("hot upgrade"). Unfortunately, much of the data and code required to support basic system services, for example, event logging and configuration, remains static in the kernel. Application program interface (API) dependencies between dynamically loaded software applications and kernel resident software further complicate upgrade operations. Consequently, many application fixes or improvements and new hardware releases require changes to the kernel code which, as with monolithic software changes, require updating the kernel and shutting down and rebooting the computer.
In addition, processes in monolithic images and those which are dynamically loadable typically use a flat (shared) memory space programming model. If a process fails, it may corrupt memory used by other processes. Detecting and fixing corrupt memory is difficult and, in many instances, impossible. As a result, to avoid the potential for memory corruption errors, when a single process fails, the computer system is often rebooted.
All of these problems impede the advancement of networks, a situation that is completely incongruous with the accelerated need and growth of networks today.
The present invention provides a computer system having a control process and a device driver process in communication with the control process. The computer system further includes a local back-up process for facilitating recovery of the device driver process if the device driver process is terminated. The local back-up process is independent of both the device driver process and the control process.
In one aspect, the device driver and the local back-up process communicate through a check-pointing procedure. The computer system can also include a remote back-up process for facilitating recovery of the control application, where the local back-up is independent of both the device driver process and the control process.
In a related aspect, the invention provides a network device having a control plane and a data plane. The control plane includes a control process for establishing and terminating network connections. The data plane includes a device driver process for transmitting data over network connections established by the control process. The network device further includes a local back-up process for facilitating recovery of the device driver process if the device driver process is terminated. The local back-up process is independent of both the control process and the device driver process.
In another aspect, the local back-up process and the device driver process of the network device communicate through a check-pointing procedure to allow the local back-up process to store certain state information representing the active state of the device driver process.
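The check-pointing relationship described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class names, method names and the per-connection dictionary are hypothetical, and both "processes" are modeled as objects inside one Python process, whereas the text describes independent processes.

```python
import copy


class LocalBackup:
    """Hypothetical local back-up: holds the last checkpointed driver state."""

    def __init__(self):
        self._state = {}

    def checkpoint(self, key, value):
        # Store a private copy so later driver-side changes cannot alter the backup.
        self._state[key] = copy.deepcopy(value)

    def remove(self, key):
        # Forget state for a connection that is no longer active.
        self._state.pop(key, None)

    def snapshot(self):
        # Hand out a copy of everything checkpointed so far.
        return copy.deepcopy(self._state)


class DeviceDriver:
    """Hypothetical device driver that checkpoints every state change."""

    def __init__(self, backup):
        self._backup = backup
        self.connections = {}

    def add_connection(self, conn_id, params):
        self.connections[conn_id] = params
        self._backup.checkpoint(conn_id, params)

    def drop_connection(self, conn_id):
        self.connections.pop(conn_id, None)
        self._backup.remove(conn_id)


if __name__ == "__main__":
    backup = LocalBackup()
    driver = DeviceDriver(backup)
    driver.add_connection("vc-1", {"port": 3})
    driver.add_connection("vc-2", {"port": 7})
    driver.drop_connection("vc-1")
    print(backup.snapshot())
```

In this sketch the back-up only ever sees state that the driver explicitly checkpoints, and it stores copies rather than references, so a driver failure after a checkpoint leaves the stored state intact.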
The network device can further include a remote back-up process for facilitating recovery of the control application if the control application is terminated, where the local back-up process is independent of both the device driver process and the control process.
The control process can be an Asynchronous Transfer Mode application, and the device driver can be an Asynchronous Transfer Mode driver corresponding to the Asynchronous Transfer Mode application. Alternatively, the control process can be a Multi-Protocol Label Switching application, and the device driver can be a Multi-Protocol Label Switching driver corresponding to the Multi-Protocol Label Switching application. Further, the control process can be an Internet Protocol application, and the device driver can be an Internet Protocol driver corresponding to the Internet Protocol application; or the control process can be a Frame Relay application, and the device driver can be a Frame Relay driver corresponding to the Frame Relay application.
The invention further provides a method for operating a network device. The method calls for establishing and terminating network connections through a modular control process within a control plane. Data is transmitted over network connections established by the control process through a modular device driver process within a data plane, and active state information is communicated between the device driver process and a local back-up process.
The method can further include the steps of terminating the device driver process, restarting the device driver process, and communicating back-up state information stored by the local back-up process to the restarted device driver process to facilitate recovery of the device driver process. In addition, the method can include a step of continuing to establish and terminate network connections through the control process when the device driver process is terminated and restarted.
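The terminate, restart and recover steps of this method can be illustrated with the following minimal sketch. All names are hypothetical, and the processes are simulated as objects in a single Python process. Queuing connection requests while the driver is down is an assumption chosen for the illustration; the text states only that the control process continues to establish and terminate connections.

```python
class LocalBackup:
    """Hypothetical local back-up holding checkpointed driver state."""

    def __init__(self):
        self.state = {}


class DeviceDriver:
    """Hypothetical driver; on restart it can reload state from the back-up."""

    def __init__(self, backup, restore=False):
        self.backup = backup
        # A restarted driver recovers the state stored by the local back-up.
        self.connections = dict(backup.state) if restore else {}

    def program(self, conn_id, port):
        self.connections[conn_id] = port
        self.backup.state[conn_id] = port  # checkpoint the change


class ControlProcess:
    """Hypothetical control process that keeps running across driver restarts."""

    def __init__(self):
        self.driver = None
        self.pending = []  # assumption: requests are queued while the driver is down

    def attach(self, driver):
        self.driver = driver
        for conn_id, port in self.pending:
            driver.program(conn_id, port)
        self.pending.clear()

    def detach(self):
        self.driver = None

    def establish(self, conn_id, port):
        if self.driver is None:
            self.pending.append((conn_id, port))
        else:
            self.driver.program(conn_id, port)


def run_restart_scenario():
    backup = LocalBackup()
    control = ControlProcess()
    control.attach(DeviceDriver(backup))
    control.establish("vc-1", 3)
    control.detach()                  # the driver process is terminated
    control.establish("vc-2", 7)      # the control process continues meanwhile
    restarted = DeviceDriver(backup, restore=True)
    control.attach(restarted)         # restarted driver recovers, then catches up
    return restarted.connections


if __name__ == "__main__":
    print(run_restart_scenario())
```

In this sketch the restarted driver ends up with both the connection that existed before the termination and the one requested while it was down.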