Digital media servers such as Web-based servers and video-on-demand servers typically include a number of functional components including components for storing digital media, converting such media from file format to wire format, and scheduling the delivery of media packets. During operation, a media server accepts incoming requests for content from clients or administrators and delivers media packets to clients via a network.
Most digital media servers employ a PC-based architecture and run a variety of software components to provide the above-described functionality. Great effort is made during the design of such software components to ensure that they are fully debugged and free from defects. As a practical matter, however, many defects are not discovered during the design phase and are exposed only when the software is put into actual operation.
Defects discovered during system operation are often corrected by performing a software upgrade. Software upgrades are also sometimes performed to supplement or improve server functionality, thus extending a server's competitive life.
To upgrade an executing software component, the component must be stopped, and the replacement version loaded into memory and run. During this period, services normally provided by the component are unavailable.
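The unavailability window described above can be illustrated with a minimal sketch. The `Component` class and the `upgrade_in_place` function are hypothetical names introduced only for illustration; they do not represent any particular media server's software.

```python
class Component:
    """Hypothetical software component that serves requests only while running."""

    def __init__(self, version):
        self.version = version
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def handle(self, request):
        # A stopped component cannot provide its service.
        if not self.running:
            raise RuntimeError("service unavailable")
        return f"{request} served by v{self.version}"


def upgrade_in_place(component, new_version, requests_during_upgrade):
    """Stop the component, swap in the replacement version, and restart it.

    Returns the requests that arrived while the component was stopped
    and therefore could not be served.
    """
    refused = []
    component.stop()
    for request in requests_during_upgrade:
        try:
            component.handle(request)
        except RuntimeError:
            refused.append(request)
    # Stands in for loading the replacement version into memory.
    component.version = new_version
    component.start()
    return refused
```

In this sketch, every request that arrives between `stop()` and `start()` is refused, which corresponds to the period during which the component's services are unavailable.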
The consequences of a defect in a media server's operating system may be even more severe. Operating systems are typically designed around a number of tightly coupled modules that supply abstract data structures such as files, memory storage, input/output streams, semaphores, processes, and threads to other programs. Application programs access these abstract structures through an application programming interface (API). A change made to one of these structures may cause side effects in other structures or modules. Generally, replacement of operating system-level components requires reloading the entire operating system, and is accomplished during a reboot of the server. Thus, operating system-level resources cannot be upgraded without taking the media server offline, and a reboot may take a considerable amount of time before the server's services are restored.
Offline servers are unable to accept incoming requests or deliver content to existing sessions. Consequently, an offline server may affect the availability of an entire service network unless adequate redundant servers are available.
FIG. 1 illustrates a typical upgrade process and its effect on network availability. As shown in FIG. 1, in step 105, an upgrade is initiated. Next, in step 110, an upgrade package is detected. If the upgrade package cannot be downloaded, the upgrade process terminates (step 190).
Before the upgrade can be installed, pre-upgrade management steps 120 are performed. In particular, in step 125, user sessions are either thinned or transferred to unaffected machines. Next, in step 127, services affected by the software to be upgraded are discontinued.
Next, upgrade process steps 140 are performed. In particular, in step 145, the settings and properties of the system are either copied or modified. In step 147, new components are copied from the upgrade package. Although some media servers may permit the local or remote transfer of data into the server while it is operating, some service disruption is typically necessary to effect the upgrade, and in most cases the server must first be brought offline.
Next, post-upgrade process steps 160 are performed. In particular, in step 165, the media server's power is cycled off and then back on (if the server was taken offline), and services provided by the upgraded software are restarted. A single power cycle may last anywhere from a few seconds to several minutes. The amount of time required for a single power cycle depends on how long the server needs to perform an orderly shutdown of running applications before powering off plus the time needed to reboot the server and restore the applications after powering back on. Only after these events are completed can the server begin to accept new user sessions (step 167).
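The sequence of FIG. 1 and the power-cycle timing described above can be summarized in a short sketch. The step numbers are those of FIG. 1; the function names, parameters, and any timing values passed in are hypothetical and serve only to illustrate the flow.

```python
def power_cycle_seconds(shutdown_s, reboot_s, restore_s):
    """Time for one power cycle: orderly shutdown, plus reboot, plus application restore."""
    return shutdown_s + reboot_s + restore_s


def run_upgrade(package_available, taken_offline=True,
                shutdown_s=0, reboot_s=0, restore_s=0):
    """Walk through the upgrade flow of FIG. 1.

    Returns the list of step numbers performed and the power-cycle time
    in seconds (0 if the server was never taken offline).
    """
    steps = [105, 110]            # initiate upgrade, look for the upgrade package
    if not package_available:
        steps.append(190)         # package could not be obtained: terminate
        return steps, 0

    steps += [125, 127]           # pre-upgrade management steps 120
    steps += [145, 147]           # upgrade process steps 140

    # Post-upgrade process steps 160: the power cycle applies only if the
    # server was taken offline; services are restarted either way.
    cycle = power_cycle_seconds(shutdown_s, reboot_s, restore_s) if taken_offline else 0
    steps += [165, 167]
    return steps, cycle
```

For example, assuming an orderly shutdown of 30 seconds, a reboot of 90 seconds, and an application restore of 60 seconds, `run_upgrade(True, True, 30, 90, 60)` reports a power cycle of 180 seconds before new user sessions can be accepted in step 167.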
The above process may significantly affect system operation, especially in cases of system-wide upgrades such as an upgrade of all system APIs and low-level drivers. A typical digital-media company may have dozens of on-line media servers affected by such an upgrade. Although the company may select a time for the upgrade when server usage is at its lowest point, the upgrade may still disrupt service to some extent if it necessitates shutting down media servers. At a minimum, the company may experience loss of revenue for the downtime and risk customer dissatisfaction.
To avoid such service disruptions, companies often maintain excess server capacity or redundant systems to handle traffic channeled away from affected servers during an upgrade. But redundant systems introduce additional overhead cost and in many cases are not available.
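Whether excess capacity suffices for a given upgrade can be estimated with simple arithmetic, sketched below. The function name and the notion of per-server spare session capacity are assumptions made for illustration only.

```python
def sessions_left_unserved(displaced_sessions, spare_capacity):
    """Sessions that cannot be moved to unaffected servers during an upgrade.

    displaced_sessions: sessions active on the servers being upgraded.
    spare_capacity: spare session slots on each unaffected server (hypothetical figures).
    """
    total_spare = sum(spare_capacity)
    return max(0, displaced_sessions - total_spare)
```

A result of zero indicates that the redundant servers can absorb all of the channeled traffic; any positive result is the number of sessions that would be disrupted.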