The present invention generally relates to a multi-processor system employed in communication network environments requiring maintenance and/or upgrade of the processors, and particularly to systems using multiple modem Digital Signal Processing (DSP) devices, operational by execution of external and re-loadable software (or firmware) code, requiring maintenance and/or upgrade thereof with minimal impact on users of such devices while maximizing utilization of the available capacity of the DSP devices.
In recent years, many hardware components have been packaged together as a unit called a module. These components can be DSPs (Digital Signal Processors), controllers, Central Processing Unit (CPU) devices, and the like. An example of a DSP device is a modem used for communication between two electronic devices such as computers, embedded devices, etc. As an example, a well-known manufacturer of network communication equipment, Cisco Systems, Inc., of San Jose, Calif., develops and manufactures access servers employing a particular type of modem device, MICA. In some of its access servers, such as the models 5200, 5300 and 5800, 6 or 12 MICA modems are packaged into a module. These types of access servers are used as gateways between the PSTN (Public Switched Telephone Network) and data networks, such as the Internet.
A network access server (NAS) converts data traffic from the PSTN protocol (timeslot) managed data to packetized data used within data networks such as the Internet. A NAS is essentially a specialized type of router having a T1/E1 controller card. The T1/E1 controller card includes hardware for multiplexing and de-multiplexing Time Division Multiplexed (TDM) signals coupled onto T1 or E1 lines. That is, the TDM hardware separates the calls that are coupled onto a PSTN trunk, based upon assigned time slots, into individual calls. A router is a device that can select a path that information traveling through a packet switching network environment should take thereby requiring the router to have an understanding of the network and how to determine the best route for the path.
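The time-slot separation described above can be illustrated with a short sketch. It is only a simplified model under stated assumptions: frames are treated as byte-interleaved sequences with one byte per time slot, framing bits and signaling are ignored, and the function name `demultiplex` is hypothetical rather than part of any actual controller card interface.

```python
def demultiplex(frames, slots_per_frame):
    """Split byte-interleaved TDM frames into one byte stream per time slot.

    Assumption: each frame carries exactly one byte per slot, in slot order.
    """
    channels = [bytearray() for _ in range(slots_per_frame)]
    for frame in frames:
        if len(frame) != slots_per_frame:
            raise ValueError("frame length does not match slot count")
        # Byte i of every frame belongs to the call assigned to time slot i.
        for slot, sample in enumerate(frame):
            channels[slot].append(sample)
    return [bytes(c) for c in channels]


# Two frames on a toy three-slot trunk (a T1 line carries 24 slots).
frames = [bytes([1, 2, 3]), bytes([4, 5, 6])]
calls = demultiplex(frames, 3)
print(calls)
```

Each element of `calls` holds the samples of one individual call, which is the form in which a modem DSP would receive them.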
A design consequence of grouping processors (or modems) into modules, which share a mutual memory space and/or controller, is that reloading each processor cannot be accomplished on an individual basis. Instead, all processors must be loaded at the same time. This has the benefit of speeding up initial loading when no processors are active as the software will only be transferred once for multiple processors. However, this design has adverse consequences when trying to reload the processors while the system is active or operational. That is, reloading a processor that is in use terminates any end-user activity rather abruptly, causing significant frustration to the user. Reloading a processor is done for various reasons, such as upgrading the software or for maintenance purposes and the like.
Oftentimes, some hardware components, such as DSP devices, fail to function properly and will need some form of maintenance. For instance, when a modem DSP fails, i.e. hangs up at a given point in a modem call, reloading or downloading the modem's software usually resolves the problem as it returns the modem to a known state, at which point the modem is again capable of processing new calls. The need to reload a processor may also occur when the current software is out-dated and an updated version of the software needs to be downloaded. In this case, downloading is for the purpose of upgrading the software. A modem DSP is a DSP device that is configured to operate as a modem device by, for example, programming the DSP device in a manner so as to function as a modem.
However, even if a particular modem device in a module needs downloading, the rest of the modems on the same module may be active and successfully processing other incoming calls. At present, one approach to downloading is to “busy out” all the modems on a given module by making all the modems in the module unavailable to new requests by the system so that no new calls can be allocated thereto. Once there are no more active calls being processed by the given module, the module is available for having its software downloaded to all modems without impacting any end-users. While this approach offers a graceful way of reloading the modems from an end-user's perspective, it has the disadvantage of reducing the capacity of the system, that is, of the network access server. For instance, to download the software for one faulty modem, 5 or 11 other properly functioning modems on the module are held inactive, sometimes for days, waiting for all end-users to end their modem connections before downloading can be achieved.
Another approach, which attempts to minimize the impact of downloading modules on the access server's capacity, is to schedule maintenance to the off-hours, at a time when fewer users may be logged onto the system. This approach basically accepts the impact of forcefully dropping any end-user calls to perform the maintenance task necessary. The disadvantage with this approach is in the possibility of taking out an entire module of active end-users to recover one malfunctioning modem. Even though the impact on the capacity is not as severe as in the previous approach discussed hereinabove, nevertheless, the end-users are disconnected forcefully from the access server, causing significant frustration to the end-users. This is especially the case if a large number of modules are to be scheduled for reloading at the same time in the off-hours, thereby affecting many access servers' end-user customers.
Modems can be deemed defective in multiple ways. System tests can be performed on inactive modems in order to test their integrity. Furthermore, statistical analysis can be used to identify defective modems. In this case, a modem is deemed defective if it fails to establish a connection over several consecutive calls with various end-users. This is done to ensure that the problem is originating from the modem and not from the end-user, as the possibility exists that the equipment on the side of the end-user is not functioning properly and/or the end-user has simply disconnected before the call can be completed. In making several calls, the modem is likely to be connected to several users, and if the calls are unsuccessful, there is a strong likelihood that the problem originates from the NAS' modem rather than the end-users. This is the preferred method for identifying defective modems, as self-tests often pass even when there is a problem.
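The statistical approach can be sketched as a per-modem counter of consecutive failed calls. This is a minimal illustration and not an actual access server implementation: the class name `ModemHealth`, its fields, and the threshold value used in the example are all assumptions chosen for clarity.

```python
from dataclasses import dataclass


@dataclass
class ModemHealth:
    """Tracks consecutive failed call attempts for one modem (hypothetical helper)."""
    threshold: int = 10            # assumed count of consecutive failures before flagging
    consecutive_failures: int = 0
    defective: bool = False

    def record_call(self, connected: bool) -> None:
        if connected:
            # A successful call shows the modem works, so the streak resets.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.defective = True


modem = ModemHealth(threshold=3)
for outcome in [False, False, True, False, False, False]:
    modem.record_call(outcome)
print(modem.defective)
```

Resetting the counter on any success is what ties the verdict to the modem rather than to one misbehaving end-user: isolated failures spread across many callers never accumulate.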
It has been the inventor's experience that modems exhibit a success rate of 90% to 95% under normal operation. That is, 90% to 95% of all calls which are allocated to a modem successfully connect, link, train up, negotiate, and finally enter a steady state such that the client (or user) and the access server modems can transfer data. The 5% to 10% failure rate can be attributed to numerous issues such as incompatible equipment, clients disconnecting, etc. Thus, in at least some prior art systems, it is expected that as many as 1 call in 10 attempts will fail.
Statistically:
The probability of 1 failed call attempt is: 1/10
The probability of 2 consecutive failed call attempts is: 1/10 × 1/10
The probability of 3 consecutive failed call attempts is: 1/10 × 1/10 × 1/10
The probability of n consecutive failed call attempts is: (1/10)^n
As such, according to basic statistics, even in a situation where the success rate is only 90%, the probability that a good modem repeatedly fails to enter steady state, once allocated, drops by a factor of ten with each additional consecutive failed call attempt. Thus, where the value of “n” is as small as 10, one can safely assume the modem to be actually bad and mark the modem accordingly. As used in this document, “n” will denote the “modem recovery threshold <value>”.
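The arithmetic behind this threshold can be checked with a few lines of code. The sketch assumes that call attempts fail independently of one another and that a healthy modem fails 1 call in 10; the function name `prob_consecutive_failures` is illustrative only.

```python
from fractions import Fraction


def prob_consecutive_failures(n: int, failure_rate: Fraction = Fraction(1, 10)) -> Fraction:
    """Probability that a healthy modem fails n calls in a row.

    Assumption: each call attempt fails independently with the same rate.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return failure_rate ** n


for n in (1, 2, 3, 10):
    print(n, prob_consecutive_failures(n))
```

At n = 10 the result is 1 in 10,000,000,000, which is why ten consecutive failures can reasonably be attributed to the modem rather than to chance.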
As noted previously, modem functions are implemented in a modular fashion whereby 6 or 12 modems are allocated to a single controller device overseeing the operation of the modem DSPs. An unfortunate consequence of this design is that the network access server is unable to download DSP firmware to a single modem of the module and instead requires all 6 or 12 modems to be reloaded at the same time. This issue is not significant when initializing the network access server, as no active calls are being processed at that time. It is significant, however, when trying to load firmware code for either recovery or upgrade purposes. The problem is how to reload the modem module with minimal impact to the end-users and to the network access server.
As earlier noted, there are a couple of ways in which prior art techniques have addressed this problem. One is to “busyout” the modem module, whereby all modems of the module are locked (or act as though they are busy), which disallows new calls from being allocated to any of the modems until the “busyout” status is removed, usually after the modem module is reloaded. Existing calls on modems are not affected when the modem module is in the “busyout” state.
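The “busyout” behavior can be modeled in a few lines. The following is a hedged sketch rather than the logic of any real access server: the class `ModemModule` and its methods `allocate_call`, `end_call`, and `try_reload` are hypothetical names, and the actual firmware transfer is reduced to a boolean result.

```python
from dataclasses import dataclass, field


@dataclass
class ModemModule:
    """Hypothetical model of a module whose modems must be reloaded together."""
    size: int = 12
    active: set = field(default_factory=set)   # indices of modems carrying calls
    busyout: bool = False

    def allocate_call(self):
        """Return a free modem index, or None if the module is busied out or full."""
        if self.busyout:
            return None
        for i in range(self.size):
            if i not in self.active:
                self.active.add(i)
                return i
        return None

    def end_call(self, modem: int) -> None:
        self.active.discard(modem)

    def try_reload(self) -> bool:
        """Reload only when busied out and idle; a reload clears the busyout state."""
        if self.busyout and not self.active:
            self.busyout = False
            return True
        return False


module = ModemModule(size=12)
first = module.allocate_call()   # one existing call on the module
module.busyout = True            # lock the module for maintenance
print(module.allocate_call())    # new calls are refused
print(module.try_reload())       # the existing call still blocks the reload
module.end_call(first)
print(module.try_reload())       # the module is idle, so the reload proceeds
```

Between setting `busyout` and the last call ending, every free modem in the module is unavailable, which is the capacity cost of this technique.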
One way to evaluate the effectiveness of a modem module downloading technique is by observing the modem module at various times. During an hourly utilization analysis, modem usage is actually quite predictable. Telecommuters who use modems between 7:00 am and 6:00 pm provide a consistent call volume throughout the business day. The nightly “Internet surfers” “surf” the web between 6:00 pm and 2:00 am. As a result, modem usage between 2:00 am and 7:00 am is typically at its lowest.
The “busyout” technique is currently widely used for firmware upgrades. However, it has a significant drawback. A single modem end-user who decides to stay connected for days can severely impact the capacity of the network access server if the module is left in a “busyout” state until all calls drop. If there is one active call in a module of twelve modems where the remaining eleven modems are free, there can be a serious impact on a network access server's ability to perform at top capacity, especially during high load time periods. Accordingly, the need arises for a modem recovery method and apparatus for reloading firmware code with the least impact possible while maximizing successful reloading attempts.
In light of the above, it is desirable and indeed necessary to have a recovery mechanism for modem modules employed in network communications equipment which minimizes any adverse impact to the end user while maximizing the available capacity of the system. This is especially needed for systems where there is a high demand for available modems such as Internet Service Providers (ISP) providing access to the Internet. For such systems, it is important to have as many modems available as possible at any given time especially during the peak hours when many users place calls.
Furthermore, maintenance of a system that includes modem devices, such as an access server and the like, is currently performed manually. For example, if the system needs to be upgraded, the operators have to come in during off-hours, such as 3:00 AM, to perform their maintenance tasks. It is therefore desirable to automate the process of maintenance so that various equipment modules can be self-sustaining. That is, when a problem develops within a module, an algorithm detects the problem, designates the module for maintenance, performs the required maintenance, and places the module back into operation with as little impact as possible on end-user activity and overall system capacity.
Therefore, the need arises for minimizing end-user impact while, at the same time, maximizing the available capacity for processing requests through systems that contain modular reloadable processors such as modem DSPs, and for doing so automatically.