A computer network is a geographically distributed collection of interconnected sub-networks for transporting data between nodes, such as computers. A local area network (LAN) is an example of such a sub-network; a plurality of LANs may be further interconnected by an intermediate network node, such as a router or switch, to extend the effective “size” of the computer network and increase the number of communicating nodes. The nodes typically communicate by exchanging discrete frames or packets of data according to predefined protocols. In this context, a protocol consists of a set of rules defining how the nodes interact with each other.
Each node typically comprises a number of basic components including a processor, a memory and an input/output (I/O) interface. To increase performance, a node may embody a multi-processor environment wherein a plurality of processors is coupled to a shared I/O interface via a management module. Typically, a workload is shared among the multiple processors either on a per-transaction basis or based on function. The processors may be general-purpose processors or processing cores of, e.g., a network processor. Often, all of the processors require access to the same shared interface in order to receive work (such as packets to be processed) and return the results of the processing. In this type of application or network processing system, individual processors may periodically fail, typically due to a software failure, and must be restarted. This temporary loss of one of the processors results in reduction of throughput, but should not affect the availability of the system.
In particular, a fault tolerant, high availability system must be able to recover cleanly from processor failures while minimizing the impact on the operation of the remaining processors. However, when a management module manages the data for multiple processors, it is often difficult to recover from a processor failure without affecting the data from the other processors, particularly when the data is intermixed within common destination ports and queues of the system. In a high availability network processing system, such as a multi-processor flow device, the data may be embodied as packets that comprise user information used to create a user session. An example of such a user session is a voice or packet “call” between two users over the computer network. A large number of user sessions may be allocated to the processors. If one processor fails, some user sessions may be lost, but the remaining sessions remain active so that the percentage of outage is relatively small.
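The relationship between the number of processors and the percentage of outage can be illustrated with a minimal sketch. The round-robin allocation policy, the function names and the figures (1000 sessions, 8 processors) are assumptions chosen for illustration only and are not taken from the description above.

```python
def allocate_sessions(session_ids, num_processors):
    """Assign each session to a processor round-robin (an assumed policy)."""
    return {sid: i % num_processors for i, sid in enumerate(session_ids)}


def outage_fraction(allocation, failed_processor):
    """Return the fraction of sessions lost when one processor fails."""
    if not allocation:
        return 0.0
    lost = sum(1 for proc in allocation.values() if proc == failed_processor)
    return lost / len(allocation)


# Hypothetical figures: 1000 user sessions spread over 8 processors.
allocation = allocate_sessions(range(1000), num_processors=8)
fraction = outage_fraction(allocation, failed_processor=3)
print(f"sessions lost: {fraction:.1%}")
```

Under an even allocation of this kind, the fraction of sessions lost is roughly the reciprocal of the number of processors, so adding processors reduces the relative impact of any single failure.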
An application particularly suited for this type of high-availability, multi-processor environment is a wireless networking application using, e.g., cellular phones to exchange information among the users. For this type of application, the multi-processor flow device is configured to provide session processing operations for each user. Wireless networks perform functions similar to those of “wired” networks, except that the atmosphere, rather than wires, provides the path over which the data may flow. Many users share the atmosphere using techniques that facilitate such sharing. Examples of a shared wireless network include a wireless local area network and a wireless asynchronous transfer mode network.
When the data of a failed processor is intermixed within shared queues of the multi-processor flow device, it is desirable to remove (purge) that potentially “bad” (corrupted) data from the queues without affecting otherwise “good” data stored in those queues from the remaining processors. One prior approach used to recover from a processor failure within a multi-processor flow device involves resetting the management module and reinitializing all queues managed by that module. However, this approach results in lost data not only for the processor that failed, but also for all processors managed by the module.
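The drawback of the reset approach can be illustrated with a minimal sketch. The `ManagementModule` class, its methods and the representation of queue entries as (processor identifier, payload) pairs are hypothetical and serve only to model shared queues in which data from several processors is intermixed.

```python
from collections import deque


class ManagementModule:
    """Toy model of shared queues holding (processor_id, payload) entries."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]

    def enqueue(self, queue_index, processor_id, payload):
        self.queues[queue_index].append((processor_id, payload))

    def reset(self):
        """Reinitialize every queue; return entries discarded per processor."""
        discarded = {}
        for queue in self.queues:
            for processor_id, _ in queue:
                discarded[processor_id] = discarded.get(processor_id, 0) + 1
            queue.clear()
        return discarded


module = ManagementModule(num_queues=2)
module.enqueue(0, processor_id=0, payload="pkt-a")
module.enqueue(0, processor_id=1, payload="pkt-b")  # processor 1 later fails
module.enqueue(1, processor_id=2, payload="pkt-c")
module.enqueue(1, processor_id=1, payload="pkt-d")

# Recovering from the failure of processor 1 by resetting the whole module
# also discards the entries of processors 0 and 2.
discarded = module.reset()
print(discarded)
```

In this model the reset removes one entry each for processors 0 and 2 even though only processor 1 failed.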
Another prior approach used to recover from such a processor failure involves complete parsing of the queues by a host processor of the multi-processor flow device, searching for any corrupted data remaining from a failed processor and purging that corrupted data from the system. However, this approach results in lost performance and wasted memory bandwidth of the flow device. Therefore, it is desirable to purge corrupted data issued by a failed processor of a multi-processor flow device from queues of the device in an efficient manner that does not affect data from the remaining processors. The present invention is directed to solving this problem by providing a technique for recovering from a failure of a processor of a multi-processor flow device, such as an intermediate network node, without disturbing proper operation of the other processors.
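For reference, the queue-parsing prior approach described above may be sketched as follows. The function name and the representation of queue entries as (processor identifier, payload) pairs are assumptions made for illustration. The sketch models only the prior approach and not the technique of the present invention.

```python
from collections import deque


def purge_by_full_scan(queues, failed_processor):
    """Remove every entry of the failed processor; return (purged, examined)."""
    purged = 0
    examined = 0
    for queue in queues:
        # Rotate through the queue exactly once so that the surviving
        # entries keep their original relative order.
        for _ in range(len(queue)):
            entry = queue.popleft()
            examined += 1
            if entry[0] == failed_processor:
                purged += 1
            else:
                queue.append(entry)
    return purged, examined


queues = [
    deque([(0, "pkt-a"), (1, "pkt-b"), (2, "pkt-c")]),
    deque([(1, "pkt-d"), (0, "pkt-e")]),
]
purged, examined = purge_by_full_scan(queues, failed_processor=1)
print(purged, examined)
```

Data from the remaining processors is preserved, but every entry in every queue must be read in order to find the entries of the failed processor, which is the source of the lost performance and wasted memory bandwidth noted above.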