Typically, the execution of a network application can be divided into multiple stages of processing. For example, web server processing can be broken down into the following stages:
Layer 2, Layer 3 and flow processing
TCP protocol stack processing
SSL protocol stack processing
HTTP protocol stack processing
Application written on top of the HTTP protocol
The application itself can also be divided into multiple stages depending on its functionality. Each stage is a well-contained function and should provide a well-defined API. When a packet enters the system, it goes through the stages of processing one after another, so some communication mechanism needs to be implemented for inter-stage communication. With multiple cores available to execute an application, different models of execution are possible by distributing the processing of the stages across the cores in different ways.
In a pipeline execution model, as shown in FIG. 1A, each core is dedicated to performing one of the stages in the application processing. Here Pn is the nth packet, Fm is the mth flow and Sk is the kth stage. In order for all the required functions of all stages to be performed, the packet traverses from one core to the next. This model works best under the following conditions:
Every stage performs an equal amount of processing.
The number of stages is equal to the number of processing cores.
It is uncommon for an application to divide into stages that all require the same amount of processing. If the stages are not equal, the performance of the whole pipeline is limited by its slowest stage. In order to balance the stage processing and utilize all the cores, it may be necessary to perform the same function on multiple cores.
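The bottleneck effect described above can be illustrated with a small model. This is a minimal sketch, not part of the described system: the function name `pipeline_throughput`, the per-stage costs, and the idea of expressing cost in microseconds per packet are all assumptions chosen for illustration. It assumes that a stage replicated on several cores scales linearly and that inter-stage communication is free.

```python
def pipeline_throughput(stage_costs_us, cores_per_stage=None):
    """Return sustained packets per microsecond for a pipeline.

    stage_costs_us  -- hypothetical per-packet cost of each stage, in microseconds
    cores_per_stage -- how many cores run each stage (defaults to one core each)

    Assumption: a stage running on k cores processes k packets at once,
    and handing a packet to the next stage costs nothing.
    """
    if cores_per_stage is None:
        cores_per_stage = [1] * len(stage_costs_us)
    if len(cores_per_stage) != len(stage_costs_us):
        raise ValueError("need one core count per stage")
    # The pipeline can go no faster than its slowest stage.
    return min(cores / cost
               for cost, cores in zip(stage_costs_us, cores_per_stage))


# Four stages with unequal (made-up) costs, one core per stage:
unbalanced = pipeline_throughput([1.0, 2.0, 4.0, 1.0])               # 0.25
# Same stages, with the heavier stages replicated on more cores:
balanced = pipeline_throughput([1.0, 2.0, 4.0, 1.0], [1, 2, 4, 1])   # 1.0
```

In the first call the 4-microsecond stage caps the whole pipeline at one packet every 4 microseconds while the other three cores sit partly idle. Replicating the heavier stages restores balance, but at the cost of eight cores instead of four.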
In a parallel execution model, as shown in FIG. 1B, all the stages of the application processing are replicated on all the cores of the SOC, and the traffic is load balanced so that all the cores are utilized efficiently. In order to load balance the traffic, either a couple of cores need to be dedicated to this purpose or one more stage needs to be introduced to perform the load balancing. Also, if any packet can be sent to any core, application state needs to be managed in shared memory.
The challenge with this scheme is to load balance the traffic efficiently without breaking the application semantics. For example, if multiple cores of a system chip (e.g., a system-on-chip or SOC) are being used to provide TCP termination functionality, the load balancer needs to preserve event ordering (i.e., if a core is working on a TCP segment for a given connection, no other core should work on any event of the same TCP connection). A typical way to solve this TCP ordering and connection atomicity problem is to use some kind of hash to dispatch the packets, so that packets of a given connection always end up on the same core, thereby creating an implicit execution order. However, using a hash may create an imbalance, leaving some of the cores underutilized.
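Hash-based dispatch of this kind can be sketched as follows. This is a hedged illustration only: the text does not say which hash or which packet fields are used, so the choice of CRC32 over the connection 4-tuple, the function names `core_for_connection` and `dispatch`, and the sample addresses are all assumptions.

```python
import zlib
from collections import Counter


def core_for_connection(src_ip, src_port, dst_ip, dst_port, num_cores):
    """Map a connection 4-tuple to a core index.

    CRC32 is an assumed stand-in for "some kind of hash"; any
    deterministic hash of the connection identity would do.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_cores


def dispatch(packets, num_cores):
    """Return the number of packets each core receives.

    Each packet is represented only by its connection 4-tuple.
    """
    load = Counter()
    for src_ip, src_port, dst_ip, dst_port in packets:
        core = core_for_connection(src_ip, src_port, dst_ip, dst_port,
                                   num_cores)
        load[core] += 1
    return [load[core] for core in range(num_cores)]


# One heavy connection (hypothetical addresses) sending 100 packets:
heavy_flow = ("10.0.0.1", 40000, "10.0.0.2", 443)
loads = dispatch([heavy_flow] * 100, num_cores=4)
# sorted(loads) == [0, 0, 0, 100]: one core does all the work
```

Because the core index depends only on the connection identity, every packet of a connection reaches the same core and ordering is preserved without locks. The same property causes the imbalance: the dispatcher cannot spread a single busy connection, so the load on each core depends on how the active connections happen to hash.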