Network devices, e.g., firewalls, switches, routers, storage/compute servers, or other network-attached devices, often utilize multi-core processor systems or systems having multiple processing units to achieve increased performance. However, processing streams of data, such as network packets, with systems having multiple processing units can present many programming challenges. For example, it is often difficult to move processing of a packet or set of packets from one processing unit to another, such as for load balancing across the processing units. Transitioning program execution from one processing unit to another often requires brute-force movement or mapping of state, cached data, and other memory associated with the program execution. Maintaining consistency of cached data and other memory across processing units while achieving high throughput and utilization is often technically challenging. For example, when using coherent memory, significant processing overhead and delays may result from operations performed by a memory coherence protocol. When using non-coherent memory, the overhead of the coherence protocol is avoided, but some processing units might not have access to data cached by another processing unit.