This relates to graphics processing units and particularly to the use of graphics processing units to handle latency-sensitive applications.
Because of advances in graphics processing unit architecture, graphics processing units are being relied upon to handle ever more complex operations. However, in connection with latency-sensitive applications, graphics processing units have some drawbacks. Typically, for a task to be handled by the graphics processing unit, it must be assigned to it by a central processing unit. This assignment involves the task passing through schedulers and command buffers in a vertical software stack, which generally increases the time that many simple operations require. Because there may be a large number of these simple operations, such as launching threads on graphics processing units, assigning such tasks to graphics processing units in latency-sensitive applications may not be effective.
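The effect of this vertical stack can be sketched with a simple cost model. The model below is purely illustrative, not a real driver interface; the names (`SCHEDULER_US`, `COMMAND_BUFFER_US`, `dispatch_latency_us`) and all numbers are hypothetical, chosen only to show how a fixed per-submission cost dominates small tasks.

```python
# Hypothetical model: every CPU-to-GPU task submission pays a fixed
# software-stack cost (scheduler plus command buffer) before any GPU
# work begins. All values are made-up microsecond figures.

SCHEDULER_US = 5.0        # hypothetical scheduler cost per submission
COMMAND_BUFFER_US = 10.0  # hypothetical command-buffer cost per submission

def dispatch_latency_us(gpu_work_us: float) -> float:
    """Total latency of one task submission under this toy model."""
    return SCHEDULER_US + COMMAND_BUFFER_US + gpu_work_us

# For a simple 1 us operation, the fixed stack cost dominates:
print(dispatch_latency_us(1.0))     # 16.0 us, ~94% of it is overhead
# For a large 1000 us operation, the same fixed cost is negligible:
print(dispatch_latency_us(1000.0))  # 1015.0 us, ~1.5% overhead
```

Under these assumed numbers, the fixed assignment cost is amortized well by large tasks but overwhelms the many small operations typical of latency-sensitive workloads.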
Generally, despite the high programmability and parallel computation available with graphics processing units, accelerating packet processing on graphics processing units is difficult because of the high central processing unit to graphics processing unit communication overhead and the high cost of launching threads on a graphics processing unit.
Network packet processing applications are by nature latency sensitive. Typically, graphics processing unit worker threads rely on the host or central processing unit to notify the graphics processing unit when producer data is ready to be processed. The latency introduced by thousands of thread launches per second, plus the communication overhead between the central processing unit and the graphics processing unit, is inconsistent with the latency requirements of network applications such as packet forwarding.
While some techniques such as batching do let kernels process a large number of tasks in order to amortize this overhead, batch processing is not practical for latency-sensitive streaming applications such as packet forwarding. One reason is that only a limited number of packets can be accumulated into a batch before they must be processed, and packets that wait for a batch to fill accrue additional latency.
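The batching trade-off described above can be made concrete with a short worked example. All figures below (launch overhead, packet arrival rate) are hypothetical assumptions, and the helper names are invented for illustration: larger batches shrink the amortized launch cost per packet but grow the time packets spend waiting for the batch to fill.

```python
# Hypothetical numbers: a fixed kernel-launch cost amortized over a batch,
# versus the queuing delay the first packet of the batch experiences while
# waiting for the batch to fill at a steady arrival rate.

LAUNCH_OVERHEAD_US = 20.0  # assumed fixed cost per kernel launch, microseconds
PACKET_ARRIVAL_US = 1.0    # assumed inter-packet arrival time, microseconds

def per_packet_overhead_us(batch_size: int) -> float:
    """Launch overhead amortized across one batch of packets."""
    return LAUNCH_OVERHEAD_US / batch_size

def worst_case_wait_us(batch_size: int) -> float:
    """Time the first packet waits for the remaining packets to arrive."""
    return (batch_size - 1) * PACKET_ARRIVAL_US

for n in (1, 32, 1024):
    print(n, per_packet_overhead_us(n), worst_case_wait_us(n))
# batch of 1:    20.0 us overhead per packet,    0.0 us wait
# batch of 32:   0.625 us overhead per packet,  31.0 us wait
# batch of 1024: ~0.02 us overhead per packet, 1023.0 us wait
```

Under these assumed figures, amortization only becomes effective at batch sizes whose fill-time delay is already far beyond what a packet-forwarding latency budget permits, which is why batching alone does not resolve the overhead problem for streaming workloads.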