In modern data networking, streaming is an essential mode of communication. In stream mode, the receiving end of a data channel continuously expects data until the channel is torn down.
Stream mode is used extensively on the Internet. At the transport layer, both TCP (transmission control protocol) and UDP (user datagram protocol) operate in stream mode. At the application layer, video and audio are often streamed directly from a server to a user device. For live broadcasts, streaming is the only possible mode of communication.
A challenge in stream mode is the efficiency of data transfer from one stream to another. Data transfer is at the heart of numerous applications, and its efficiency is often the critical factor in determining performance. For example, stream-mode data transfer is essential for TCP splicing, which is a useful method to accelerate all types of web applications.
TCP splicing is especially important for deploying TCP optimization solutions. It is estimated that by 2017, there will be 50 billion TCP-running devices online. When an improved TCP implementation becomes available, it is impossible to deploy it by replacing every TCP stack that might be involved. A feasible deployment strategy is to insert TCP splicing boxes in the paths between the possible set of servers and the possible set of clients. When the server, the client, or both must remain untouched, TCP splicing is probably the only practical way to optimize TCP throughput.
In TCP splicing, a TCP proxy is placed at an intermediate node between a sender and a receiver. The original TCP connection between the sender and the receiver is replaced by two connections: one between the origin sender and the proxy, and a second between the proxy and the receiver. Data sent from the origin sender is temporarily stored at the proxy, and the stored data is sent to the receiver opportunistically as bandwidth to the receiver becomes available.
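The store-and-forward behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any particular splicing box: the names relay and demo_relay are hypothetical, and local socket pairs stand in for the two TCP connections. The comments mark the two copies that each chunk incurs when it passes through user space.

```python
import socket


def relay(src: socket.socket, dst: socket.socket, bufsize: int = 65536) -> int:
    """Forward bytes from src to dst until src reaches end-of-stream."""
    total = 0
    while True:
        chunk = src.recv(bufsize)   # copy 1: kernel receive buffer -> user space
        if not chunk:               # empty read: the sender closed its side
            break
        dst.sendall(chunk)          # copy 2: user space -> kernel send buffer
        total += len(chunk)
    return total


def demo_relay() -> bytes:
    # Hypothetical stand-ins for the two spliced connections.
    sender, proxy_in = socket.socketpair()      # origin sender <-> proxy
    proxy_out, receiver = socket.socketpair()   # proxy <-> receiver
    sender.sendall(b"hello from the origin sender")
    sender.close()
    relay(proxy_in, proxy_out)
    proxy_in.close()
    proxy_out.close()
    chunks = []
    while True:
        piece = receiver.recv(4096)
        if not piece:
            break
        chunks.append(piece)
    receiver.close()
    return b"".join(chunks)
```

A real proxy would serve many connections concurrently and would forward data as it arrives rather than after the sender closes; the sketch only makes the per-chunk copies visible.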
TCP splicing is a double-edged sword: while it can increase the speed of data delivery, it also adds latency at the proxy. At a TCP proxy, the largest component of latency is the data transfer between the input stream and the output stream.
Content caches are widely used to further improve the speed of data delivery in Web applications. For the same reason, content caches are often added to TCP proxies.
With content caching, there are three types of data transfer at a TCP proxy: from an input stream to an output stream, from a content cache file to an output stream, and from an input stream into a content cache file. It is critical to minimize the latency associated with all three types.
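Two of these transfers can happen in a single pass: data arriving on the input stream is forwarded to the output stream and written into a cache file at the same time. The sketch below is a minimal illustration, assuming a hypothetical helper named tee_to_cache and local socket pairs in place of real TCP connections. It reuses one preallocated buffer and memoryview slices so that no additional user-space copies are made beyond the initial read.

```python
import socket
import tempfile


def tee_to_cache(src, dst, cache_file, bufsize: int = 65536) -> int:
    """Forward src to dst and store the same bytes in cache_file."""
    buf = bytearray(bufsize)        # one reusable user-space buffer
    view = memoryview(buf)
    total = 0
    while True:
        n = src.recv_into(buf)      # fill the existing buffer in place
        if n == 0:                  # end of the input stream
            break
        dst.sendall(view[:n])       # slicing a memoryview does not copy
        cache_file.write(view[:n])  # same bytes go into the cache file
        total += n
    return total


def demo_tee():
    sender, proxy_in = socket.socketpair()
    proxy_out, receiver = socket.socketpair()
    payload = b"cacheable response body"
    sender.sendall(payload)
    sender.close()
    with tempfile.TemporaryFile() as cache_file:
        tee_to_cache(proxy_in, proxy_out, cache_file)
        proxy_in.close()
        proxy_out.close()
        cache_file.seek(0)
        cached = cache_file.read()
    received = b""
    while True:
        piece = receiver.recv(4096)
        if not piece:
            break
        received += piece
    receiver.close()
    return received, cached
```

Each chunk still crosses the kernel boundary once on the way in and twice on the way out, which is the overhead that zero-copy mechanisms aim to remove.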
The performance problem in data transfer is rooted in the extra data copying that is added to the straight transfer. The extra copying is often introduced by the OS (operating system) when it switches context between kernel space and user space, or for other performance and security reasons. The industry has put considerable effort into minimizing the excess copying, and many zero-copy solutions have become available.
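One widely available zero-copy facility is the sendfile system call, which Python exposes as os.sendfile on Unix-like systems. The sketch below applies it to the transfer from a cache file to an output stream, so the file contents are handed to the socket inside the kernel without passing through a user-space buffer. The helper names send_cached and demo_sendfile are hypothetical, the example assumes a Linux-like platform, and it is one possible technique rather than the method of any specific product.

```python
import os
import socket
import tempfile


def send_cached(cache_path: str, out_sock: socket.socket) -> int:
    """Send a whole cache file to out_sock with os.sendfile."""
    sent_total = 0
    with open(cache_path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent_total < size:
            # The kernel moves the bytes from the file to the socket;
            # no user-space buffer is involved.
            sent = os.sendfile(out_sock.fileno(), f.fileno(),
                               sent_total, size - sent_total)
            if sent == 0:           # nothing more could be sent
                break
            sent_total += sent
    return sent_total


def demo_sendfile() -> bytes:
    payload = b"cached object " * 100
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    proxy_out, receiver = socket.socketpair()
    try:
        send_cached(path, proxy_out)
        proxy_out.close()
        chunks = []
        while True:
            piece = receiver.recv(4096)
            if not piece:
                break
            chunks.append(piece)
        return b"".join(chunks)
    finally:
        receiver.close()
        os.unlink(path)
```

The input side of sendfile must be a file that supports memory mapping, so this call covers the cache-to-stream transfer but not a transfer between two streams.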
Copying in stream mode is tricky because there is no clear boundary between data chunks. Since buffer space is always limited, earlier data can be overwritten by later data. Reading from a stream can cause previously read data to be flushed from memory; in other words, stream-mode reads can be destructive. Furthermore, if a piece of read data is to be used for another purpose, it has to be copied. However, copying between memory units of disparate speeds is an expensive operation. Copying between kernel space and user space is also expensive because of context switching.
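The destructive nature of stream reads can be observed on an ordinary socket. In the sketch below, a read with the standard MSG_PEEK flag leaves the data in the kernel buffer, while a normal read consumes it, so a later read finds nothing. The function name demo_peek is hypothetical, and a local socket pair stands in for a network stream.

```python
import socket


def demo_peek():
    writer, reader = socket.socketpair()
    writer.sendall(b"abc")
    # Non-destructive: the bytes stay in the kernel receive buffer.
    peeked = reader.recv(3, socket.MSG_PEEK)
    # Destructive: the same bytes are returned and then discarded.
    first = reader.recv(3)
    # A further read finds the buffer empty.
    reader.setblocking(False)
    try:
        second = reader.recv(3)
    except BlockingIOError:
        second = None
    writer.close()
    reader.close()
    return peeked, first, second
```

Once the normal read has returned, the only remaining copy of the data is the one in the caller's buffer, which is why reusing it elsewhere requires keeping or copying that buffer.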
Beyond reducing data copying, a content cache or a stream-mode data-transfer device should be further optimized to reduce hardware-related processing latency. Today, as silicon clock speeds have reached a plateau, most computers are built using MPSoC (multi-processor system on chip) devices. In such a system, there are multiple cores or CPUs (central processing units) on the same chip, and all CPUs share the same on-chip cache and main memory.
As a rule, on-chip caches are organized in levels of decreasing speed and increasing size. If a CPU needs to access a block of data or instructions, it goes to the fastest L1 (level 1) cache first. If there is a cache miss, it has to go down the memory hierarchy to the L2 through L4 (level 2 through level 4) caches, which are progressively slower and larger.
Because on-chip memory is shared, control and management in a multi-core system are much more complicated than in a single-processor system. For example, multiple CPUs may contend for the same cache or memory location. In addition, the controller must keep the copies of a shared resource held in local caches consistent. These and other issues make a cache miss very expensive in terms of latency. To minimize latency, a new approach to writing code for MPSoC devices has emerged, in which the priority is to maximize cache hits.
The demand for accelerating data transfer in stream mode is strong and persistent in the Internet era. Therefore, there is a need to minimize data copying and the associated latency for applications that require stream-mode data transfer, with possible caching, on multi-core computers.