One application area for servers is to provide streaming data over the Internet or over a corporate intranet. The streaming data may consist of multimedia data (e.g., audio/video), character streams (e.g., stock quotes), etc. On-demand content refers to the streaming of specific client-requested data files, whereas broadcast content refers to content delivered from a source, such as an encoder system, over an incoming network link and streamed out to thousands of clients over a set of outgoing network links.
It has been found through measurements that currently used multiprocessor servers have poor SMP (Symmetric Multi-Processing) scalability. In most cases, the processor is the performance limiter. In such processor-bound cases, using processors having 2 MB L2 caches, the following was noted:
A very high CPI (Clocks per Instruction retired) was noted, with measured values between 4.0 and 6.0, which is considerably above the average for most server applications.
A very high L2 MPI (Level 2 Cache Misses per Instruction retired) was noted, with measured L2 MPIs in the range of 2% to 4%. This indicates that, on average, about 3 out of every 100 instructions result in an L2 miss.
A saturated front-side bus was noted. That is, performance counters show that the data bus is actively transferring data 40% of the time. When accounting for the read/write transaction mix, bus efficiency, and MP arbitration, this indicates that the front-side bus is close to being saturated.
The raw data bandwidth requirements for streaming are typically much lower than the capabilities of the system and the I/O buses. Thus, the observed saturation of the bus clearly indicates that there is a large overhead of unnecessary data transfers.
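The three metrics noted above are simple ratios of raw performance-counter totals. The following is a minimal sketch of that arithmetic, not the measurement tooling actually used; the field names and the sample counts are hypothetical and were chosen only so that the results land inside the ranges quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counter totals for one sampling interval (hypothetical field names)."""
    clocks: int                 # processor clock cycles elapsed
    instructions_retired: int   # instructions retired in the interval
    l2_misses: int              # L2 cache misses in the interval
    bus_data_busy_clocks: int   # bus clocks during which data was transferred
    bus_clocks: int             # total bus clocks in the interval


def cpi(s: CounterSample) -> float:
    """Clocks per Instruction retired."""
    return s.clocks / s.instructions_retired


def l2_mpi(s: CounterSample) -> float:
    """L2 cache Misses per Instruction retired, as a fraction."""
    return s.l2_misses / s.instructions_retired


def bus_data_utilization(s: CounterSample) -> float:
    """Fraction of bus clocks spent actively transferring data."""
    return s.bus_data_busy_clocks / s.bus_clocks


# Made-up totals for illustration only; these are not measurements.
sample = CounterSample(
    clocks=5_000_000,
    instructions_retired=1_000_000,
    l2_misses=30_000,
    bus_data_busy_clocks=400_000,
    bus_clocks=1_000_000,
)

print(f"CPI                  = {cpi(sample):.1f}")
print(f"L2 MPI               = {l2_mpi(sample):.1%}")
print(f"L2 misses per 100 in = {100 * l2_mpi(sample):.1f}")
print(f"Bus data utilization = {bus_data_utilization(sample):.0%}")
```

Note that the utilization figure counts only the clocks in which data is actually on the bus; as stated above, transaction mix, bus efficiency, and MP arbitration overhead must be accounted for separately before concluding that the bus is saturated.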
It has been found that the poor scalability of such systems is due to the following factors:
Interrupt/DPC (Deferred Procedure Call) Migration: Hardware interrupts from the NICs (Network Interface Cards) are routed to any available processor by the OS (Operating System). The DPC handler, which is set up by the OS, is in turn executed by some other processor.
Loosely-coupled Connection Processing: Client connections are processed by different processors during connection lifetimes. The same processors process both input and output streams.
Thread and Buffer Migration: The threads of the server process are not bound to any specific processor. Thus, during the course of transferring data between input and output buffers, the server thread runs on different processors at different times. This migration of threads leads to the pollution of the processor caches, since the same buffer ping-pongs around between processors.
Inefficient L2 Cache Utilization: The large L2 processor caches are not properly utilized. Streaming data is non-temporal in nature; that is, the data is used only once and never used again. Caching this data therefore serves no useful purpose, yet the data is still loaded into the L2 caches. Moreover, since the threads write to the data buffers in order to extract/append network protocol headers, dirty write-backs occur when the next incoming buffer is accessed by a processor.
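The effect of non-temporal streaming data on a cache can be illustrated with a toy model. The sketch below assumes a fully associative cache with strict LRU replacement, which is a simplification (real L2 caches are set-associative and need not be strictly LRU), and it does not model dirty write-backs. The class, the function, and all sizes are hypothetical and serve only to show how single-use stream lines can evict a small working set that would otherwise be reused.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU cache that tracks line tags only (no data)."""

    def __init__(self, capacity_lines: int):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # least recently used entry comes first

    def access(self, tag) -> bool:
        """Touch one cache line; return True on a hit, False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        self.lines[tag] = None           # miss: allocate the line
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        return False


def working_set_hit_rate(stream_lines_per_round: int, rounds: int = 50,
                         capacity: int = 1024, working_set: int = 256) -> float:
    """Hit rate on the reusable working set when each round also streams
    `stream_lines_per_round` cache lines that are never touched again."""
    cache = LRUCache(capacity)
    next_stream_tag = 0
    hits = accesses = 0
    for _ in range(rounds):
        # Reusable data (e.g. connection state) touched every round.
        for i in range(working_set):
            accesses += 1
            if cache.access(("ws", i)):
                hits += 1
        # Single-use streaming data: every line is new and never reused.
        for _ in range(stream_lines_per_round):
            cache.access(("stream", next_stream_tag))
            next_stream_tag += 1
    return hits / accesses


print("no streaming traffic :", working_set_hit_rate(0))
print("stream exceeds cache :", working_set_hit_rate(2048))
```

In this model, the working set stays resident when no stream data passes through the cache, but once each round streams more lines than the cache holds, every working-set access misses, even though the stream lines themselves never produce a single hit.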