One architecture commonly used for network devices, depicted in FIG. 1, is a single device interface with a centralized shared memory scheme, in which a CPU and a network device communicate via a shared memory. Various schemes are used to manage this interface. A very common and popular scheme is a set of descriptor rings, held in the shared memory and accessed by both the device and the CPU. For example, in receive mode, the CPU initializes each ring entry with a buffer address and length and sets the OWN bit to give the device ownership of the entry.
The device polls the descriptor ring, and when a packet is received, the next descriptor indicates where to put the data. After a block of data is received, the device updates the descriptor with the received length and clears the OWN bit. The CPU checks the descriptor entry (possibly driven by an interrupt) and, if the OWN bit is clear, uses the status value stored in the descriptor to process the received buffer. Usually, the CPU keeps a shadow copy of the descriptor ring to hold management information, such as the metadata of the buffers.
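The receive handshake described above can be sketched as a small software model. This is only an illustration under stated assumptions: the names `RxDescriptor`, `RxRing`, `device_receive`, and `cpu_poll` are hypothetical, the OWN bit is assumed to be set while the device owns an entry, and a packet longer than the buffer is simply truncated rather than spread over several scatter/gather entries.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

OWN = 0x1  # assumed encoding: bit set means the device owns the entry


@dataclass
class RxDescriptor:
    buf_addr: int    # buffer address, written by the CPU
    buf_len: int     # buffer capacity, written by the CPU
    flags: int = 0   # holds the OWN bit
    rx_len: int = 0  # received length, written by the device


class RxRing:
    def __init__(self, buf_addrs: List[int], buf_len: int) -> None:
        # The CPU initializes every entry and hands it to the device.
        self.ring = [RxDescriptor(a, buf_len, OWN) for a in buf_addrs]
        self.dev_idx = 0  # next entry the device will fill
        self.cpu_idx = 0  # next entry the CPU will inspect

    def device_receive(self, length: int) -> bool:
        """Device side: store a packet in the next entry it owns."""
        d = self.ring[self.dev_idx]
        if not d.flags & OWN:
            return False  # CPU still owns this entry, so no buffer is free
        d.rx_len = min(length, d.buf_len)  # simplification: truncate
        d.flags &= ~OWN                    # hand the entry to the CPU
        self.dev_idx = (self.dev_idx + 1) % len(self.ring)
        return True

    def cpu_poll(self) -> Optional[Tuple[int, int]]:
        """CPU side: return (buffer address, length) if an entry is ready."""
        d = self.ring[self.cpu_idx]
        if d.flags & OWN:
            return None  # device has not completed this entry yet
        result = (d.buf_addr, d.rx_len)
        d.rx_len = 0
        d.flags |= OWN   # re-arm the entry for the device
        self.cpu_idx = (self.cpu_idx + 1) % len(self.ring)
        return result
```

In this model both sides read and write the same `flags` field, which is the shared access that the OWN-bit scheme requires.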
The transmit side is usually similar to the receive side, except that the CPU has to revisit the ring to process the completed transmission of the packet.
This scheme has been used for many devices, but has some drawbacks:
Because the OWN bit is used to indicate ring entry ownership, the CPU and device have to read and write the same memory. Thus, the memory cannot be cached by the CPU without performing an invalidate for every access to the ring.
When the descriptor is read by the CPU, it contains data that is not used by the CPU, i.e., the buffer address and length. Thus the amount of data to be read by the CPU in processing the ring is larger than it need be.
To reduce the cost of reading the ring, sometimes the CPU can access the descriptors via a cached view of the memory, and also prefetch the data. By having larger descriptors, the number of descriptors read in each cache line is reduced.
Because each descriptor is based on a scatter/gather buffer, several ring descriptor entries may have to be processed for each packet.
Often the receive and transmit rings have similar descriptors, but this does not need to be the case.
Although this scheme has been in common use, it is not optimal, especially as network interfaces have become faster and CPUs have not kept up with processing packets through a standard ring descriptor scheme. Another factor is that more modern CPUs generally have caches that allow data to be processed in chunks, support write posting (where I/O writes proceed without stalling the CPU), and support cache prefetching, which allows data to be fetched early without stalling the CPU. Some newer CPUs are even I/O cache coherent, which means that when a device accesses the same memory as the CPU, the corresponding cache lines in the CPU are automatically invalidated.
Another issue is preventing one fast interface from monopolizing all available resources when only a single interface is used to multiplex tx/rx (transmit/receive) streams from multiple line cards and interfaces.
Accordingly, a shared-memory scheme that utilizes CPU resources in a more efficient way and that can avoid interface monopolizing is required.
Another bottleneck with existing systems is that a typical packet transmission involves a CPU writing one or more buffer addresses and control information into one or more transmit descriptors of the controller chip. After notification of the new data, the controller chip reads the data from the buffer and transmits it on the egress path. After transmission, the controller updates the descriptor to indicate completion of the operation to the CPU, which cleans up and prepares for the next transmission.
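The transmit sequence above, including the extra pass the CPU makes to clean up completed entries, can be modeled roughly as follows. This is a sketch under assumptions, not the interface of any particular controller: the names `TxDescriptor`, `TxRing`, `cpu_post`, `device_transmit`, and `cpu_reclaim` are hypothetical, and an OWN bit is assumed to mark entries that the device has yet to transmit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

OWN = 0x1  # assumed encoding: bit set means the device owns the entry


@dataclass
class TxDescriptor:
    buf_addr: int = 0  # buffer address, written by the CPU
    length: int = 0    # control information (here only a length)
    flags: int = 0     # holds the OWN bit


class TxRing:
    def __init__(self, size: int) -> None:
        self.ring = [TxDescriptor() for _ in range(size)]
        self.post_idx = 0   # next entry the CPU fills
        self.dev_idx = 0    # next entry the device transmits
        self.clean_idx = 0  # next entry the CPU reclaims
        self.in_flight = 0  # posted but not yet reclaimed

    def cpu_post(self, buf_addr: int, length: int) -> bool:
        """CPU writes a descriptor and hands it to the device."""
        if self.in_flight == len(self.ring):
            return False  # ring full until the CPU reclaims entries
        d = self.ring[self.post_idx]
        d.buf_addr, d.length = buf_addr, length
        d.flags |= OWN
        self.post_idx = (self.post_idx + 1) % len(self.ring)
        self.in_flight += 1
        return True

    def device_transmit(self) -> Optional[Tuple[int, int]]:
        """Device reads the next descriptor, sends it, marks completion."""
        d = self.ring[self.dev_idx]
        if not d.flags & OWN:
            return None  # nothing posted
        sent = (d.buf_addr, d.length)
        d.flags &= ~OWN  # completion indication for the CPU
        self.dev_idx = (self.dev_idx + 1) % len(self.ring)
        return sent

    def cpu_reclaim(self) -> List[int]:
        """CPU revisits the ring and frees buffers of completed entries."""
        freed = []
        while self.in_flight and not self.ring[self.clean_idx].flags & OWN:
            freed.append(self.ring[self.clean_idx].buf_addr)
            self.clean_idx = (self.clean_idx + 1) % len(self.ring)
            self.in_flight -= 1
        return freed
```

The `cpu_reclaim` pass is the revisit of the ring that the receive side does not need; every packet costs the CPU one descriptor write and one later descriptor read.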
A system, designed by the assignee of the present application, utilizes a Broadcom 1250 CPU connected to an HT-FPGA (Field Programmable Gate Array) over the HyperTransport® (HT) bus as one of its egress paths. The HT-FPGA connects the line cards to the HT interface of the processor. It is responsible for delivering the packets from the line cards into the processor packet memory in the ingress direction and pulling the packets off the memory and transmitting them to the line cards in the egress direction. It is also responsible for handling the line card egress flow control.
A transmit descriptor ring is used to pass packet buffers for transmission by the HT-FPGA. The HT-FPGA reads the descriptor ring to get the pointer to the buffer for transmission, then programs CPU resources to perform the data reads and sends the data on the egress path. Data transmission and manipulation of the descriptor ring have to be done over the HT bus. The HT bus is very inefficient for read operations but very efficient for write operations.
The inefficiency of read operations on the HT bus stems from two factors: 1) memory accesses have to be tightly coupled to the HT read command from an external device, and 2) the number of outstanding HT transactions supported at any given time is limited.
Tightly coupling memory accesses to the HT read command reduces the efficiency of read operations because the requesting device must wait for a response from the target device, which increases latency. In addition, because the number of supported transactions is limited, HT read transactions can be issued only while support for a new transaction is available.
HT write operations, on the other hand, are very efficient for several reasons: HT writes are posted, so writes can be overlapped for efficient pipelining; more HT write transactions than HT read transactions can be outstanding at any time, because of the amount of logic required to support read transactions; and having more transactions in process at any time makes HT write transactions more efficient than HT read transactions through increased pipelining.
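The effect of a limited number of outstanding transactions can be illustrated with a simple window-limited throughput bound: if at most a fixed number of transactions can be in flight and each one occupies a slot for a full round trip, sustained throughput cannot exceed the bytes in flight divided by the round-trip time. The function name `max_throughput_mb_s` and every numeric value in the usage note are assumptions chosen for illustration, not measured HT figures.

```python
def max_throughput_mb_s(outstanding: int, payload_bytes: int,
                        round_trip_ns: int, link_mb_s: float) -> float:
    """Upper bound on sustained throughput in MB/s (10**6 bytes/s).

    outstanding   -- transactions that may be in flight at once
    payload_bytes -- data carried by each transaction
    round_trip_ns -- time a transaction occupies its slot
    link_mb_s     -- raw link capacity, which caps the result
    """
    # bytes per nanosecond times 1000 gives bytes per microsecond,
    # which is numerically equal to MB/s
    window_limit = outstanding * payload_bytes * 1000 / round_trip_ns
    return min(window_limit, link_mb_s)
```

With hypothetical values of 64-byte payloads, a 1000 ns round trip, and an 800 MB/s link, 4 outstanding transactions give a bound of 256 MB/s, whereas 16 outstanding transactions are enough to reach the 800 MB/s link limit. Under these assumptions, the side of the bus that can keep more transactions in flight is the one that comes closer to using the link.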
One approach to taking advantage of the write efficiency of the HT bus is to write the transmit data directly to the HT-FPGA. This can be done in two ways.
A first approach is where the CPU writes the data directly to the HT-FPGA. This approach is very CPU intensive and not desirable as the CPU is not doing useful work. Also there is a limit to the number of writes that can be posted by the CPU.
A second approach is to use a DMA (Direct Memory Access) engine, acting as a data mover, to transfer the data to the HT-FPGA over the HT bus. Even though the second approach is faster, the CPU still has to program the data mover and maintain its descriptor rings. Programming the data mover involves device write cycles, which are slower than memory write cycles that are cached and posted. It also involves handling an extra interrupt from the data mover, which wastes time in a context switch.
Accordingly, improved techniques are needed for transmitting packet data without wasting precious CPU cycles.
Another waste of precious CPU cycles occurs during a typical high level packet flow control process, which involves the CPU receiving xon/xoff flow control status from an interface by using either polling, interrupt, or event messaging techniques. The CPU then writes the information to an xon/xoff table for use by the software packet transmit routine.
Each entry in the xon/xoff table represents the packet transmit status for a specific interface. A transmit routine checks the xon/xoff table for a specific interface prior to packet transmission. If the interface table entry indicates an xon status, the packet is transmitted. If the interface table entry indicates an xoff status, the packet is placed in a holding queue until the interface entry status has been updated to indicate an xon status. When transitioning an interface table entry from an xoff to an xon status, the CPU must check whether packets are in the holding queue awaiting transmission to the interface. Any such packets are then placed back on the transmit queue to be transmitted to the interface.
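The software flow control path described above can be sketched as follows. The class and method names (`FlowControlledTx`, `enqueue`, `transmit_pending`, `set_status`) are hypothetical, one holding queue per interface is an assumed layout, and the `sent` list merely stands in for handing a packet to the hardware.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class FlowControlledTx:
    def __init__(self, num_interfaces: int) -> None:
        # xon/xoff table: one entry per interface, True means xon
        self.xon: List[bool] = [True] * num_interfaces
        self.tx_queue: Deque[Tuple[int, Any]] = deque()
        # assumed layout: one holding queue per interface
        self.holding: List[Deque[Any]] = [deque() for _ in range(num_interfaces)]
        self.sent: List[Tuple[int, Any]] = []  # stand-in for the egress path

    def enqueue(self, iface: int, packet: Any) -> None:
        self.tx_queue.append((iface, packet))

    def transmit_pending(self) -> None:
        """Transmit routine: check the table before sending each packet."""
        while self.tx_queue:
            iface, packet = self.tx_queue.popleft()
            if self.xon[iface]:
                self.sent.append((iface, packet))
            else:
                self.holding[iface].append(packet)  # wait for xon

    def set_status(self, iface: int, xon: bool) -> None:
        """CPU updates the table from polled or signalled status."""
        was_xoff = not self.xon[iface]
        self.xon[iface] = xon
        if xon and was_xoff:
            # xoff -> xon: move held packets back to the transmit queue
            held = self.holding[iface]
            while held:
                self.tx_queue.append((iface, held.popleft()))
```

Every step in this model, the table lookup per packet, the table update, and the requeueing of held packets, runs on the CPU.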
This typical packet flow control technique is CPU intensive and requires many CPU cycles to implement the polling and updating of the xon/xoff table entries. A technique that utilizes fewer precious CPU cycles would be valuable.