Designs for networking adapters commonly face the challenge of inserting a partial word (e.g., a 16-bit IP checksum) into packets held in buffers that are aligned to the bus width (e.g., 64 bits, as in the case of an 8× PCI Express interface; an example of such an interface can be found at www.pcisig.com, PCI Express Base spec. Rev 1.0a). This is frequently required in hardware logic that implements a “checksum offload” feature. In many conventional designs (e.g., U.S. Pat. No. 5,898,713, “IP Checksum Offload”, Melzer et al., wherein IP checksum computation is offloaded to a control unit to reduce the processor cycles consumed by the host, thereby improving the performance of the host computer and network), the hardware logic is required to insert the partial word at any specified offset within the packet; this insert position in the buffers may be odd or even.
A conventional method for undertaking such a partial word write uses a shifter that employs sixteen 1:8 demultiplexers, with the lower-order 3 bits of the offset (i.e., the least significant 3 bits of the specified checksum position within the packet, e.g., chksum_pos(2:0)) acting as the “select” lines that determine the amount of shift. The remaining higher-order bits of the offset act as an address into the buffer, which is written with byte enables. One problem with this method is that it is highly logic-intensive and also reduces the frequency of operation, since the demultiplexers are inserted directly in the critical data path. Further, in the absence of byte enables at the buffer interface, the design requires a read-modify-store approach, which further increases the latency.
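The offset split described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the logic of any cited design: it assumes a 64-bit buffer word with little-endian byte lanes (byte 0 occupies bits 7:0), and the function name and returned tuple are hypothetical. It covers only the case where both checksum bytes fall within one word.

```python
WORD_BYTES = 8  # 64-bit buffer word, as in the example in the text


def conventional_partial_write(chksum_pos, checksum):
    """Model a 16-bit partial write that fits inside one 64-bit word.

    Returns (word_addr, data, byte_enables). All names are hypothetical.
    """
    byte_lane = chksum_pos & 0x7   # chksum_pos(2:0): selects the shift amount
    word_addr = chksum_pos >> 3    # higher-order bits: buffer word address
    if byte_lane == WORD_BYTES - 1:
        # The checksum would straddle two words; not handled by this sketch.
        raise ValueError("checksum crosses a word boundary")
    # Assumption: lower checksum byte goes to the lower-numbered byte lane.
    data = (checksum & 0xFFFF) << (8 * byte_lane)
    byte_enables = 0b11 << byte_lane  # one enable bit per byte lane
    return word_addr, data, byte_enables
```

For example, under these assumptions, a checksum position of 10 selects word address 1 and byte lanes 2 and 3 of that word. In hardware, the shift modeled by the `<<` operator is what the demultiplexers in the data path implement.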
Another key drawback of the above-noted conventional approach is that when the checksum has to be inserted across a buffer word boundary (e.g., checksum position = 7), the words to be written into the packet buffer have to be computed separately (i.e., the lower checksum byte is written at byte 7 of one word address, and in the next cycle the upper checksum byte is written at byte 0 of the next word address). Accordingly, the logic needs a separate multiplexer to select the data for these two write cycles.
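The boundary case can be sketched in the same way. The model below is illustrative only and assumes a 64-bit word with little-endian byte lanes (byte 0 occupies bits 7:0); the function name and the (address, data, byte enables) tuple layout are hypothetical. It shows why two distinct data words, and hence an extra data selection step, are needed.

```python
def boundary_write_cycles(chksum_pos, checksum):
    """Model the two write cycles needed when the checksum starts at byte 7.

    Returns a list of (word_addr, data, byte_enables), one entry per cycle.
    """
    byte_lane = chksum_pos & 0x7   # chksum_pos(2:0)
    word_addr = chksum_pos >> 3    # higher-order bits: buffer word address
    if byte_lane != 7:
        raise ValueError("checksum does not cross a word boundary")
    low_byte = checksum & 0xFF
    high_byte = (checksum >> 8) & 0xFF
    # Cycle 1: lower checksum byte into byte 7 of the current word.
    first = (word_addr, low_byte << 56, 0x80)
    # Cycle 2: upper checksum byte into byte 0 of the next word.
    second = (word_addr + 1, high_byte, 0x01)
    return [first, second]
```

The two cycles carry differently positioned data, which corresponds to the separate multiplexer mentioned in the text.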
Existing High Performance Computing (HPC) networking adapters tend to require hardware that performs operations with low latency and with low consumption of logic cells while running at high frequency. Such optimization is particularly important for FPGA implementations, where a frequency of operation of 250 MHz is typical for supporting the high throughput requirements of a GigaEthernet interface. Conventional arrangements of the kind discussed above are not adequate to meet such demands. Accordingly, a strong and compelling need has been recognized in connection with improving upon the performance of conventional arrangements and implementing a system that can meet demands of the type just described.