This invention relates to a network interface device including an accelerator unit and a data processing system comprising such a network interface device.
Achieving the right balance between the functionality/performance of a network interface device and power/cost considerations has long been the subject of debate, particularly in terms of the choice as to which aspects of the communication and other protocols that might operate over the device should be accelerated in hardware at the network interface device. Such acceleration functions are referred to as “offloads” because they offload processing that would otherwise be performed at the CPU of the host system onto the network interface device.
Usually the offload is chosen to be a specific function of the network protocol stack that is amenable to hardware acceleration. Typically, this includes the data integrity aspects of a protocol such as TCP/IP checksums, iSCSI CRC digests, or hashing or lookup operations such as the parsing of data flows onto virtual interface endpoints. Whether or not a particular function of a network protocol is amenable to hardware acceleration depends on several factors, which will now be discussed.
Whether or not a function may be performed based solely on the contents of an individual network packet. This property is termed ‘stateless’ when applied to an offload. A stateless offload requires little local storage at the network interface. For example, TCP/IP checksum insertion on transmission requires buffering of a single Ethernet frame. In contrast, a stateful operation may require the interface to store state relative to a large number of network flows over a large number of network packets. For example, an Ethernet device that performs reassembly of TCP/IP flows into units which are larger than the MSS (Maximum Segment Size) would be required to track many thousands of packet headers. Stateful protocol offloads can therefore require the network interface to have significant amounts of fast memory, which is both expensive and power hungry.
Whether or not a function may be directly implemented in parallel logic operating over a single or small number of passes of the data contained within the network packet. This property is termed ‘tractable’. For example, the AES GCM cryptographic algorithm has been designed such that the internal feedback loop may be ‘unrolled’ when implemented. This enables a hardware designer to scale an AES GCM engine's performance (bandwidth) by simply adding more gates in silicon, which by Moore's Law can be readily accommodated as higher speeds are required. In contrast, the Triple-DES cryptographic algorithm may not be unrolled into parallel hardware. This requires an implementation to iterate repeatedly over the data. In order to improve the performance of an iterative algorithm, the implementation must scale in clock frequency, which is becoming increasingly difficult on silicon-based processes. Being intractable, iterative algorithms are more difficult to implement as hardware offloads.
Whether or not a protocol function has been designed for hardware execution. Generally, the specification of a hardware protocol will be unambiguous and strictly versioned. For example, Ethernet line encodings are negotiated at link bring-up time and, once settled upon, are strictly adhered to. Changing encoding requires a re-negotiation. By contrast, the TCP protocol, which has not been specifically designed for execution in hardware, is specified by many tens of RFCs (Requests For Comments). These specifications often present alternative behaviours, and are sometimes conflicting, but together define the behaviour of a TCP endpoint. A very basic TCP implementation could be made through adherence to a small number of the RFCs, but such a basic implementation would not be expected to perform well under challenging network conditions. More advanced implementations of the TCP protocol require adherence to a much larger number of the RFCs, some of which specify complex responses or algorithms that are to operate on the same wire protocol and that would be difficult to implement in hardware. Software-oriented specifications are also often in a state of continued development, which is sometimes achieved without strict versioning. As such, software-oriented specifications are usually best expressed in high level programming languages such as C, which cannot be easily parallelized and converted to a hardware logic representation.
Whether or not a function is well known and commonly used enough for it to be considered for implementation in a commercial network interface device. Often, application specific functions (such as normalisation of stock exchange data feeds) are only known to practitioners of their field and are not widely used outside of a few companies or institutions. Since implementing a function in silicon is tremendously expensive, it might not be commercially viable to implement in hardware those functions whose use is limited to a small field.
In summary, features that are typically chosen to be implemented as offloads in hardware are those which are stateless, tractable, hardware oriented, well known and commonly used.
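As a purely illustrative sketch of the stateless/stateful distinction drawn above (it is not part of any device described here), the Python below computes the 16-bit ones' complement Internet checksum used by TCP/IP from the bytes of a single packet alone, and contrasts it with a hypothetical `FlowReassembler` whose storage grows with the number of flows and buffered segments. The class, its method names and the flow-key layout are assumptions introduced only for this example.

```python
def internet_checksum(data: bytes) -> int:
    """Stateless: the result depends only on the bytes of one packet."""
    if len(data) % 2:
        data += b"\x00"              # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


class FlowReassembler:
    """Stateful (hypothetical): must remember segments for every active flow."""

    def __init__(self):
        self.flows = {}              # flow key -> {sequence number: payload}

    def add_segment(self, flow, seq, payload):
        self.flows.setdefault(flow, {})[seq] = payload

    def state_bytes(self):
        # Payload bytes currently held; grows with flows and segments.
        return sum(len(p) for segs in self.flows.values() for p in segs.values())


if __name__ == "__main__":
    print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))
    r = FlowReassembler()
    r.add_segment(("10.0.0.1", 80), 0, b"abcd")
    r.add_segment(("10.0.0.2", 80), 0, b"xy")
    print(r.state_bytes())
```

The checksum function needs no memory beyond the frame being processed, whereas the reassembler's dictionary is the kind of per-flow state that would have to be held in fast memory at the network interface.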
Unfortunately, there are a number of functions which do not meet these criteria and yet, being performance-sensitive, greatly benefit from being accelerated in hardware offloads. For example, in the Financial Services sector it is often the case that large numbers of data feeds must be aggregated together and normalised into a unified data model. This normalisation process would typically unify the feed data into a database by, for example, time representation or stock symbol representation, which would require hundreds of megabytes of data storage to implement in hardware. Other niche application spaces that greatly benefit from being accelerated in hardware offloads include: event monitoring equipment in high energy particle colliders, digital audio/video processing applications, and in-line cryptographic applications.
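To make the feed normalisation mentioned above concrete, the following Python is a minimal software sketch of unifying two feeds into one data model by time representation and stock symbol representation. Both feed formats, the field names (`time`, `sym`, `px`, `epoch_ms`, `ticker`, `price_minor`), the symbol tables and the `Tick` record are hypothetical and chosen only for illustration; real exchange feeds and any hardware implementation would differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Tick:
    symbol: str        # unified symbol representation
    ts_ns: int         # unified time: nanoseconds since the Unix epoch (UTC)
    price_minor: int   # price in minor currency units (e.g. pence)


# Hypothetical per-feed symbol tables mapping local codes to unified symbols.
SYMBOLS_A = {"VOD.L": "VOD"}
SYMBOLS_B = {"VODl": "VOD"}


def normalise_feed_a(rec: dict) -> Tick:
    # Feed A (assumed): ISO 8601 timestamp with offset, decimal price string.
    delta = datetime.fromisoformat(rec["time"]) - EPOCH
    ts_ns = (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000
    return Tick(SYMBOLS_A[rec["sym"]], ts_ns, int(Decimal(rec["px"]) * 100))


def normalise_feed_b(rec: dict) -> Tick:
    # Feed B (assumed): epoch milliseconds, price already in minor units.
    return Tick(SYMBOLS_B[rec["ticker"]], rec["epoch_ms"] * 10**6, rec["price_minor"])


if __name__ == "__main__":
    a = normalise_feed_a({"time": "2000-01-01T00:00:00+00:00", "sym": "VOD.L", "px": "123.45"})
    b = normalise_feed_b({"epoch_ms": 946684800000, "ticker": "VODl", "price_minor": 12345})
    print(a == b)
```

Even in this toy form the symbol tables are per-feed lookup state; scaled to many feeds and instruments, tables of this kind are one plausible contributor to the storage requirement noted above.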
Often the hardware suitable for accelerating protocol functions in such niche application spaces does not exist because it is simply not commercially viable to develop. In other cases, bespoke network interface hardware has been developed which implements the application specific offloads required but at significant cost, such as with the Netronome Network Flow Engine NFE-3240. Additionally, many bespoke hardware platforms lag significantly behind the performance of commodity silicon. For instance, 40 Gb/s Ethernet NICs are now available and the shift to 100 Gb/s commodity products is quickly approaching, yet most bespoke NICs based upon an FPGA are only capable of 1 Gb/s.
To give an example, the hardware offloads for a normalisation process in the Financial Services sector would typically be implemented at a NIC based upon an FPGA (Field-Programmable Gate Array) controller that includes the features of a regular network interface as well as the custom offloads. This requires the FPGA controller to define, for instance, the Ethernet MACs and PCIe core, as well as the custom offload engines. Such a NIC would typically be provided with a set of bespoke drivers that provide a host system with access to the hardware offloads of the FPGA. This implementation strategy is problematic because the speed and quality of FPGA chips for NICs are not keeping pace with the innovation of commodity NICs that use application specific integrated circuits (ASICs). In fact, the design and implementation of the PCIe core is often the rate-determining factor in bringing a custom controller to market and FPGA vendors typically lag the commodity silicon designs by a year.
Furthermore, the problem is becoming more acute as systems become more integrated and demand that NICs offer more commodity features such as receive-side scaling (RSS), support for multiple operating systems, network boot functions, sideband management, and virtualisation acceleration (such as the hardware virtualisation support offered by the PCI-SIG I/O Virtualisation standards). This is being driven by the increasing use of virtualisation in server environments and data centres, and, in particular, the increasing use of highly modular blade servers.
A data processing system 100 is shown in FIG. 1 of the type that might be used in the Financial Services sector to provide hardware accelerated normalisation of certain data feeds. The data processing system 100 includes a bespoke network interface device (NIC) 101 coupled to a host system 102 over communications bus 103. NIC 101 has two physical Ethernet ports 104 and 105 connected to networks 106 and 107, respectively (networks 106 and 107 could be the same network). The bespoke NIC 101 is based around an FPGA controller 108 that provides offloads 109 and 110 in hardware. The offloads could, for example, perform normalisation of data feeds received at one or both of ports 104 and 105. Typically the NIC will also include a large amount of high speed memory 111 in which the data processed by the hardware offloads can be stored for querying by software entities running at host system 102.
Generally, host system 102 will have an operating system that includes a kernel mode driver 112 for the bespoke NIC 101, and a plurality of driver libraries 115 by means of which other software 116 at user level 114 is configured to communicate with the NIC 101. The driver libraries could be in the kernel 113 or at user level 114. In the case of a host system in the Financial Services sector, software 116 might be bank software that includes a set of proprietary trading algorithms that trade on the basis of data generated by the offloads 109 and 110 and stored at memory 111. For example, memory 111 could include a database of normalised stock values, the normalisation having been performed by the offloads 109 and 110 in accordance with known database normalisation methods. Typically, host system 102 will also include management software 117 by means of which the NIC can be managed.
Since NIC 101 provides a customised function set, the vendor of the NIC will provide the driver and driver libraries so as to allow the software 116 to make use of the custom functions of the NIC. Any software running at user level on the host system must therefore trust the vendor and the integrity of the driver and driver libraries it provides. This can be a major risk if the software 116 includes proprietary algorithms or data models that are valuable to the owner of the data processing system. For example, the data processing system could be a server of a bank at which high frequency trading software 116 is running that includes very valuable trading algorithms, the trades being performed at an exchange remotely accessible to the software over network 106 or 107 by means of NIC 101. Since all data transmitted to and from the host system over the NIC traverses the kernel mode vendor driver 112 and vendor libraries 115, the software 116, including its trading algorithms, is accessible to malicious or buggy code provided by the NIC vendor. It would be an onerous job for the bank to check all the code provided by the NIC vendor, particularly since the drivers are likely to be regularly updated as bugs are found and updates to the functionality of the NIC are implemented. Furthermore, a NIC vendor may require that a network flow is established between the management software 117 of the NIC and the NIC vendor's own data centres. For example, this can be the case if the NIC is a specialised market data delivery accelerator and the market data is being aggregated from multiple exchanges at the vendor's data centres. With the structure shown in FIG. 1, the bank would not be able to prevent or detect the NIC vendor receiving proprietary information associated with software 116.
Financial institutions and other users of bespoke NICs that need to make use of hardware offloads are therefore currently left with no choice but to operate NICs that offer a level of performance behind that available in a commodity NIC and to trust any privileged code provided by the NIC vendor that is required for operation of the NIC.
There have been efforts to arrange network interface devices to utilise the processing power of a GPGPU (General Purpose GPU) provided at a peripheral card of a data processing system. For example, an Infiniband NIC can be configured to make peer-to-peer transfers with a GPGPU, as announced in the press release found at:
http (colon slash slash) gpgpu (dot) org/2009/11/25/nvidia-tesla-mellanox-infiniband
and the Nvidia GPUDirect technology is described at:
http (colon slash slash) www (dot) mellanox.com/pdf/whitepapers/TB_GPU_Direct (dot) pdf. Both of these documents are incorporated herein by reference for their teachings.
However, despite offering acceleration for particular kinds of operations (such as floating point calculations), GPGPUs are not adapted for many kinds of operations for which hardware acceleration would be advantageous. For example, a GPGPU would not be efficient at performing the normalisation operations described in the above example. Furthermore, in order for a NIC to make use of a GPGPU, the NIC typically requires an appropriately configured kernel-mode driver and such an arrangement therefore suffers from the security problems identified above.
Other publications that relate to memory-mapped data transfer between peripheral cards include “Remoting Peripherals using Memory-Mapped Networks” by S. J. Hodges et al. of the Olivetti and Oracle Research Laboratory, Cambridge University Engineering Department (a copy of the paper is available at http (colon slash slash) www (dot) cl (dot) cam (dot) ac (dot) uk/research/dtg/www/pubcations/public/files/tr.98.6.pdf), and “Enhancing Distributed Systems with Low-Latency Networking”, by S. L. Pope et al. of the Olivetti and Oracle Research Laboratory, Cambridge University Engineering Department (a copy of the paper is available at http (colon slash slash) www (dot) cl (dot) cam (dot) ac (dot) uk/research/dtg/www/publications/public/files/tr.98.7.pdf). Both of these documents are incorporated herein by reference for their teachings.
There is therefore a need for an improved network interface device that provides a high performance architecture for custom hardware offloads and a secure arrangement for a data processing system having a network interface device that includes custom hardware offloads.