1. Field of the Invention (Technical Field)
The present invention relates to atomic operation units (AOUs) for network interface controllers (NICs).
2. Description of Related Art
Note that the following discussion refers to a number of publications by author(s) and year of publication, and that due to recent publication dates certain publications are not to be considered as prior art vis-a-vis the present invention. Discussion of such publications herein is given for more complete background and is not to be construed as an admission that such publications are prior art for patentability determination purposes.
A key capability for parallel computers, particularly those supporting partitioned global address space (PGAS) programming models, is the ability to efficiently support remote atomic operations. A common usage model for remote atomic operations is to have many nodes access a small number of variables on a given target node. A unit capable of performing atomic operations is sometimes provided on the network interface along with a local cache of data. The local cache on the network interface controller (NIC) poses a set of challenges regarding the frequency with which items are propagated to the node's primary memory. The present invention provides a mechanism for managing this cache along with mechanisms to reduce data traffic to the host processor.
Atomic operations have been supported on the network interface for quite some time with the restriction that the data item only be modified through a specific Application Programming Interface (API). For example, the Quadrics Elan network adapters, J. Beecroft, et al., “Meiko CS-2 interconnect Elan-Elite design”, Parallel Computing, 20(10-11):1627-1638 (1994); and F. Petrini, et al., “The Quadrics network: High-performance clustering technology”, IEEE Micro, 22(1):46-57 (January 2002), which support SHMEM, Cray Research, Inc., SHMEM Technical Note for C, SG-2516 2.3 (October 1994), perform atomic operations using an Elan thread. A similar scheme was provided on the Cray T3E, S. L. Scott, “Synchronization and communication in the T3E multiprocessor”, Seventh ACM International Conference on Architectural Support for Programming Languages and Operating Systems (October 1996), but was implemented at the memory controller, where it is easier to guarantee ordering semantics, the result is always visible to the processor, and no system bus bandwidth is consumed to flush an item. While placing the operations at the memory controller is quite appealing technically, it is generally less feasible in modern system implementations where the memory controller is part of the host processor chip.
Upcoming networks by supercomputer vendors may support SHMEM style atomics with an atomic unit on the network interface along with a local cache. However, none of these adapters are believed to include a write-through cache or a local tracking of outstanding items evicted from the local cache. More importantly, these designs likely use time-outs to mitigate the amount of traffic placed on the interconnect to the host processor rather than a more flexible rate absorbing mechanism.
Collective operations are closely related to atomic operations and have been studied on programmable network interfaces (e.g., D. Buntinas, et al., “NIC-based reduction in Myrinet clusters: Is it beneficial?”, Proceedings of the SAN-02 Workshop (February 2002); A. Moody, et al., “Scalable NIC-based reduction on large-scale clusters”, Proceedings of the ACM/IEEE SC2003 Conference (November 2003)); however, collectives are fundamentally different in the way they accept data and provide results.
Previous designs have attempted to implement atomic operations on the network interface using a local cache. One of the fundamental problems, however, is that the access mechanisms for variables touched by the atomic operations are sub-optimal. In general, previous designs have used a time-out to manage the local cache. This time-out allows the cache to update the host memory after a predefined interval, but imposes a set of constraints on performance. For example, one usage of atomic operations is to allow the local host to track “completion events”. These events can be signaled by atomically incrementing a variable, with the host waiting for a certain value of the variable to be reached before proceeding. “Waiting” typically consists of polling the location of the atomic element in host memory, so it is important that the value there be updated as quickly as possible. It is generally desirable to relax these constraints by increasing the frequency with which updates are written to host memory; however, doing so could easily overwhelm the link between the network interface and the host processor.
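The completion-event usage described above can be illustrated with a minimal software sketch. It is not drawn from any cited design: the class `CompletionCounter` and the functions `wait_for` and `run_demo` are hypothetical names, a lock stands in for the hardware atomic unit, and threads stand in for remote nodes.

```python
import threading
import time


class CompletionCounter:
    """Hypothetical stand-in for an atomically incremented variable."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def atomic_increment(self):
        # The lock models the atomicity an AOU would provide in hardware.
        with self._lock:
            self._value += 1
            return self._value

    def read(self):
        with self._lock:
            return self._value


def wait_for(counter, expected):
    """Poll the counter until it reaches `expected`; return the poll count."""
    polls = 0
    while counter.read() < expected:
        polls += 1
        time.sleep(0)  # yield so the signaling threads can make progress
    return polls


def run_demo(num_events=8):
    counter = CompletionCounter()
    # Each thread models one remote node signaling one completion event.
    signalers = [
        threading.Thread(target=counter.atomic_increment)
        for _ in range(num_events)
    ]
    for s in signalers:
        s.start()
    wait_for(counter, num_events)  # the "host" proceeds only after all events
    for s in signalers:
        s.join()
    return counter.read()
```

In this sketch the host observes every increment immediately. On a NIC with a local cache, the value the host polls is only as fresh as the most recent update written to host memory, which is why the update frequency matters.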
The fundamental limitation associated with time-outs for moving data from NIC cache to host memory is the specific time-out value that is chosen. If the time-out value is too large, a significant performance penalty is incurred because the host has to wait for an extended period of time to determine that the value has been updated. If the time-out value is too small, it loses its impact because it no longer reduces traffic to the host.
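The trade-off can be made concrete with a small model. The following is a sketch under stated assumptions, not a description of any particular adapter: time is measured in abstract ticks, the NIC cache is assumed to be flushed to host memory at every multiple of the time-out, and a flush occurs only if at least one update arrived since the previous one. The function name `simulate_timeout` is hypothetical.

```python
def simulate_timeout(update_times, timeout):
    """Return (host_writes, max_delay) for a periodically flushed cache.

    update_times: ticks at which atomic updates reach the NIC.
    timeout:      flush period in ticks (assumed model, see lead-in).
    """
    flush_ticks = set()
    max_delay = 0
    for t in update_times:
        # The update becomes visible to the host at the next flush boundary.
        flush_time = (t // timeout + 1) * timeout
        flush_ticks.add(flush_time)
        max_delay = max(max_delay, flush_time - t)
    return len(flush_ticks), max_delay


# One update per tick for 100 ticks.
updates = range(100)
print(simulate_timeout(updates, 1))   # (100, 1): low delay, no traffic reduction
print(simulate_timeout(updates, 50))  # (2, 50): little traffic, long delay
```

Under this model a time-out of 1 tick produces one host write per update, while a time-out of 50 ticks reduces 100 updates to 2 host writes at the cost of the host waiting up to 50 ticks to see a change.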
At this point, it is useful to consider traffic models for atomic operations. There are three basic points in the spectrum to consider. The first is a light traffic model, where some number of locations are modified atomically “occasionally”. Virtually any atomic unit is sufficient for this class of operation, as it happens seldom enough to have minimal impact on performance. The second is “global random access” traffic as might be seen in the GUPS (Giga-Updates per Second) benchmark. In this case, regardless of access rate, caches have no value because the operations almost never hit in the cache. These cases require that the functional unit and cache operate efficiently in high miss rate scenarios. The third case, and the interesting case for the discussion of bandwidth mitigation, is one where a small number of variables are heavily accessed through atomic operations at a given node. This type of access occurs frequently when managing locks or structures such as shared queues. What is unique about the third case is that it can generate a large amount of traffic to the host memory that can be mitigated by caching on the network interface.
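The contrast between the second and third traffic models can be illustrated by measuring hit rates in a small cache. This is a sketch only: the 16-entry capacity, the least-recently-used replacement policy, the address ranges, and the function name `hit_rate` are all assumptions chosen for illustration and are not taken from the designs discussed above.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, capacity=16):
    """Fraction of accesses that hit in an LRU cache of `capacity` entries."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0


# Third model: a few heavily accessed variables (e.g. locks, queue heads).
hot_spot = [i % 4 for i in range(10_000)]

# Second model: GUPS-like updates spread over a very large address range.
rng = random.Random(0)
global_random = [rng.randrange(2**30) for _ in range(10_000)]

print(hit_rate(hot_spot))       # close to 1: only the first 4 accesses miss
print(hit_rate(global_random))  # close to 0: almost every access misses
```

In the hot-spot case nearly every operation is absorbed by the cache, so forwarding each one to the host would generate traffic that caching can avoid. In the random-access case the cache provides no such benefit and the miss path dominates.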
The mechanism provided by the present invention uses a write-through cache combined with traffic mitigation at both the atomic unit and the queue between the atomic unit and the host processor. It also provides appropriate mechanisms for tracking “in flight” operations as necessary. Together, these optimizations significantly enhance performance for atomic operations.
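As a rough illustration of how a write-through cache and a traffic-absorbing queue might fit together, consider the following sketch. It is one possible software reading of the summary above and is not the claimed implementation: the class `WriteThroughAtomicUnit`, its methods, and the policy of coalescing queued writes by address are all assumptions made for illustration.

```python
from collections import OrderedDict


class WriteThroughAtomicUnit:
    """Hypothetical model: NIC cache plus a coalescing queue to the host."""

    def __init__(self):
        self.cache = {}               # NIC-side copies of atomic variables
        self.pending = OrderedDict()  # address -> latest value not yet at host
        self.host_memory = {}
        self.host_writes = 0          # writes that crossed the host link

    def fetch_add(self, addr, delta):
        """Atomically add `delta` at `addr`; return the previous value."""
        old = self.cache.get(addr, self.host_memory.get(addr, 0))
        new = old + delta
        self.cache[addr] = new
        # Write-through: every update is queued for the host immediately.
        # Assumed mitigation: a write already queued for this address is
        # overwritten in place rather than followed by a second write.
        self.pending[addr] = new
        return old

    def drain_one(self):
        """Model the host link accepting one queued write, oldest first."""
        if not self.pending:
            return False
        addr, value = self.pending.popitem(last=False)
        self.host_memory[addr] = value
        self.host_writes += 1
        return True


unit = WriteThroughAtomicUnit()
for _ in range(100):
    unit.fetch_add(0x10, 1)  # burst of updates to one hot variable
unit.drain_one()
print(unit.host_memory[0x10], unit.host_writes)  # 100 1
```

In this model no timer is involved: the host sees the newest value as soon as the link can accept a write, and a burst that outpaces the link collapses into a single write. The `pending` map also records which addresses have values still in flight to the host.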