1. Field of the Invention (Technical Field)
The present invention relates to network interface controllers and apparatuses and methods to increase message throughput therethrough.
2. Description of Related Art
Note that the following discussion refers to a number of publications by author(s) and year of publication, and that due to recent publication dates certain publications are not to be considered as prior art vis-a-vis the present invention. Discussion of such publications herein is given for more complete background and is not to be construed as an admission that such publications are prior art for patentability determination purposes.
Network bandwidths continue to rise rapidly; however, the clock rates of network interface controllers (NICs) are not keeping pace. NICs must therefore realize more parallelism merely to achieve increases in message rate that match the increases in bandwidth. To compound the problem, the advent of multi-core and many-core processors has stopped the growth in message sizes, so additional improvements in the network interface are required simply to match the improvement in network bandwidth. While the use of multiple NICs has in the past led to significant challenges in implementing multi-rail systems, the present invention uses multiple “cores” within a NIC to enhance message rate without the challenges associated with multi-rail systems.
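By way of rough illustration only, the relationship among bandwidth, message size, clock rate, and required parallelism can be sketched as follows. The function names, the cycles-per-message parameter, and all numeric values are hypothetical and are not taken from any measured system.

```python
def required_message_rate(bandwidth_bytes_per_s: float, message_bytes: float) -> float:
    """Messages per second needed to keep a link of the given bandwidth full."""
    if message_bytes <= 0:
        raise ValueError("message size must be positive")
    return bandwidth_bytes_per_s / message_bytes


def required_parallelism(message_rate: float, clock_hz: float,
                         cycles_per_message: float) -> float:
    """Average number of messages that must be processed concurrently.

    Assumes (hypothetically) that each message costs a fixed number of
    NIC clock cycles, so one processing element sustains at most
    clock_hz / cycles_per_message messages per second.
    """
    if clock_hz <= 0:
        raise ValueError("clock rate must be positive")
    return message_rate * cycles_per_message / clock_hz


# Hypothetical figures: fixed 1000-byte messages and a fixed 500 MHz clock.
rate_before = required_message_rate(2e9, 1000)   # 2 GB/s link
rate_after = required_message_rate(4e9, 1000)    # bandwidth doubles
lanes_before = required_parallelism(rate_before, 500e6, 500)
lanes_after = required_parallelism(rate_after, 500e6, 500)
```

With message size and clock rate held constant, doubling the bandwidth doubles the message rate needed to saturate the link, and under the fixed-cost assumption the required parallelism doubles with it.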
There has been a recent emphasis on improving message rates for the Message Passing Interface (MPI). InfiniPath from PathScale (now QLogic, Inc.), L. Dickman, et al., “PathScale InfiniPath: A first look”, Proceedings of the 13th Symposium on High Performance Interconnects (August 2005), was the first of a generation of so-called onload engines that move all responsibility for message processing to the host microprocessor. InfiniPath demonstrated (and touted the fact that) an onload approach could achieve extremely high message rates for MPI under a certain set of scenarios. Recently, Mellanox has released a new series of network interfaces that performs most message processing on the host processor and also touts extremely high MPI message rates.
Studies at Sandia National Laboratories (SNL), however, indicate that the microbenchmarks used to make many of the claims regarding the onload approach have little relationship to realistic scenarios. K. Underwood, “Challenges and issues in benchmarking MPI”, Recent Advances in Parallel Virtual Machine and Message Passing Interface: 13th European PVM/MPI Users' Group Meeting, Bonn, Germany (September 2006), Proceedings, volume 4192 of Lecture Notes in Computer Science, pages 339-346 (Springer-Verlag, 2006); and K. D. Underwood et al., “The impact of MPI queue usage on message latency”, Proceedings of the International Conference on Parallel Processing (ICPP), Montreal, Canada (August 2004). Indeed, it is often the case that performance-enabling features of the network interface, rather than microbenchmark results, have more impact on performance. R. Brightwell, et al., “A comparison of 4× InfiniBand and Quadrics Elan-4 technologies”, Proceedings of the 2004 International Conference on Cluster Computing (September 2004); and R. Brightwell et al., “An analysis of the impact of overlap and independent progress for MPI”, Proceedings of the 2004 International Conference on Supercomputing, St. Malo, France (June 2004). Recent work at SNL has also demonstrated that offload approaches can achieve high message rates while also delivering key network features such as overlap and independent progress. A. Rodrigues, et al., “Enhancing NIC performance for MPI using processing-in-memory”, Proceedings of the 2005 Workshop on Communication Architectures for Clusters (April 2005); K. D. Underwood, et al., “A hardware acceleration unit for MPI queue processing”, 19th International Parallel and Distributed Processing Symposium (April 2005); and K. D. Underwood, et al., “Accelerating list management for MPI”, Proceedings of the 2005 IEEE International Conference on Cluster Computing (September 2005).
The impending challenge is two-fold. The scaling of clock rates for standard cell Very Large Scale Integration (VLSI) design has drastically slowed. Thus, offload approaches can no longer linearly increase their performance through increasing clock rate. Instead, they must exploit more parallelism. Onload approaches will face a similar problem. Memory latencies are approaching a floor and the processing of MPI messages is highly sensitive to memory latency. Thus, new solutions to improving message rate are needed.
Quadrics, in the Elan5 adapter, J. Beecroft, et al., “The Elan5 network processor”, Proceedings of the International Supercomputing Conference 2007, Dresden, Germany (June 2007), chose a parallel processing approach. In the Elan5 approach, several specialized RISC processors share all resources, including all memories. The seven processors are dedicated to a variety of tasks in a largely pipelined model with two input processors, one management processor, and four processors dedicated to output and general purpose processing.
When there is only a small number of host processes, it is necessary to arrange the processing elements such that all of the processing elements can be dedicated to servicing messaging for a single host process. This is the approach that Quadrics chose; however, such a parallel processing approach imposes a number of challenges. Foremost, access to most resources must be arbitrated among the various processors. In particular, arbitrating access to the memory that holds such data structures as the MPI posted receive queue is an overhead that is imposed unnecessarily and that can cause performance degradation due to contention for memory ports. The use of multiple NIC cores according to the invention eliminates this issue by replicating fully functional NIC cores (as opposed to just the processing logic within the NIC core) and appropriately routing traffic to each. This achieves better scalability in terms of both performance and silicon area by distinguishing among resources that must be replicated, resources that can be shared without contention issues, and resources that can be partitioned based on the operational model.
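The idea of replicating fully functional NIC cores and routing traffic to each can be sketched in software as follows. This is a minimal model under stated assumptions, not the disclosed hardware: the class names, the tag-only matching, and in particular the routing rule (host process identifier modulo the number of cores) are hypothetical choices made for illustration, since the discussion above does not specify how traffic is routed.

```python
from dataclasses import dataclass, field


@dataclass
class NicCore:
    """One replicated core with its own private posted receive queue."""
    core_id: int
    posted_receives: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

    def post_receive(self, tag: int) -> None:
        self.posted_receives.append(tag)

    def handle(self, tag: int, payload: bytes) -> bool:
        # Search this core's queue in posting order; because no other
        # core touches this queue, no arbitration is modeled.
        for index, posted_tag in enumerate(self.posted_receives):
            if posted_tag == tag:
                del self.posted_receives[index]
                self.delivered.append((tag, payload))
                return True
        return False  # no matching receive was posted (unexpected message)


class MultiCoreNic:
    """Routes each host process's traffic to exactly one core."""

    def __init__(self, num_cores: int) -> None:
        if num_cores < 1:
            raise ValueError("at least one core is required")
        self.cores = [NicCore(i) for i in range(num_cores)]

    def core_for(self, process_id: int) -> NicCore:
        # Hypothetical routing rule, assumed for this sketch only.
        return self.cores[process_id % len(self.cores)]

    def post_receive(self, process_id: int, tag: int) -> None:
        self.core_for(process_id).post_receive(tag)

    def deliver(self, process_id: int, tag: int, payload: bytes) -> bool:
        return self.core_for(process_id).handle(tag, payload)


nic = MultiCoreNic(num_cores=4)
nic.post_receive(process_id=5, tag=7)
matched = nic.deliver(process_id=5, tag=7, payload=b"hello")
```

In this model each queue is touched by only one core, which is the property that removes the need to arbitrate access to the posted receive queue. Which resources would instead be shared or partitioned in an actual device is outside the scope of the sketch.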