The present invention relates to functions for distributing data traffic over a set of “bins,” and more specifically, to traffic distribution functions which employ hash functions to distribute data traffic among a set of ports or interfaces.
In many applications, packet-based switching devices (also referred to herein as switches) must statistically distribute traffic over a set of forwarding interfaces, ports, or bins in order to achieve a greater aggregate transmission bandwidth than a single interface can provide. This practice is known variously as “link aggregation,” “port trunking,” “port teaming,” or “load sharing.” The goal of these techniques is to aggregate N ports together in order to achieve N times the transmit bandwidth of a single port. To achieve this, each packet that the switching device forwards must be mapped to one of N ports in a uniform manner, i.e., no one port can be systematically preferred over another.
The ideal method for guaranteeing uniform load balancing over the aggregated ports requires maintaining utilization state for each port. Packets can then be assigned to the least loaded port, thereby ensuring optimal uniformity. Unfortunately, this solution has large implementation costs and therefore is unsuitable for highly integrated switching devices.
On the other hand, the simplest method for guaranteeing uniform load balancing is to randomly assign each packet to an egress port. This solution has a very low implementation cost, but it violates important “flow ordering” constraints present in many applications. Such constraints require that packets sharing certain properties, e.g., as derived from their content, be forwarded along the same path through the network of switches.
The standard solution to this problem that is both cheap to implement and maintains flow ordering is to assign packets to egress ports based on the result of a “hash function” operation. A hash function maps an input “key” to an output hash value having fewer bits. The hash value is then mapped or “binned” using a binning function which maps the hash value to a port number between 0 and N−1.
Each packet's key is generated in such a way that two packets belonging to the same flow have the same key. For example, a simple definition of a flow depends only on the packet's source and destination addresses: (src_address, dst_address). In such a case the key would be constructed as a concatenation of these two fields. The definition of a flow may be refined further by including other properties of the packet such as, for example, addresses belonging to higher-layer protocols or quality-of-service classifications.
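The hash-then-bin pipeline described above can be sketched as follows. This is a minimal illustration only: the function names flow_key and select_port, the use of Python's zlib.crc32 as the hash function, the example addresses, and the choice of modulo as the binning function are all assumptions made for brevity.

```python
import zlib

def flow_key(src_address: bytes, dst_address: bytes) -> bytes:
    """Build the key as the concatenation (src_address, dst_address)."""
    return src_address + dst_address

def select_port(key: bytes, num_ports: int) -> int:
    """Hash the key, then bin the hash value to a port in 0..num_ports-1."""
    hash_value = zlib.crc32(key)   # 32-bit CRC used as a stand-in hash
    return hash_value % num_ports  # modulo binning (an assumed choice)

# Two packets of the same flow carry the same addresses, so they share a key
# and are always assigned the same egress port.
src = bytes.fromhex("0a0000010203")  # hypothetical source address
dst = bytes.fromhex("0a00000a0b0c")  # hypothetical destination address
port_a = select_port(flow_key(src, dst), 4)
port_b = select_port(flow_key(src, dst), 4)
```

Because the port depends only on the key, flow ordering is preserved without keeping any per-port or per-flow state.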
A good hash function for a high-performance, highly integrated switching device is characterized by good uniformity, small implementation area, and low computation time (i.e., low latency). Uniformity can be assessed by comparison to a random assignment. That is, if two randomly selected keys differ, then there should be a 50% probability that their hash values will be different. Hash functions have been proposed that provide very good uniformity when measured in this manner. However, few of these functions measure well on implementation area or latency. This is commonly the result of iterative properties inherent in the functions requiring that each byte of the input key be processed in a serial manner. In a high-performance hardware implementation, these iterations generally must be unrolled into unique logic structures. This leads to a large amount of area and a long computation time.
The generally recognized hash function suitable for high-performance, high-integration hardware implementations is the CRC, or Cyclic Redundancy Check. The CRC is commonly defined in an iterative manner, but in its unrolled form is equivalent to a tree of binary XOR operations over sets of input key bits. A generalization of the CRC that covers other (simpler) commonly used hardware hash functions is simply an XOR tree per hash value bit:
hash_value[i] = key[F[i,1]] ^ key[F[i,2]] ^ ... ^ key[F[i,n_i]]
where i = 0 . . . M−1, and F[i,j] gives the index of the j-th of the n_i key bits that fan in to hash_value bit i. These key bits are XOR-ed together (implemented as a tree structure for low area and latency) to produce hash_value[i].
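The XOR-tree formulation above can be modeled in a few lines of software. The sketch below is illustrative only: the name xor_tree_hash, the representation of the key as a list of bits, and the particular fan-in sets are hypothetical and do not correspond to any specific CRC polynomial.

```python
from functools import reduce
from operator import xor

def xor_tree_hash(key_bits, fan_in):
    """Return the hash bits; fan_in[i] lists the key bit indices F[i,*]
    that are XOR-ed together to form hash_value[i]."""
    return [reduce(xor, (key_bits[j] for j in taps), 0) for taps in fan_in]

# Hypothetical 4-bit key and 3-bit hash (M = 3).
key_bits = [1, 0, 1, 1]
fan_in = [[0, 1], [1, 2, 3], [0, 3]]
hash_bits = xor_tree_hash(key_bits, fan_in)  # [1, 0, 0]
```

Since every output bit is a pure XOR of key bits, such a hash is linear over XOR: the hash of (a XOR b) equals the hash of a XOR-ed with the hash of b.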
Fewer implementation options are available for the binning stage that follows the hash function. Generally, one of two functions is used: modulo or division. When N is a power of two, these functions are essentially equivalent, i.e., they both represent taking the port number directly from a subset of the hash_value bits. For example, modulo binning over two ports represents assigning the egress port from hash_value[0]. Division binning in this example would assign the egress port from hash_value[M−1]. When N is not a power of two, a simple arithmetic calculation is performed.
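The two binning functions can be sketched as below. The text does not give the exact arithmetic used for division binning, so the scaling formula (hash_value × N) >> M is an assumption; it is chosen because it reduces to the most significant hash bit when N = 2, consistent with the example above. The function names and the 8-bit hash width are likewise hypothetical.

```python
def modulo_bin(hash_value: int, num_ports: int) -> int:
    """Port is the remainder of the hash value divided by num_ports."""
    return hash_value % num_ports

def division_bin(hash_value: int, num_ports: int, hash_width: int) -> int:
    """Scale the hash range [0, 2**hash_width) onto [0, num_ports).
    Assumed formula; equivalent to dividing by 2**hash_width / num_ports."""
    return (hash_value * num_ports) >> hash_width

M = 8                      # assumed hash width in bits
h = 0b10000000             # hash_value[M-1] = 1, hash_value[0] = 0
low_bit_port = modulo_bin(h, 2)        # taken from hash_value[0]
high_bit_port = division_bin(h, 2, M)  # taken from hash_value[M-1]
```

For a port count that is not a power of two, both functions still return a port number between 0 and N−1, but the port no longer corresponds to a fixed subset of hash bits.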
Hash functions such as the CRC defined in terms of an XOR tree over key bits provide good uniformity when evaluated over random keys. However, when evaluated over real-world network packets, severe non-uniformity corner cases are sometimes seen. These arise because real-world keys are not distributed in a uniform, random manner. Commonly the addresses contained in the keys, e.g., MAC or IP addresses, are sequential in nature. Unfortunately, any hash function implemented as an XOR tree over the key bits, followed by either modulo or division binning, gives very bad uniformity when evaluated over such key sets. These non-uniformities are a significant problem for highly integrated switches because they lead to a need for increased on-chip packet buffering, a scarce and expensive resource on such devices.
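The effect of sequential addresses can be made visible with a deliberately weak toy example. The hash below is a trivial XOR tree (each hash bit is one source bit XOR-ed with one destination bit), not a CRC, and the flow set is hypothetical; it is chosen only so that the collapse can be checked by hand.

```python
from collections import Counter

def xor_fold_hash(src: int, dst: int) -> int:
    """Toy 8-bit XOR-tree hash: hash bit i = src bit i XOR dst bit i."""
    return (src ^ dst) & 0xFF

def port_histogram(flows, num_ports):
    """Count how many flows land on each port under modulo binning."""
    return Counter(xor_fold_hash(s, d) % num_ports for s, d in flows)

# Sixteen sequential sources, each talking to a destination offset by 16.
flows = [(i, i + 16) for i in range(16)]
histogram = port_histogram(flows, 4)
```

For every flow in this set the source and destination differ only in bit 4, so every hash value is 16 and all sixteen flows are assigned to port 0 while the other three ports stay idle.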
A software-based algorithm known as Pearson's hash function has been shown to have better performance with regard to sequential key non-uniformity than a standard XOR-tree implementation. Pearson's algorithm processes the key one byte at a time, using a randomly initialized static mapping table to map the current hash value, combined with the next key byte, to a new hash value. However, while Pearson's approach has been shown to be effective in software solutions, implementing its iterative table lookup in highly integrated, high-performance hardware is problematic in terms of both area and latency.
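For reference, Pearson's hash in its commonly published 8-bit form can be sketched as follows. The helper name make_table, the seed, and the use of Python's random module to build the table are assumptions for illustration; any fixed permutation of 0..255 serves as the mapping table.

```python
import random

def make_table(seed: int = 0) -> list:
    """Build a static, randomly initialized permutation of 0..255."""
    table = list(range(256))
    random.Random(seed).shuffle(table)
    return table

def pearson_hash(key: bytes, table) -> int:
    """Fold the key into one byte, one table lookup per key byte."""
    h = 0
    for byte in key:
        h = table[h ^ byte]  # each lookup depends on the previous result
    return h

table = make_table()
# 256 sequential one-byte keys map to 256 distinct hash values, because
# the table is a permutation.
sequential = [pearson_hash(bytes([b]), table) for b in range(256)]
```

The loop makes the hardware difficulty apparent: each table lookup needs the result of the previous one, so the lookups cannot be evaluated in parallel as an XOR tree can.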