The communications industry is rapidly changing to adjust to emerging technologies and ever-increasing customer demand. This customer demand for new applications and increased performance of existing applications is driving communications network and system providers to employ networks and systems having greater speed and capacity (e.g., greater bandwidth). In trying to achieve these goals, a common approach taken by many communications providers is to use packet switching technology. Increasingly, public and private communications networks are being built and expanded using various packet technologies, such as Internet Protocol (IP).
In networking devices, it is important to maintain accurate packet and byte count statistics for all traffic flowing through the device. Such statistics are important for customers, for lab testing, and also for verification and debug. Generally, statistics must be maintained for a large number of items in a few different categories (e.g., individual routes the packets are taking, the adjacencies (next hops) of the packets, etc.). It is not unusual for a core router to need to maintain statistics on packets arriving at a rate of 50M packets per second (PPS), and to have to support 1M routes (1M different sets of packet and byte counters in the route category).
Maintaining accurate statistics is so important that many systems will “back-pressure” (artificially restrict the rate of incoming traffic) rather than lose any counter updates in the event that they cannot keep up with the counter-update rate. A major issue in maintaining these counters is providing the necessary combination of storage space and bandwidth in a fashion that is cost effective, low in power, and low in pin count. Complicating the bandwidth issue is that, as the number of counters grows, the frequency at which software (S/W) can read an individual counter decreases. To prevent counter overflow, the counters must be made large enough so that they will not overflow in the time it takes software to service (read) all the counters. In a system with 1M counters, it is not unreasonable for S/W to read all the counters every 10 sec. This implies that the size of the counters must be chosen so that they do not overflow for at least 10 sec. Reasonable counter sizes for a 50M PPS arrival rate are at least 32 bits for packet counters, and at least 38 bits for byte counters. The size and number of counters make storing the full counters directly on a packet-switching chip impractical with today's technologies.
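The sizing argument above can be checked with a short calculation. The following is only an illustrative sketch: the function name and the 64-byte average packet size are assumptions introduced here (the text does not state a packet size), and the result is a bare lower bound rather than a recommended width.

```python
import math

def min_counter_bits(events_per_sec: float, read_interval_sec: float) -> int:
    """Smallest counter width (in bits) that cannot overflow between two reads."""
    max_count = math.ceil(events_per_sec * read_interval_sec)
    # A b-bit counter holds 0 .. 2**b - 1, so b must satisfy 2**b > max_count.
    return max_count.bit_length()

PPS = 50_000_000               # packet arrival rate from the text
READ_INTERVAL_SEC = 10         # software read period from the text
ASSUMED_AVG_PACKET_BYTES = 64  # assumption, not given in the text

packet_bits = min_counter_bits(PPS, READ_INTERVAL_SEC)
byte_bits = min_counter_bits(PPS * ASSUMED_AVG_PACKET_BYTES, READ_INTERVAL_SEC)
print(packet_bits, byte_bits)
```

Under these assumptions the lower bounds fall a few bits below the 32-bit and 38-bit sizes quoted above, which leaves margin for a late software read.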
There is an additional advantage to storing the counters in a large off-chip memory. If the counters can be made sufficiently large (e.g., approximately 56 bits), then they will overflow so infrequently (on the order of tens of years) that software does not have to read them periodically to prevent overflow. Counters that are large enough need to be read only when it is desired to gather the statistics from them.
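As a rough check of the overflow interval, the time for a counter to wrap can be computed from its width and its update rate. This is a sketch only: the function name is hypothetical, and the rate used is the 50M PPS packet rate cited earlier, applied to a packet counter.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_overflow(bits: int, events_per_sec: float) -> float:
    """Years until a counter of the given width wraps at a constant event rate."""
    return (2 ** bits) / events_per_sec / SECONDS_PER_YEAR

# A 56-bit packet counter incremented 50M times per second.
packet_years = years_to_overflow(56, 50_000_000)
print(round(packet_years, 1))
```

A byte counter fills faster than a packet counter by a factor of the average packet size, so the same width gives it a correspondingly shorter interval.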
At a peak rate of 50M counter updates per second (cups), and using 128 bits to store both the byte and packet counters for one item, the bandwidth required for counter updates (which must read the old counter value and then write back an updated value) is approximately 12.8 Gbps (and this ignores overheads due to CPU access to the counters and to refresh cycles for DRAM-based solutions). This data bandwidth could be achieved by a 64-bit wide Reduced Latency Dynamic Random Access Memory (RLDRAM) at 200 MHz, with appropriate attention to pipelining and bank conflicts. But this would support only one category of counter, and typical implementations try to maintain two or three categories. As can be seen, the cost (in terms of board space, power budget, and dollars) of implementing such a large number of counters off-chip at the necessary throughput rate can be very high.
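The 12.8 Gbps figure follows directly from the update rate, the entry width, and the two memory accesses per update. The sketch below reproduces that arithmetic and scales it to two or three counter categories; the function and variable names are illustrative assumptions, and CPU and refresh overheads are ignored, as in the text.

```python
def update_bandwidth_bps(updates_per_sec: int, bits_per_entry: int,
                         categories: int = 1) -> int:
    """Raw bandwidth for counter updates, excluding CPU and refresh overhead."""
    accesses_per_update = 2  # one read of the old value, one write of the new
    return updates_per_sec * bits_per_entry * accesses_per_update * categories

CUPS = 50_000_000      # counter updates per second, from the text
BITS_PER_ENTRY = 128   # byte counter plus packet counter for one item

required = update_bandwidth_bps(CUPS, BITS_PER_ENTRY)
for n in (1, 2, 3):
    gbps = update_bandwidth_bps(CUPS, BITS_PER_ENTRY, n) / 1e9
    print(f"{n} categor{'y' if n == 1 else 'ies'}: {gbps:.1f} Gbps")
```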
The large number of packet and byte counters required necessitates some type of RAM-based storage for the counters. Updating a RAM-based counter, however, involves a read-modify-write (RMW) operation: the previous contents of the RAM must be read, the contents must be updated, and the new contents must be written back. An RMW operation is well-known in the art, but it does require more bandwidth, since the RAM must be accessed twice for each counter update.
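The cost of the read-modify-write sequence can be illustrated with a small software model that tallies memory accesses. This is a sketch under stated assumptions: `CounterRam` and `rmw_update` are hypothetical names, and a Python list stands in for the RAM.

```python
class CounterRam:
    """Toy RAM model that counts how many reads and writes it services."""

    def __init__(self, num_counters: int) -> None:
        self._cells = [0] * num_counters
        self.reads = 0
        self.writes = 0

    def read(self, addr: int) -> int:
        self.reads += 1
        return self._cells[addr]

    def write(self, addr: int, value: int) -> None:
        self.writes += 1
        self._cells[addr] = value


def rmw_update(ram: CounterRam, addr: int, increment: int) -> None:
    """Read the old count, add the increment, and write the result back."""
    old = ram.read(addr)
    ram.write(addr, old + increment)


ram = CounterRam(num_counters=8)
for packet_bytes in (64, 128, 1500):
    rmw_update(ram, addr=3, increment=packet_bytes)
print(ram.read(3), ram.reads, ram.writes)
```

Each update costs one read and one write, which is why the raw update bandwidth is twice the product of the update rate and the entry width.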
Previous solutions to the counter update problem have used very expensive, high-bandwidth off-chip RAMs (high-speed DDR or QDR SRAMs or DDR SDRAM) that can keep up with the worst-case counter-update bandwidth requirements. If the packet arrival rate is 50M PPS, then the counter-update rate (for one type of counter) is 50M counter updates per second (cups) in the worst case. On top of this, some bandwidth is necessary for CPU activity (to read the counters), and for refresh (for DRAM-based solutions).
Some solutions to this problem have used FIFOs to compensate for reduced bandwidth and/or CPU activity (which can delay counter updates). These solutions have generally just used the FIFO as a buffer—the off-chip RAM is still designed for the worst-case bandwidth. (Typically, such FIFOs can hold no more than a few thousand entries, much smaller than the number of items.)
Another technique that has been used is to build two-level counters, where the least-significant bits (LSBs) and the most-significant bits (MSBs) are maintained separately. This can save bandwidth by only having to reference the MSBs when the LSBs overflow, instead of on every counter update.
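A minimal software sketch of such a two-level counter is shown below. The class name, the field names, and the `msb_accesses` tally are assumptions made for illustration; the sketch models only the carry behavior, not any particular hardware arrangement.

```python
class TwoLevelCounter:
    """Counter split into a narrow LSB part and a separately stored MSB part."""

    def __init__(self, lsb_bits: int) -> None:
        self.lsb_bits = lsb_bits
        self.lsb = 0           # updated on every event
        self.msb = 0           # touched only when the LSB part overflows
        self.msb_accesses = 0  # how often the MSB storage was referenced

    def add(self, amount: int) -> None:
        total = self.lsb + amount
        carry = total >> self.lsb_bits
        self.lsb = total & ((1 << self.lsb_bits) - 1)
        if carry:
            self.msb += carry
            self.msb_accesses += 1

    def value(self) -> int:
        return (self.msb << self.lsb_bits) | self.lsb


counter = TwoLevelCounter(lsb_bits=4)
for _ in range(40):
    counter.add(1)
print(counter.value(), counter.msb_accesses)
```

In this example, forty single-count updates reference the MSB part only twice, because a 4-bit LSB part overflows once every sixteen counts.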
Some aspects of the counter update problem are described in the article: Devavrat Shah et al., Maintaining Statistics Counters in Router Line Cards, IEEE Micro, Jan.-Feb. 2002, pp. 76-81, which is hereby incorporated by reference. Shah et al. describe a theoretical approach, and a largest-counter-first counter management algorithm (LCF CMA) that selects a counter with the largest count to update to a secondary memory. This requires that some mechanism be employed to maintain counters in a sorted order or to quickly determine the largest counter. Shah et al. admit that their “LCF CMA is a complex algorithm that is hard to implement at a very high speed. It would be interesting to obtain a similar performance as LCF CMA with a less complex algorithm.” Id. at 80-81.
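The largest-counter-first idea can be sketched in simplified form as follows. This is not the algorithm as specified by Shah et al.: the class name, the fixed flush period, and the linear scan for the largest counter are all assumptions made here to keep the example short.

```python
class LcfCounters:
    """Small primary counters that are periodically flushed, largest first."""

    def __init__(self, num_counters: int, flush_period: int) -> None:
        self.small = [0] * num_counters  # stands in for the primary memory
        self.large = [0] * num_counters  # stands in for the secondary memory
        self.flush_period = flush_period  # assumed: updates between flushes
        self._since_flush = 0

    def update(self, index: int, increment: int = 1) -> None:
        self.small[index] += increment
        self._since_flush += 1
        if self._since_flush == self.flush_period:
            self._since_flush = 0
            self._flush_largest()

    def _flush_largest(self) -> None:
        # Linear scan for the largest small counter. Finding this quickly is
        # the step that needs sorted order or a fast maximum in practice.
        victim = max(range(len(self.small)), key=self.small.__getitem__)
        self.large[victim] += self.small[victim]
        self.small[victim] = 0

    def total(self, index: int) -> int:
        return self.large[index] + self.small[index]


counters = LcfCounters(num_counters=3, flush_period=4)
for index in (0, 0, 0, 1):
    counters.update(index)
print(counters.small, counters.large)
```

The scan inside `_flush_largest` touches every counter, which illustrates why determining the largest counter becomes costly when the number of counters is large.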