Continuing advances in semiconductor technology and increasing levels of integration have allowed multiple processor cores to be integrated together onto a single integrated-circuit chip. Some applications can be divided into separate, relatively independent tasks that can be assigned to different processor cores on a multi-processor chip. Such multi-threaded applications can especially benefit from the processing power of multiple processors.
One application that benefits from multi-processing is the processing of packets in a network. A network node may receive thousands of packets in a short period of time. Multiple processors may each be assigned individual packets to process, such as for security, encryption, routing, and other network functions. The amount of network traffic that can be processed can scale with the number of processors.
A multi-processor system may have hundreds or more processors on one or more chips that can operate somewhat independently of one another. However, the packets from the network must be sent to each assigned processor, such as by initially writing the incoming packets to a shared memory. Since all incoming packets must pass through this shared memory, the shared memory can become a system bottleneck.
While multi-port memory cells could be used to increase the bandwidth of the shared memory, such multi-port memory cells are much larger and more expensive than standard single-port memory cells. Another approach is to divide the shared memory into several banks. Each bank may be accessed separately, allowing different processors to access different banks at the same time. For example, 4 banks could allow 4 processors simultaneous access, while 16 banks could allow 16 processors simultaneous access.
While each memory bank could have a separate range of addresses, interleaving addresses among all the banks is often preferable. In word interleaving, each bank stores one multi-byte word. A sequence of successive words is written to successive banks. Incoming packets appear as a stream of words in a sequence of increasing addresses, and can be written into the shared memory as successive multi-byte words that are written to successive banks of memory.
Since the incoming packets are written to successive banks, the writes are spread out across all banks so that no one bank is overloaded with writes. Other processors can access one bank when the incoming-packet writes are being made to another bank.
FIG. 1 shows packets that have been written into a word-interleaved shared memory. In this simple example, there are 4 banks and the words are 4 bytes. Packet 1 is stored starting at address 1000 Hex, and has its first 4 bytes 0:3 stored in bank 0. The next word, bytes 4:7, is stored in bank 1; bytes 8:B are stored in bank 2, and bytes C:F are stored in bank 3. Successive 4-byte words 10:13, 14:17, 18:1B, 1C:1F are stored in the next row of banks 0, 1, 2, 3 as shown. The last bytes in the packet are bytes 7FC:7FF, which are stored in bank 3.
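The address-to-bank mapping in this example can be sketched in a few lines of Python. This is only an illustration under the stated example parameters (4 banks, 4-byte words); the names `bank_and_row`, `WORD_BYTES`, and `NUM_BANKS` are hypothetical and do not come from the figures.

```python
WORD_BYTES = 4   # bytes per word in the FIG. 1 example
NUM_BANKS = 4    # interleaved banks in the FIG. 1 example


def bank_and_row(address):
    """Return (bank, row) for a byte address under word interleaving.

    Successive words go to successive banks; after the last bank the
    next word goes to bank 0 of the following row.
    """
    word_index = address // WORD_BYTES
    return word_index % NUM_BANKS, word_index // NUM_BANKS


PACKET1_BASE = 0x1000  # start address of packet 1 in the example

# Byte offsets within packet 1 that the text calls out.
for offset in (0x0, 0x4, 0x8, 0xC, 0x10, 0x7FC):
    bank, row = bank_and_row(PACKET1_BASE + offset)
    print(f"offset {offset:#05x} -> bank {bank}, row {row}")
```

Offsets 0, 4, 8, and C map to banks 0, 1, 2, and 3, offset 10 wraps back to bank 0 in the next row, and offset 7FC lands in bank 3, in agreement with the description above.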
Packet 1 can be written into the shared memory as a stream of words that are successively written to the four banks, and then to successive rows, until all of the packet is written.
The shared memory may be divided into pages of 2K bytes per page. The start of each packet may be aligned to the 2K page boundaries. Thus all packets would start at a multiple of 2K bytes. Packet 1 starts at address 1000 Hex, packet 2 starts at address 1800, packet 3 starts at address 2000, and packet 4 starts at address 2800, etc.
Network packets can have varying sizes. While 2K bytes may be the maximum packet size, smaller packets are common. For example, packet 2 is 1K bytes, packet 3 is 32 bytes, and packet 4 is only 16 bytes. Packets could have other sizes that are not powers of two, such as 5 bytes, 27 bytes, etc. When one packet is assigned to each 2K page, there is often wasted space at the end of the page since most packets are smaller than 2K bytes.
Aligning packets to pages in memory can have an unintended consequence. Packets typically start with headers, which contain important information such as the size of the packet, a network protocol used by the packet, and status or control information. A packet processor may examine these header fields more often than other parts of the packet such as the data payload.
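The wasted space can be quantified with a short Python sketch. It assumes one packet per 2K page as described above; the 2048-byte size for packet 1 is inferred from its last bytes being 7FC:7FF, and the names `wasted_bytes` and `packet_sizes` are hypothetical.

```python
PAGE_BYTES = 2048  # 2K-byte pages, one packet per page

# Packet sizes in bytes from the example; packet 1's size is inferred
# from its last bytes being 7FC:7FF (an assumption of this sketch).
packet_sizes = {1: 2048, 2: 1024, 3: 32, 4: 16}


def wasted_bytes(packet_size):
    """Unused bytes at the end of a page that holds one packet."""
    if not 0 < packet_size <= PAGE_BYTES:
        raise ValueError("packet must fit within a single page")
    return PAGE_BYTES - packet_size


for number, size in packet_sizes.items():
    print(f"packet {number}: {size} bytes used, {wasted_bytes(size)} wasted")

total_wasted = sum(wasted_bytes(size) for size in packet_sizes.values())
total_allocated = PAGE_BYTES * len(packet_sizes)
print(f"{total_wasted} of {total_allocated} allocated bytes are unused")
```

For these four packets, more than half of the allocated memory is unused, and the two smallest packets leave nearly their entire pages empty.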
For example, each of the processors that are assigned to process packets may need to read the size field (SIZ) from the packet headers. Although each processor reads a different packet stored in a different location in the shared memory, the accesses may tend to be to the same bank in the shared memory because the size field tends to be the same number of bytes from the start of each packet, when the same network protocols are used by different packets.
As shown in FIG. 1, size field SIZ occurs in bytes 0:3 of each of packets 1, 2, 3, 4. Four separate processors examining these packets may need to read the size fields, requiring access to bank 0. Although the packets are stored across all four banks, bank 0 is likely to have a higher access frequency since the size field is more frequently read than other bytes in the packets. This is undesirable.
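The concentration of size-field reads on one bank can be illustrated with a small Python sketch. It assumes the FIG. 1 parameters (4 banks, 4-byte words, packets aligned to 2K pages) and one read of SIZ per packet; the function and constant names are hypothetical.

```python
from collections import Counter

WORD_BYTES = 4
NUM_BANKS = 4
PACKET_BASES = [0x1000, 0x1800, 0x2000, 0x2800]  # packets 1 to 4
SIZ_OFFSET = 0  # size field occupies bytes 0:3 of each packet


def bank_of(address):
    """Bank that holds the word containing this byte address."""
    return (address // WORD_BYTES) % NUM_BANKS


def siz_bank_histogram(bases, siz_offset):
    """Count how many size-field reads land on each bank."""
    return Counter(bank_of(base + siz_offset) for base in bases)


hist = siz_bank_histogram(PACKET_BASES, SIZ_OFFSET)
print(dict(hist))
```

All four reads fall on bank 0. Moving the field to a different fixed offset only moves the hot spot to a different bank, because every page-aligned packet base maps to the same bank.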
FIG. 2 highlights a delay to begin writing a packet to a multi-bank memory with fixed round-robin arbitration. A simple approach to accessing multiple banks of the shared memory is to use a fixed round-robin arbitration scheme. For 8 banks and 8 requesters, each bank may allow each requester to access the bank only once in 8 periods or time-slots. For example, the incoming-packet interface may be able to write to bank 0 only once every 8 time-slots. The bank that may be written by the incoming-packet interface for each time-slot is shown at the bottom of FIG. 2. For the first time-slot, bank 0 may be written, then bank 1 for the second time-slot, then bank 2 for the third time-slot, etc.
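A fixed round-robin schedule of this kind can be sketched in Python as follows. The sketch assumes that the incoming-packet interface is requester 0 and that the other requesters are staggered by their index so that no two requesters share a bank in the same time-slot; the text does not specify the stagger, and the name `bank_for_slot` is hypothetical.

```python
NUM_BANKS = 8  # 8 banks and 8 requesters, as in the example


def bank_for_slot(requester, time_slot, num_banks=NUM_BANKS):
    """Bank that a requester may access during a given time-slot.

    Assumption: requester r is offset by r banks from requester 0, so
    each bank serves exactly one requester per time-slot.
    """
    return (time_slot + requester) % num_banks


# Schedule seen by the incoming-packet interface (assumed requester 0).
interface_schedule = [bank_for_slot(0, t) for t in range(10)]
print(interface_schedule)

# Check that no two requesters collide on a bank in any time-slot.
for t in range(NUM_BANKS):
    banks = {bank_for_slot(r, t) for r in range(NUM_BANKS)}
    assert len(banks) == NUM_BANKS
```

The interface sees banks 0 through 7 in order and then returns to bank 0, so each bank is available to it once every 8 time-slots.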
When the start of packets must be aligned to the 2K pages as shown in FIG. 1, then the first word of a new incoming packet may only be written to bank 0. The first word of the packet may not be written to any of the other banks 1:7 since the packet would not be page-aligned. Thus the first word in the incoming packet must wait until it can be written to bank 0.
In the example shown in FIG. 2, incoming packet 1 is received when the packet interface is allowed to access bank 1. The next time-slot for bank 0 is 6 time-slots later. Thus the incoming-packet interface must wait for an additional 6 time-slot periods before the first word can be written to the memory at bank 0. This additional delay is undesirable. Additional buffers such as FIFOs may be needed to temporarily store incoming packets during this packet-start delay. As delays accumulate, these buffers may overflow, causing data loss and requiring packet re-transmission.
While words from packet 1 could be written to the shared memory in an out-of-order fashion, this is undesirable since the packet is received as a stream in ascending word order. Other packets may also be delayed, either due to the delay in starting the write of packet 1, or by a delay in writing the start of the new packet. Thus the delays may be cumulative. Also, as more banks of memory are added, the number of time-slots between accesses to bank 0 may also increase. Thus attempts to increase memory bandwidth by increasing the number of banks may increase packet-start delays.
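The growth of the packet-start delay with the number of banks can be sketched in Python. The counting convention here is an assumption chosen to match the example above: only the idle time-slots strictly between the arrival slot and the next bank 0 slot are counted, and a packet that arrives during the bank 0 slot is written immediately. The function names are hypothetical.

```python
def idle_slots_before_bank0(current_bank, num_banks):
    """Idle time-slots between the arrival slot and the next bank 0 slot.

    Assumption: the arrival slot itself is not counted, and arrival
    during the bank 0 slot allows an immediate write.
    """
    if not 0 <= current_bank < num_banks:
        raise ValueError("current_bank out of range")
    if current_bank == 0:
        return 0
    return num_banks - current_bank - 1


def worst_case_idle_slots(num_banks):
    """Largest idle count over all possible arrival slots."""
    return max(idle_slots_before_bank0(b, num_banks) for b in range(num_banks))


# The example above: 8 banks, packet arrives during the bank 1 slot.
print(idle_slots_before_bank0(1, 8))

# The worst case grows with the number of banks.
for banks in (4, 8, 16, 32):
    print(banks, worst_case_idle_slots(banks))
```

Under this convention the worst case is two slots fewer than the number of banks, so doubling the bank count roughly doubles the longest packet-start delay.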
What is desired is a multi-processor system with a shared memory that is divided into interleaved banks. It is desired to stream incoming packets into the shared memory using fixed round-robin arbitration, but without long packet-start delays. A high-bandwidth shared packet memory that has frequently-accessed fields in the packet headers spread across all banks is also desirable.