1. Field of the Invention
This invention generally relates to computer processing and, more particularly, to a system and method for equally distributing a packet processing load among parallel processors.
2. Description of the Related Art
As noted in Wikipedia, direct memory access (DMA) is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit (CPU). Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards and sound cards. DMA is also used for intra-chip data transfer in multi-core processors, especially in multiprocessor system-on-chips (SoCs), where each processing element is equipped with a local memory (often called scratchpad memory) and DMA is used for transferring data between the local memory and the main memory. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without a DMA channel. Similarly, a processing element inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, thus permitting computation and data transfer concurrency.
Without DMA, using programmed input/output (PIO) mode for communication with peripheral devices, or load/store instructions in the case of multicore chips, the CPU is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the CPU can initiate the transfer, do other operations while the transfer is in progress, and receive an interrupt from the DMA controller once the operation has been done. This is especially useful in real-time computing applications where not stalling behind concurrent operations is critical. Another and related application area is various forms of stream processing where it is essential to have data processing and transfer in parallel, in order to achieve sufficient throughput.
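The initiate, continue, and interrupt pattern described above can be illustrated with a minimal software simulation. This is only a sketch: a worker thread stands in for the DMA engine, a threading.Event stands in for the completion interrupt, and the name start_dma_transfer is hypothetical.

```python
import threading

def start_dma_transfer(src, dst, on_complete):
    """Start a simulated transfer on a worker thread and return immediately."""
    def engine():
        dst[:len(src)] = src   # the simulated DMA engine moves the data
        on_complete()          # stands in for the completion interrupt
    worker = threading.Thread(target=engine)
    worker.start()
    return worker

done = threading.Event()
source = bytearray(b"payload")
dest = bytearray(len(source))

worker = start_dma_transfer(source, dest, done.set)  # CPU initiates the transfer
other_work = sum(range(1000))                        # CPU does unrelated work meanwhile
done.wait()                                          # CPU learns the transfer is finished
worker.join()
```

In a PIO model, by contrast, the copy would run inline on the calling thread and the unrelated work could start only after the copy returned.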
A DMA transfer copies a block of memory from one device to another. While the CPU initiates the transfer by issuing a DMA command, it does not execute it. For so-called “third party” DMA, as is normally used with the ISA bus, the transfer is performed by a DMA controller which is typically part of the motherboard chipset. More advanced bus designs such as PCI typically use bus mastering DMA, where the device takes control of the bus and performs the transfer itself. In an embedded processor or multiprocessor system-on-chip, it is a DMA engine connected to the on-chip bus that actually administers the transfer of the data, in coordination with the flow control mechanisms of the on-chip bus.
A typical usage of DMA is copying a block of memory between system RAM and a buffer on the device. Such an operation usually does not stall the processor, which as a result can be scheduled to perform other tasks unless those tasks include a read from or write to memory. DMA is essential to high performance embedded systems. It is also essential in providing so-called zero-copy implementations of peripheral device drivers as well as functionalities such as network packet routing, audio playback, and streaming video. Multicore embedded processors (in the form of multiprocessor system-on-chip) often use one or more DMA engines in combination with scratchpad memories for both increased efficiency and lower power consumption. In computer clusters for high-performance computing, DMA among multiple computing nodes is often used under the name of remote DMA.
A general purpose programmable DMA controller is a software-managed programmable peripheral block charged with moving or copying data from one memory address to another memory address. The DMA controller provides a more efficient mechanism to perform large data block transfers, as compared to a conventional general purpose microprocessor. The employment of DMA controllers frees up the processor and software to perform other operations in parallel. Instruction sequences for the DMA, often referred to as control descriptors (CDs or descriptors), are set up by software and usually include a source address, destination address, and other relevant transaction information. A DMA controller may perform other functions such as data manipulations or calculations.
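As a rough illustration of a control descriptor, the sketch below models one as a record holding a source address, a destination address, and a length, and simulates the engine executing it against a byte array. The field names and the execute_descriptor function are assumptions made for illustration, not the layout of any particular controller.

```python
from dataclasses import dataclass

@dataclass
class ControlDescriptor:
    """Hypothetical control descriptor; field names are illustrative only."""
    src: int     # source address (offset into the simulated memory)
    dst: int     # destination address (offset into the simulated memory)
    length: int  # number of bytes to move

def execute_descriptor(memory: bytearray, cd: ControlDescriptor) -> None:
    """Simulate the DMA engine carrying out one descriptor."""
    memory[cd.dst:cd.dst + cd.length] = memory[cd.src:cd.src + cd.length]

# Software sets up the descriptor; the simulated engine performs the copy.
memory = bytearray(b"packet--........")
execute_descriptor(memory, ControlDescriptor(src=0, dst=8, length=6))
```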
Control descriptors are often assembled in groups called descriptor sequences or rings. Typically, the software control of a DMA controller is enabled through a device specific driver. The device driver is responsible for low level handshaking between upper layer software and the hardware. This device driver manages the descriptor rings, communicates with the DMA controller when work is pending, and communicates with upper layer software when work is complete.
It is possible that a DMA controller may be shared by many concurrent software threads running on one or more processors. Conventionally, a DMA controller maintains the logical concept of “channels”, whereby a channel provides the interface between a single software thread and the DMA controller. In other words, each software thread is associated with a channel. More concurrent driver threads require more DMA controller channels.
It has often been the practice to provide multiple channels to address two concerns: thread sharing and quality of service. Oftentimes, multiple concurrent threads are used in systems for the parallel processing of different aspects of data flow. Where multiple independent threads are deployed, sharing a common DMA controller can be cumbersome to manage. The software device driver in this case must not only provide the DMA controller with communications, but must also manage an arbitration scheme with upper layer software threads to determine which work gets done next. If this work is being carried out by multiple microprocessors, the coordination between threads becomes very complicated. This coordination requires a certain level of software handshaking to determine which thread gets access to the controller at any particular time.
From a quality of service perspective, it is common to have higher and lower priority activities. For example, a software thread may queue a low priority transfer for a DMA controller. At some later time, a different thread may need to run a higher priority task on the same DMA controller. The ability to pre-empt low priority activities with higher priority tasks is a highly desired feature. Without such capability, a high priority operation must wait until a low priority operation is completed.
A multi-channel DMA controller addresses these issues where different software threads can be bound to specific channels and the underlying DMA controller hardware sorts out the access profile to memory based upon channel priorities. A disadvantage of the channel approach is that there are a limited number of hardware channels. If more logical threads exist than physical channels, then a software mechanism must once again be deployed to take care of the resource contention.
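The channel arbitration described above can be sketched as follows. This is a simplified software model, assuming one queue per channel and a fixed numeric priority per channel (lower number wins); the class name MultiChannelDma and its methods are hypothetical.

```python
from collections import deque

class MultiChannelDma:
    """Toy model of a multi-channel controller with fixed channel priorities."""

    def __init__(self, priorities):
        # priorities maps channel id -> priority value (lower value = served first)
        self.priorities = dict(priorities)
        self.queues = {channel: deque() for channel in self.priorities}

    def submit(self, channel, job):
        """A software thread bound to a channel queues a job on it."""
        if channel not in self.queues:
            # More threads than hardware channels: software must resolve this.
            raise ValueError("no such hardware channel")
        self.queues[channel].append(job)

    def next_job(self):
        """Pick the oldest job from the highest-priority channel with work."""
        pending = [ch for ch, queue in self.queues.items() if queue]
        if not pending:
            return None
        channel = min(pending, key=lambda ch: self.priorities[ch])
        return channel, self.queues[channel].popleft()

dma = MultiChannelDma({0: 1, 1: 0})   # channel 1 outranks channel 0
dma.submit(0, "low priority copy")
dma.submit(1, "high priority copy")
first = dma.next_job()
second = dma.next_job()
```

The ValueError branch marks the limitation noted in the text: once the fixed set of channels is exhausted, contention has to be handled in software again.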
DMA controllers must also maintain a certain level of atomicity with respect to the execution of operations. This means that operational sequences must complete in the order programmed by software. This is typically accomplished using a run-to-completion model whereby a DMA channel completes all operations of a first CD before moving on to the next CD. However, a brute-force run-to-completion methodology may prevent the data moving engine from performing unrelated operations (operations from different CD lists) in parallel, even if the engine is capable.
The communication between software and the DMA controller hardware is typically handled through a programmed input/output (IO) interface. That is, the software device driver programs control registers within the DMA channel, which causes the DMA to carry out the desired action. When the DMA controller is finished it communicates back to software, either through use of a hardware interrupt request, or through setting of a status bit that is polled by software. The software must wait until the current instruction sequence is complete before programming the next sequence. During the software/hardware handshake period the DMA channel is idle waiting for the next CD, thus resulting in dead time that could have been used for real work. To overcome this dead time, DMA controllers may deploy the concept of control descriptor sequences (CDS) and descriptor rings.
The descriptor ring provides a form of FIFO where the software adds new items in memory for the DMA channel at the tail of the ring, while the DMA controller processes CDs from the head of the ring. In this way, the software manages the tail pointer and the hardware (HW) manages the head pointer. Such schemes have the disadvantage of requiring software overhead to keep track of pointers.
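A descriptor ring of this kind can be sketched as a fixed-size circular buffer. The model below is an assumption-laden simplification: the class name DescriptorRing is hypothetical, and an explicit count is kept to distinguish a full ring from an empty one, whereas real hardware typically compares the head and tail pointers directly.

```python
class DescriptorRing:
    """Toy circular buffer: software posts at the tail, hardware consumes at the head."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0   # next slot the hardware will consume
        self.tail = 0   # next slot the software will fill
        self.count = 0  # occupied slots (simplification, see lead-in)

    def post(self, cd):
        """Software side: add a descriptor at the tail; False if the ring is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = cd
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def consume(self):
        """Hardware side: take the descriptor at the head; None if the ring is empty."""
        if self.count == 0:
            return None
        cd = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return cd

ring = DescriptorRing(2)
posted = [ring.post("cd-a"), ring.post("cd-b"), ring.post("cd-c")]  # third post finds the ring full
first = ring.consume()
```

The pointer bookkeeping in post and consume is the software overhead the text refers to.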
A direct CD embodies the command descriptor completely within the command message. A pointer CD contains a pointer to a memory location where a single command descriptor, or a linked list of command descriptors, is located. In this case, a single command message may actually represent a number of chained command descriptors.
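The difference between the two message forms can be sketched as below. All names here (CommandDescriptor, DirectMessage, PointerMessage, expand) are hypothetical, and a dictionary stands in for memory so that an address can be resolved to a descriptor.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CommandDescriptor:
    src: int
    dst: int
    length: int
    next: Optional["CommandDescriptor"] = None  # link to the next chained descriptor

@dataclass
class DirectMessage:
    descriptor: CommandDescriptor  # the descriptor travels inside the message

@dataclass
class PointerMessage:
    address: int  # location in (simulated) memory of the first descriptor

def expand(message, memory):
    """Return the list of descriptors that one command message represents."""
    if isinstance(message, DirectMessage):
        return [message.descriptor]
    chain = []
    cd = memory[message.address]
    while cd is not None:  # walk the linked list
        chain.append(cd)
        cd = cd.next
    return chain

tail_cd = CommandDescriptor(src=64, dst=128, length=32)
head_cd = CommandDescriptor(src=0, dst=256, length=16, next=tail_cd)
memory = {0x1000: head_cd}

direct = expand(DirectMessage(tail_cd), memory)
chained = expand(PointerMessage(0x1000), memory)
```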
In many SoC multicore (multi-processor) systems, an interrupt from an external device is assigned to a single CPU core. For example, in a packet processing application an Ethernet device receives a packet and triggers an interrupt to a CPU core. The CPU core acknowledges the interrupt, retrieves the packet context, and processes the packet. The CPU core must finish processing this packet before it can process the next packet coming in via the same Ethernet port, which delays processing of the second packet.
In a multicore system with multiple Ethernet ports, each port can be assigned to a different CPU core. This is called interrupt affinity. In this case, two packets can be processed in parallel by two different CPU cores if the packets come in on different ports. However, interrupt affinity prevents the CPU cores from being used at full capacity when one port is receiving data and another port is receiving a lesser amount of data, or no data at all. In this case, one or more cores can sit idle while another core is processing packets, delaying the processing of other packets on the same interface.
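The imbalance can be made concrete with a small count. The sketch below assumes a hypothetical two-port system with a static port-to-core mapping; the port names and traffic mix are invented for illustration.

```python
def dispatch_by_port(packet_ports, affinity):
    """Count packets handled per core when each port is bound to one core."""
    load = {core: 0 for core in affinity.values()}
    for port in packet_ports:
        load[affinity[port]] += 1
    return load

affinity = {"eth0": 0, "eth1": 1}     # static interrupt affinity
traffic = ["eth0"] * 9 + ["eth1"]     # nine packets on eth0, one on eth1
load = dispatch_by_port(traffic, affinity)
```

With this traffic, core 0 handles nine packets while core 1 handles one, and nothing in the mapping allows core 1 to take over part of the eth0 backlog.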
It is possible to process interrupts in parallel in a multicore system. Pipelined packet processing divides complete packet processing into multiple stages. For example, stage 1 is ingress processing, stage 2 is forwarding, and stage 3 is egress processing. In this case it is possible to perform stage 1 and stage 2 processing in CPU 0, and stage 3 processing in CPU 1. The problem with this method is that each CPU is given a fixed allocation of tasks to be performed. If the processing workload is uneven, there is no means to redistribute the workload dynamically.
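The effect of a fixed stage allocation can be sketched with a simple cost model. The stage-to-CPU mapping follows the example in the text; the per-stage costs are invented numbers chosen only to show an uneven workload.

```python
# Fixed allocation from the example: stages 1 and 2 on CPU 0, stage 3 on CPU 1.
STAGE_TO_CPU = {"ingress": 0, "forwarding": 0, "egress": 1}

def cpu_busy_time(stage_cost, num_packets):
    """Total busy time per CPU for a given per-packet cost of each stage."""
    busy = {cpu: 0 for cpu in set(STAGE_TO_CPU.values())}
    for stage, cpu in STAGE_TO_CPU.items():
        busy[cpu] += stage_cost[stage] * num_packets
    return busy

# Hypothetical per-packet costs in arbitrary time units.
busy = cpu_busy_time({"ingress": 5, "forwarding": 4, "egress": 1}, num_packets=100)
```

Under these assumed costs CPU 0 is busy for 900 units and CPU 1 for 100, and the static mapping offers no way to shift work from one to the other.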
With transmission control protocol (TCP) flow pinning, the packets are divided into groups that are processed by particular CPUs based upon packet type. For example, even if the packets are coming in from the same Ethernet port, they may be delivered to different CPUs based on some pre-classification. That is, a particular TCP flow is always given to an associated CPU. However, if one flow carries more packets than another flow, the CPUs are not utilized efficiently. Further, if packets keep coming from only one flow, then only one CPU is used. This method also limits the load balancing to certain packet types.
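One common way to pin a flow to a CPU is to hash the flow identifier. The sketch below assumes a flow is identified by its source and destination address and port, and uses a CRC32 hash; both choices are illustrative assumptions, not the pre-classification of any specific system.

```python
import zlib

def cpu_for_flow(flow, num_cpus):
    """Map a flow tuple to a CPU index; the same flow always maps to the same CPU."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_cpus

# Hypothetical flow: (source address, source port, destination address, destination port).
flow_a = ("10.0.0.1", 1234, "10.0.0.2", 80)

# One hundred packets from a single flow all land on one CPU.
cpus_used = {cpu_for_flow(flow_a, num_cpus=4) for _ in range(100)}
```

Because the mapping is deterministic, traffic from a single heavy flow never spreads beyond one CPU, which is the inefficiency the text describes.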
Another solution is to trigger the interrupt to all the cores when a packet is received on one port. All the cores acknowledge the interrupt, and they all compete to obtain a global lock. Whichever core gets the lock first processes the packet. If a CPU is not able to get the lock, it sends an end of interrupt indication. The drawback to this method is that the cores waste time in competing for the global lock. Since this competition occurs for every interrupt, a significant amount of processor time is wasted. Also, there is no software control or flexibility in the system, and no predictability as to which core will process a packet. Further, no weight can be given to any CPU. For example, it is impossible to configure a system to grant 6 interrupts to CPU 0 for every 4 interrupts to CPU 1, as the competition for the global lock is completely random.
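The cost of the global-lock approach can be sketched with a small simulation. This is a loose model under stated assumptions: the lock winner is drawn uniformly at random with a seeded generator, and every losing core is counted as one wasted acknowledgment. The function name simulate_global_lock is hypothetical.

```python
import random

def simulate_global_lock(num_cores, num_interrupts, seed=0):
    """Model every core acknowledging each interrupt and one random core winning the lock."""
    rng = random.Random(seed)
    wins = [0] * num_cores
    wasted_acks = 0
    for _ in range(num_interrupts):
        winner = rng.randrange(num_cores)  # assumption: the winner is uniformly random
        wins[winner] += 1
        wasted_acks += num_cores - 1       # all other cores acknowledged for nothing
    return wins, wasted_acks

wins, wasted_acks = simulate_global_lock(num_cores=2, num_interrupts=10)
```

In this model the number of wasted acknowledgments grows with both the core count and the interrupt rate, and the split of wins between cores is determined by chance rather than by any configured weight such as 6 to 4.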
It would be advantageous if the above-mentioned problems could be solved by balancing the load between multiple cores that are parallel processing packets.