1. Field of the Invention
This invention generally relates to computer processing and, more particularly, to a system and method for processing packets with a reduced number of memory accesses.
2. Description of the Related Art
As noted in Wikipedia, direct memory access (DMA) is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit (CPU). Many hardware systems use DMAs, including disk drive controllers, graphics cards, network cards and sound cards. DMA is also used for intra-chip data transfer in multi-core processors, especially in multiprocessor system-on-chips (SoCs), where each processing element is equipped with a local memory (often called scratchpad memory) and DMA is used for transferring data between the local memory and the main memory. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without a DMA channel. Similarly, a processing element inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, thus permitting computation and data transfer concurrency.
Without DMA, using programmed input/output (PIO) mode for communication with peripheral devices, or load/store instructions in the case of multicore chips, the CPU is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the CPU can initiate the transfer, do other operations while the transfer is in progress, and receive an interrupt from the DMA controller once the operation has been done. This is especially useful in real-time computing applications where not stalling behind concurrent operations is critical. Another and related application area is various forms of stream processing where it is essential to have data processing and transfer in parallel, in order to achieve sufficient throughput.
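By way of illustration only, the initiate, continue, and interrupt pattern described above may be sketched in software as follows. The sketch models the DMA controller as a background thread and the completion interrupt as an event; the names start_dma and on_done are hypothetical and do not correspond to any particular hardware or driver interface.

```python
import threading

def start_dma(src, dst, on_done):
    """Start a background copy and return at once; on_done models the interrupt."""
    def transfer():
        dst[:] = src      # the "controller" moves the block, not the caller
        on_done()         # signal completion, as a DMA interrupt would
    worker = threading.Thread(target=transfer)
    worker.start()
    return worker

src = bytes(range(16))
dst = bytearray(16)
done = threading.Event()

worker = start_dma(src, dst, done.set)   # the CPU initiates the transfer
other_work = sum(range(1000))            # and is free to do other operations
done.wait()                              # until completion is signalled
worker.join()
```

In a PIO arrangement, by contrast, the copy would be performed inline by the caller, and the other work could not begin until the copy had finished.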
A DMA transfer copies a block of memory from one device to another. While the CPU initiates the transfer by issuing a DMA command, it does not execute it. For so-called “third party” DMA, as is normally used with the ISA bus, the transfer is performed by a DMA controller which is typically part of the motherboard chipset. More advanced bus designs such as PCI typically use bus mastering DMA, where the device takes control of the bus and performs the transfer itself. In an embedded processor or multiprocessor system-on-chip, it is a DMA engine connected to the on-chip bus that actually administers the transfer of the data, in coordination with the flow control mechanisms of the on-chip bus.
A typical usage of DMA is copying a block of memory between system RAM and a buffer on the device. Such an operation usually does not stall the processor, which as a result can be scheduled to perform other tasks unless those tasks include a read from or write to memory. DMA is essential to high performance embedded systems. It is also essential in providing so-called zero-copy implementations of peripheral device drivers as well as functionalities such as network packet routing, audio playback, and streaming video. Multicore embedded processors (in the form of multiprocessor system-on-chip) often use one or more DMA engines in combination with scratchpad memories for both increased efficiency and lower power consumption. In computer clusters for high-performance computing, DMA among multiple computing nodes is often used under the name of remote DMA.
A general purpose programmable DMA controller is a software-managed programmable peripheral block charged with moving or copying data from one memory address to another memory address. The DMA controller provides a more efficient mechanism to perform large data block transfers, as compared to a conventional general purpose microprocessor. The employment of DMA controllers frees up the processor and software to perform other operations in parallel. Instruction sequences for the DMA, often referred to as control descriptors (CDs or descriptors), are set up by software and usually include a source address, destination address, and other relevant transaction information. A DMA controller may perform other functions such as data manipulations or calculations.
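As a minimal sketch of the control descriptor concept, the following models a descriptor carrying a source address, a destination address, and a length, together with a software stand-in for the controller that executes it. The class name ControlDescriptor, its field names, and the function dma_execute are assumptions made for illustration; actual descriptor layouts are device specific.

```python
from dataclasses import dataclass

@dataclass
class ControlDescriptor:
    """Hypothetical control descriptor; field names are illustrative only."""
    src: int      # source offset into memory
    dst: int      # destination offset into memory
    length: int   # number of bytes to move

def dma_execute(memory: bytearray, cd: ControlDescriptor) -> None:
    """Software model of the controller copying the block described by cd."""
    memory[cd.dst:cd.dst + cd.length] = memory[cd.src:cd.src + cd.length]

memory = bytearray(64)
memory[0:4] = b"\xde\xad\xbe\xef"

# Software sets up the descriptor; the "controller" performs the copy.
dma_execute(memory, ControlDescriptor(src=0, dst=32, length=4))
```

After the call, the four bytes at offset 0 also appear at offset 32, and the source block is unchanged.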
Control descriptors are often assembled in groups called descriptor sequences or rings. Typically, the software control of a DMA controller is enabled through a device specific driver. The device driver is responsible for low level handshaking between upper layer software and the hardware. This device driver manages the descriptor rings, communicates with the DMA controller when work is pending, and communicates with upper layer software when work is complete.
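The driver and controller roles around a descriptor ring may be sketched as below. This is a simplified software model, not a description of any particular device: the class DescriptorRing, its methods post and service, and the on_complete callback are hypothetical names, and a real controller would consume descriptors asynchronously rather than in a single call.

```python
from collections import namedtuple

Descriptor = namedtuple("Descriptor", ["src", "dst", "length"])

class DescriptorRing:
    """Fixed-size circular queue of descriptors shared by driver and controller."""
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0    # next slot the driver fills
        self.tail = 0    # next slot the controller consumes
        self.count = 0   # descriptors currently pending

    def post(self, desc):
        """Driver side: queue work; returns False when the ring is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.head] = desc
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def service(self, memory, on_complete):
        """Controller side: execute all pending descriptors in order."""
        while self.count:
            d = self.slots[self.tail]
            memory[d.dst:d.dst + d.length] = memory[d.src:d.src + d.length]
            self.tail = (self.tail + 1) % len(self.slots)
            self.count -= 1
            on_complete(d)   # driver is told the work is complete

memory = bytearray(b"abcdefgh" + bytes(8))
ring = DescriptorRing(size=2)
completed = []

ring.post(Descriptor(0, 8, 4))
ring.post(Descriptor(4, 12, 4))
full = ring.post(Descriptor(0, 8, 1))   # ring is full, so this is rejected
ring.service(memory, completed.append)
```

The head and tail indices wrap around the fixed array, which is why such structures are referred to as rings.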
The use of a DMA can lead to cache coherency problems. Consider a CPU equipped with a cache and an external memory that can be accessed directly by devices using DMA. When the CPU accesses location X in the memory, the current value is stored in the cache. Subsequent operations on X update the cached copy of X, but not the external memory version of X. If the cache is not flushed to the memory before the next time a device tries to access X, the device receives a stale value of X. Similarly, if the cached copy of X is not invalidated when a device writes a new value to the memory, then the CPU operates on a stale value of X.
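Both stale-value cases can be reproduced with a toy model. The sketch below assumes a write-back cache represented by a dictionary placed in front of a second dictionary standing in for external memory; the class WriteBackCache and its method names are hypothetical and chosen only to mirror the terms used above.

```python
class WriteBackCache:
    """Toy write-back cache in front of a dict standing in for external memory."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}      # address -> cached value
        self.dirty = set()   # addresses modified in cache but not in memory

    def cpu_read(self, addr):
        if addr not in self.lines:               # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def cpu_write(self, addr, value):
        self.lines[addr] = value                 # memory is not updated yet
        self.dirty.add(addr)

    def flush(self, addr):
        if addr in self.dirty:                   # write the dirty copy back
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        self.lines.pop(addr, None)               # drop the cached copy
        self.dirty.discard(addr)

X = 0x100
memory = {X: 1}
cache = WriteBackCache(memory)

cache.cpu_read(X)
cache.cpu_write(X, 2)
device_sees_before_flush = memory[X]             # device reads memory: stale
cache.flush(X)
device_sees_after_flush = memory[X]

memory[X] = 3                                    # device writes X by DMA
cpu_sees_before_invalidate = cache.cpu_read(X)   # stale cached copy
cache.invalidate(X)
cpu_sees_after_invalidate = cache.cpu_read(X)
```

In this model the device observes 1 before the flush and 2 after it, while the CPU observes 2 before the invalidation and 3 after it.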
This issue can be addressed in one of two ways in system design: Cache-coherent systems implement a method in hardware whereby external writes are signaled to the cache controller which then performs a cache invalidation for DMA writes, or cache flush for DMA reads. Non-coherent systems leave this to software, where the OS must then ensure that the cache lines are flushed before an outgoing DMA transfer is started and invalidated before a memory range affected by an incoming DMA transfer is accessed. The OS must make sure that the memory range is not accessed by any running threads in the meantime. The latter approach introduces some overhead to the DMA operation, as most hardware requires a loop to invalidate each cache line individually.
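The per-line overhead of the software-managed approach can be sketched as follows. The cache line size of 32 bytes, the buffer address, and the function names are assumptions for illustration only; the cache and memory are modeled as dictionaries keyed by line base address, with one value standing in for the contents of a whole line.

```python
CACHE_LINE = 32  # bytes; assumed line size for illustration

def line_addresses(start, length, line=CACHE_LINE):
    """Base address of every cache line touched by [start, start + length)."""
    first = start // line * line
    last = (start + length - 1) // line * line
    return list(range(first, last + 1, line))

def flush_range(cache, memory, start, length):
    """Before an outgoing DMA transfer: write cached lines back to memory."""
    ops = 0
    for base in line_addresses(start, length):
        if base in cache:
            memory[base] = cache[base]
        ops += 1                      # one maintenance operation per line
    return ops

def invalidate_range(cache, start, length):
    """Before the CPU reads a buffer filled by incoming DMA: drop cached lines."""
    ops = 0
    for base in line_addresses(start, length):
        cache.pop(base, None)
        ops += 1
    return ops

BUF = 0x1000                          # hypothetical, line-aligned buffer address
cache = {BUF: "new"}
memory = {BUF: "old"}

flush_ops = flush_range(cache, memory, BUF, 1500)      # 1500-byte packet buffer
invalidate_ops = invalidate_range(cache, BUF, 1500)
```

Under these assumptions a 1500-byte buffer spans 47 cache lines, so each flush or invalidation of the buffer costs 47 loop iterations.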
Communication systems in a wide variety of market segments, such as packet processing routers, wireless access points, wireless base stations, media gateways, networked access storage devices, cloud computing, and many more, need to support fast packet processing. To that end, CPU speeds have significantly improved over the years. Today CPUs run as fast as 2.0 GHz, and multiple CPUs are available to process a packet in a single chip. The problem is that the performance of memory technology has not kept pace. Even with multiple powerful CPUs to process packets, memory access latency has become the bottleneck in packet processing.
To improve memory access, the industry has evolved to a fast L2 cache implementation. Instead of the CPU accessing data in system memory, a cache controller can bring this data into the cache, making CPU access much faster. However, the CPU can only access a packet header after the header is loaded into the cache; until then, the CPU must wait for the data to become available in the cache. On top of that, many applications require the actual packet data, so that out-of-order execution is not possible.
The cache-based solution is limited. It works well only when the data is already in the cache before the CPU attempts access. But to share the data between the CPU and other hardware devices, the memory address must be made coherent. To make the memory address coherent, a snooping algorithm needs to run for every access to memory. This results in a performance penalty, as each access must snoop the system bus to find the latest copy of the data at a particular address. The snooping takes as long as 30 to 40 CPU cycles.
Further, when the hardware needs to access cached data, the snooping algorithm generally flushes the data back to system memory and invalidates the cache for this address. This results in even greater performance penalties, as the data needs to be written to system memory first, and then needs to be read by hardware from system memory. On top of that, cache invalidation logic also needs to run.
Overall, the following latencies need to be considered for complete packet processing and forwarding:
Latency 1: A packet arrives at an Ethernet Medium Access Control (MAC) hardware interface. The Ethernet MAC hardware uses the DMA to send the data to system memory (i.e. Double Data Rate (DDR) memory) from an internal first-in first-out (FIFO) memory.
Latency 2: The CPU tries to access the packet header for packet processing. There is a cache miss for this access, and system memory is accessed to bring the packet header into the cache. When the CPU is done processing the packet, it sends the Ethernet MAC an indication that the packet is to be sent out. At this time nothing happens, as the cache is configured as write-back.
Latency 3: The Ethernet MAC hardware tries to access this data from memory. As the data was cached, the snooping logic flushes the data from the cache to the DDR, and invalidates the cache. So, there is a latency in copying data from the cache to the DDR.
Latency 4: The Ethernet MAC uses the DMA to send the data from the DDR to its internal FIFO for egress. Here also, the DDR needs to be accessed.
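The four latencies listed above each involve an access to the DDR, which the following sketch tallies for a single packet. The list of accesses follows the description above; the per-access cost DDR_ACCESS_CYCLES is a placeholder chosen only to make the arithmetic concrete and is not a measured figure.

```python
DDR_ACCESS_CYCLES = 100   # placeholder cost per DDR access; not a measured value

def conventional_path():
    """Ordered DDR accesses for one packet, following Latency 1 through 4."""
    return [
        ("Latency 1", "write", "MAC DMA: FIFO to DDR"),
        ("Latency 2", "read",  "cache miss: DDR to cache (packet header)"),
        ("Latency 3", "write", "snoop flush: cache to DDR"),
        ("Latency 4", "read",  "MAC DMA: DDR to FIFO"),
    ]

def total_cycles(accesses, cycles_per_access=DDR_ACCESS_CYCLES):
    """Total DDR cost under the uniform placeholder cost per access."""
    return len(accesses) * cycles_per_access

accesses = conventional_path()
ddr_writes = sum(1 for _, kind, _ in accesses if kind == "write")
ddr_reads = sum(1 for _, kind, _ in accesses if kind == "read")
cycles = total_cycles(accesses)
```

Under this simplified accounting, a packet incurs two DDR writes and two DDR reads between ingress and egress, in addition to the snoop and invalidation overhead that the model does not attempt to quantify.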
It would be advantageous if system memory access latencies, such as those due to bus snoop logic, cache flushes, and invalidation logic, could be minimized in packet processing.