Typically, an embedded system memory controller services several clients, each with a different data transfer pattern. For example, some clients typically transfer large bursts of data (e.g., 4 KB) to or from memory; these are the most efficient, highest-bandwidth transfers. Other clients, such as the system management processor, typically generate smaller burst transfers in order to service a local cache memory. Still other clients, such as embedded processing nodes, generate many small read and write operations to system memory in order to manipulate small pieces of metadata or state variables. These are typically RMW (Read-Modify-Write) sequences.
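As an illustration of the RMW pattern described above, the following minimal sketch models a client updating a small state variable held in system memory. The `ToyMemory` class, the `client_increment` helper, and the 4-byte little-endian counter layout are hypothetical and chosen only for illustration; they are not taken from any particular controller. The sketch shows that each update costs one read and one write across the memory interface.

```python
class ToyMemory:
    """Hypothetical byte-addressable memory that counts interface transactions."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.reads = 0
        self.writes = 0

    def read(self, addr, length):
        self.reads += 1
        return bytes(self.data[addr:addr + length])

    def write(self, addr, payload):
        self.writes += 1
        self.data[addr:addr + len(payload)] = payload


def client_increment(mem, addr, delta):
    """Client-side RMW of a 4-byte little-endian counter (assumed layout)."""
    old = int.from_bytes(mem.read(addr, 4), "little")  # Read
    new = (old + delta) & 0xFFFFFFFF                    # Modify (wraps at 32 bits)
    mem.write(addr, new.to_bytes(4, "little"))          # Write
    return new


mem = ToyMemory(64)
client_increment(mem, 8, 5)
client_increment(mem, 8, 7)
print(mem.reads, mem.writes)  # two reads and two writes for two updates
```

In this sketch nothing prevents another client from writing the same address between the read and the write, which is why such sequences need some form of atomicity when the memory is shared.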
Nearly all processors implement atomic operations to support efficient synchronization primitives in symmetric multiprocessor environments. However, these primitives tend to map to register-sized (8-byte) operations that can be expressed as a single machine instruction. Standard implementations must flush all processor cores' caches to implement atomic operations correctly, so such operations are limited to a small number of expensive, high-latency instructions.
Embedded applications using a systolic array of independent processors with message passing interfaces have very different requirements for memory access. In one non-limiting example, a processor in a systolic array may request an atomic add of two integers. A standard implementation would provide an “atomic add” machine instruction. However, each processor in the systolic array shares access to a central memory controller with thousands of other cores. Blocking the processor while the transaction is in progress would be extremely detrimental to the efficiency and scalability of the array.
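The cost of blocking can be made concrete with a simple back-of-the-envelope model. The sketch below is illustrative only: the function names and the cycle counts (50 cycles of useful work per atomic add, a 200-cycle round trip to the memory controller, 4096 cores) are assumptions rather than measured values. It computes the fraction of time a core does useful work if it stalls for the full round trip of every atomic operation.

```python
def blocking_utilization(work_cycles, latency_cycles):
    """Fraction of cycles spent on useful work when a core stalls
    for the full memory round trip after each unit of work."""
    return work_cycles / (work_cycles + latency_cycles)


def stalled_cycles_per_round(num_cores, latency_cycles):
    """Total cycles lost across the array if every core issues one
    blocking atomic operation."""
    return num_cores * latency_cycles


# Hypothetical figures, for illustration only.
utilization = blocking_utilization(work_cycles=50, latency_cycles=200)
lost = stalled_cycles_per_round(num_cores=4096, latency_cycles=200)
print(utilization, lost)
```

Under these assumed figures each core is busy only one fifth of the time, and the loss grows linearly with the number of cores contending for the controller.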
In another non-limiting example, a client may want to “zero” a large range of system memory. Traditionally, this would require the client to physically transfer the complete “zero” data pattern from itself to the memory controller for subsequent writing to the physical memory. Transferring a long sequence of write data that consists entirely of zeros is inefficient and consumes both client bandwidth and client-to-memory-controller interface bandwidth.
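To give a sense of scale for the traditional approach, the following sketch counts the interface traffic needed to zero a region by writing the zero pattern explicitly. The 4 KB burst size follows the example given earlier in the text; the 1 MiB region size and the function name `zero_fill_traffic` are assumptions made only for illustration.

```python
BURST_BYTES = 4 * 1024  # 4 KB burst size, as in the earlier example


def zero_fill_traffic(region_bytes, burst_bytes=BURST_BYTES):
    """Return (number of write bursts, payload bytes sent over the
    client-to-memory-controller interface) when the client transfers
    the zero pattern itself."""
    bursts = -(-region_bytes // burst_bytes)  # ceiling division
    return bursts, region_bytes


bursts, payload = zero_fill_traffic(1024 * 1024)  # hypothetical 1 MiB region
print(bursts, payload)  # 256 bursts carrying 1048576 bytes of zeros
```

Every payload byte in this transfer is zero, so the data carries no information beyond the start address and length of the region.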