Modern digital signal processors (DSPs) face multiple challenges. Workloads continue to grow, requiring ever more bandwidth. Systems on a chip (SoCs) continue to grow in size and complexity. Memory system latency severely impacts certain classes of algorithms. As transistors get smaller, memories and registers become less reliable. As software stacks get larger, the number of potential interactions and errors becomes larger. Even wires become an increasing challenge. Wide busses are difficult to route. Wire speeds continue to lag transistor speeds. Routing congestion is a continual challenge.
To a first order, bus bandwidth is proportional to the width of the bus in bits times the bus clock rate. Increasing bandwidth to the processor requires a wider bus running at a faster clock rate. However, that can lead to more wires and greater latency, because faster clock rates typically require greater pipelining. More wires produce more routing issues. Thus the processor's bandwidth needs lead to either lower clock rates, overly large chips, or both.
Memory systems continue to provide scalability challenges to the central processing unit (CPU). In the Texas Instruments C6600 family of DSPs, the level one data (L1D) cache line is 64 bytes long, and the CPU can consume 16 bytes per cycle. That means that for high-bandwidth streaming workloads (those that include no significant data reuse), the CPU can consume an entire cache line every 4 cycles. It costs a minimum of 7 cycles to read a new line into the cache on this family of DSPs, and generally more cycles are needed. Streaming workloads therefore pay a very large cache penalty even if all their data resides in level two (L2) RAM. The in-order nature of the CPU limits the ability to hide this latency penalty.
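The arithmetic above can be checked with a minimal sketch. The line size, consumption rate, and minimum fill cost come from the text; the efficiency figure rests on an added assumption that a line fill is not overlapped with consumption of the previous line, which is a simplification for illustration and not a statement about the actual pipeline.

```python
# Figures taken from the text.
LINE_BYTES = 64               # L1D cache line size
CONSUME_BYTES_PER_CYCLE = 16  # CPU consumption rate
MIN_FILL_CYCLES = 7           # minimum cycles to read a new line into L1D


def cycles_to_consume_line(line_bytes: int, bytes_per_cycle: int) -> int:
    """Cycles the CPU needs to consume one full cache line."""
    return line_bytes // bytes_per_cycle


def streaming_efficiency(consume_cycles: int, fill_cycles: int) -> float:
    """Fraction of cycles spent consuming data.

    ASSUMPTION (not from the text): the fill and the consumption of a line
    are fully serialized, so each line costs consume + fill cycles.
    """
    return consume_cycles / (consume_cycles + fill_cycles)


consume_cycles = cycles_to_consume_line(LINE_BYTES, CONSUME_BYTES_PER_CYCLE)
efficiency = streaming_efficiency(consume_cycles, MIN_FILL_CYCLES)
print(f"cycles per line: {consume_cycles}, efficiency: {efficiency:.1%}")
```

Under that serialized assumption, a streaming loop spends more cycles waiting for each line than consuming it, even at the minimum fill cost.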
The level two (L2) cache lines for this DSP family are 128 bytes long. Due to limited buffering, the L2 controller can only issue four L2 line fetches at a time. The round-trip latency to fulfill those fetches, though, ranges from about 20 cycles for multicore shared memory controller (MSMC) RAM to possibly hundreds of cycles for third-level double data rate (DDR3) memory. A prefetcher helps, but it has a limited number of outstanding prefetches. Assuming a 100-cycle round-trip latency, a maximum of 12 outstanding line fill requests or prefetches (48 data phases total), and a 256-bit bus operating at the CPU clock rate, the bus utilization reaches only about 48%. Thus even the best case yields poor bus bandwidth utilization. Deeper buffering could help, but the in-order nature of the CPU would make deeper buffering difficult to use. Real-world usage patterns would produce far lower bus utilization.
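The 48% figure follows directly from the stated parameters, as the sketch below shows. It assumes one data phase per bus clock cycle and that at most the stated number of requests can be in flight during each round trip. The function names are illustrative only.

```python
import math


def bus_utilization(outstanding_lines: int, line_bytes: int,
                    bus_bits: int, round_trip_cycles: int):
    """Return (data phases per round trip, fraction of bus cycles used).

    ASSUMPTION: one data phase transfers bus_bits per bus clock cycle, and
    only `outstanding_lines` requests can be in flight per round trip.
    """
    bytes_per_phase = bus_bits // 8
    phases_per_line = line_bytes // bytes_per_phase
    total_phases = outstanding_lines * phases_per_line
    return total_phases, min(1.0, total_phases / round_trip_cycles)


def lines_needed_for_full_utilization(line_bytes: int, bus_bits: int,
                                      round_trip_cycles: int) -> int:
    """Outstanding line requests needed to keep the bus busy every cycle."""
    phases_per_line = line_bytes // (bus_bits // 8)
    return math.ceil(round_trip_cycles / phases_per_line)


# Parameters from the text: 12 outstanding requests, 128-byte L2 lines,
# 256-bit bus, 100-cycle round trip.
phases, utilization = bus_utilization(12, 128, 256, 100)
needed = lines_needed_for_full_utilization(128, 256, 100)
print(f"{phases} data phases, {utilization:.0%} utilization, "
      f"{needed} outstanding lines needed for 100%")
```

Under the same assumptions, roughly twice as many outstanding requests would be needed to saturate the bus at a 100-cycle latency, which is why deeper buffering is the obvious lever.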
One way to avoid cache overheads is to transfer data directly into the DSP's local memories (L1D and L2 RAM). The C6600 family of DSPs provides an SDMA port permitting system peripherals to read and write the DSP's local memories. However, the SDMA busses are completely separate from the busses used by the caches. Thus the SDMA busses are smaller and slower than peak applications require. The C6600 family of DSPs has similar bus duplication to keep demand traffic, direct memory access (DMA) traffic, and snoop traffic separate.
Thus memory system overheads limit performance, and traditional approaches don't scale well. Applications continue to demand increasing performance. In addition to raw performance, today's SoCs also present interesting software integration challenges such as: communication between general-purpose operating systems (OSes) and the DSP (coherency); addressing large amounts of memory (large physical address); and isolating distinct tasks from each other (system integrity). Existing DSPs provide workable, if subpar, solutions to these problems.
The C6600 family of DSPs does not provide system-level hardware coherency. It relies on software handshakes and software-managed coherence between processors, system hardware, and general-purpose OSes running on other processors. This technique works and leads to simpler, easier-to-verify hardware, but it imposes a large software overhead in terms of complexity and run-time cost.
The C6600 family of DSPs uses a static address translation scheme, MPAX (Memory Protection and Address eXtension), to map each DSP's 32-bit virtual address space into the system's 36-bit address space. The MPAX unit is limited to providing 16 programmable address remap ranges. MPAX does not fit well with more traditional general-purpose OS memory management. The MPAX translation happens after the cache hierarchy, so the caches effectively cache virtual addresses. This makes it very expensive to dynamically update the MPAX mappings. Thus software usually employs a static mapping for each DSP. While this allows isolating individual DSPs from each other, it doesn't easily allow isolating different tasks on the same DSP. Future application workloads will not only want to put more tasks on the DSP, but also have those tasks communicate directly with tasks running under a traditional virtual memory operating system, such as one running on a traditional general-purpose processor. Larger systems might even want to add virtualization so that multiple virtual machines need to interact with all of the DSPs.
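The range-based remapping described above can be illustrated with a small conceptual model. This is a sketch only: the class names, the field layout, and the first-match lookup rule are hypothetical and do not reflect the actual MPAX register format, permission bits, or priority rules. Only the 16-range limit and the 32-bit and 36-bit address widths come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_RANGES = 16   # from the text: 16 programmable remap ranges
VIRT_BITS = 32    # from the text: DSP virtual address width
PHYS_BITS = 36    # from the text: system address width


@dataclass(frozen=True)
class RemapRange:
    """Hypothetical remap entry: a virtual window and its system base."""
    virt_base: int
    size: int
    phys_base: int


class StaticRemapper:
    """Conceptual model of a fixed table of address remap ranges."""

    def __init__(self, ranges: List[RemapRange]):
        if len(ranges) > MAX_RANGES:
            raise ValueError("too many remap ranges")
        for r in ranges:
            if r.virt_base + r.size > (1 << VIRT_BITS):
                raise ValueError("range exceeds virtual address space")
            if r.phys_base + r.size > (1 << PHYS_BITS):
                raise ValueError("range exceeds system address space")
        self._ranges = list(ranges)

    def translate(self, vaddr: int) -> Optional[int]:
        """Return the system address, or None if no range matches.

        ASSUMPTION: the first matching range wins; real hardware may
        resolve overlapping ranges differently.
        """
        for r in self._ranges:
            if r.virt_base <= vaddr < r.virt_base + r.size:
                return r.phys_base + (vaddr - r.virt_base)
        return None


# Illustrative mapping: a 256 MiB virtual window placed above the 4 GiB line.
remapper = StaticRemapper([RemapRange(0x8000_0000, 0x1000_0000, 0x8_0000_0000)])
mapped = remapper.translate(0x8000_1234)
unmapped = remapper.translate(0x0000_1000)
```

With so few ranges and a table that is costly to change at run time, a model like this maps naturally onto one fixed layout per DSP, but not onto the per-task page tables a general-purpose OS would maintain.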