In existing communication systems, it is common to use hardware accelerators or coprocessors for various processing functions that access a common or shared memory area. Most of these accelerators or coprocessors have critical real-time requirements on their accesses to the shared memory. The shared memory is therefore usually positioned on the same die as the accelerators or coprocessors to allow for very fast accesses (for example, one clock cycle).
As data rates increase (for example, Long Term Evolution (LTE) Category 4 can have an aggregated throughput of 200 Mbit/s), the memory required to maintain a desired system throughput also increases. The physical memory size starts to become a prohibitive cost factor. It therefore becomes necessary to move the memory into external components that may have access times of 20-30 clocks or greater. Burst access is usually employed to reduce the average access time for sequential data.
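As a rough illustration of why burst access helps, the following sketch computes the average number of clocks per word for one burst. It assumes a simple latency model that is not taken from the text above: the first word of a burst costs the full external access latency, and each further sequential word costs one additional clock. The function name, the one-clock streaming cost, and the burst length of 8 are hypothetical choices made only for this example.

```python
def average_access_clocks(first_word_latency: int, burst_length: int,
                          clocks_per_extra_word: int = 1) -> float:
    """Average clocks per word for one burst under the assumed model."""
    if burst_length < 1:
        raise ValueError("burst_length must be at least 1")
    # The first word pays the full latency; the remaining words stream out
    # at the assumed per-word cost.
    total = first_word_latency + (burst_length - 1) * clocks_per_extra_word
    return total / burst_length


# Assumed numbers: 20-clock external latency, single access vs. a burst of 8.
single = average_access_clocks(20, 1)   # 20.0 clocks per word
burst8 = average_access_clocks(20, 8)   # (20 + 7) / 8 = 3.375 clocks per word
```

Under these assumptions the burst lowers the average cost per sequential word from 20 clocks to about 3.4 clocks, although it remains slower than a one-clock on-die access.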
In a communication device or system in which multiple clients (for example, 16 clients) burst access an external memory, a local memory area is usually allocated to store and assemble the burst requests. Since this memory area is localized inside each client (for example, inside an acceleration block), it cannot be shared between the multiple clients. Thus, power and area (and therefore cost) are wasted in fetching data that may already have been read from the external memory. Data coherency also becomes a significant issue: more power is dissipated in writing data back to the memory so that another client can read it, and an inter-lock mechanism may be required between different processing blocks. There is also considerable added complexity for each client in controlling the burst requests, and hence more area and cost are incurred.
These problems can be solved by employing a centralized multi-port memory controller with a cache-based approach. Each client has straightforward, direct random access to the same coherent, large virtual on-chip RAM space. This is implemented using a cache memory (a minimal amount of real on-chip RAM) that provides indirect access to the large external memory space, which may be accessed in bursts to maintain high performance. A key issue to overcome with conventional cache allocation approaches is that the memory controller is multi-ported (e.g. 16 clients or more), which means that prioritized memory accesses from different clients are unlikely to maintain any spatial (sequential) coherency. This may cause significant problems for classic cache operation because spatial and/or temporal locality of access cannot be relied upon to prevent the discard of a given cache line. Both system latency and system power consumption may suffer as a result of memory access cycles wasted by cache thrashing.
One way of solving this problem is to allocate much more memory to the cache, but this is costly and wasteful.
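The thrashing effect, and the cost of curing it with a larger cache, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the controller described here: it models a fully associative cache with least-recently-used (LRU) replacement as a stand-in for a conventional cache, gives each client its own sequential stream in a separate address region, and arbitrates between clients in simple round-robin order. The line size, line counts, region spacing, and the `hit_rate` function are all hypothetical.

```python
from collections import OrderedDict

WORDS_PER_LINE = 8        # assumed cache line size, in words
REGION_WORDS = 1 << 16    # assumed spacing between client address regions


def hit_rate(num_clients: int, num_lines: int,
             words_per_client: int = 64) -> float:
    """Hit rate of an LRU cache shared by clients reading sequentially."""
    cache = OrderedDict()  # line tag -> None, oldest entry first
    hits = 0
    total = 0
    for i in range(words_per_client):
        for client in range(num_clients):  # round-robin arbitration (assumed)
            tag = (client * REGION_WORDS + i) // WORDS_PER_LINE
            total += 1
            if tag in cache:
                hits += 1
                cache.move_to_end(tag)         # mark as most recently used
            else:
                if len(cache) >= num_lines:
                    cache.popitem(last=False)  # evict least recently used line
                cache[tag] = None              # models a burst fill of the line
    return hits / total


one_client = hit_rate(num_clients=1, num_lines=8)         # 0.875
sixteen_clients = hit_rate(num_clients=16, num_lines=8)   # 0.0
bigger_cache = hit_rate(num_clients=16, num_lines=16)     # 0.875
```

In this model a single sequential client hits on seven of every eight words, whereas 16 interleaved clients sharing the same 8 lines evict each other's lines before any of them can be reused, so every access becomes an external burst fetch. Doubling the cache to 16 lines restores the hit rate, but only by paying for twice the on-chip RAM.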