Memory latency is a major performance bottleneck in modern computer systems. Technology forecasts predict that the memory latency problem will worsen as the disparity between processor core speeds and memory speeds increases.
A common computer architecture technique for reducing the effective memory latency incurred by processors is caching. Processors can access cache memory much faster than main memory because the cache is physically closer to the processor, is much smaller than main memory, and is commonly implemented in a faster device technology. Fundamentally, caches reduce memory latency by exploiting the temporal and spatial locality present in the memory accesses of application programs. The latency incurred in retrieving a memory block (or cache line) from main memory into the cache is amortized over the number of times that the cache line is reused due to locality.
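The amortization argument can be made concrete with a small calculation. The following is a minimal sketch, not a model of any particular processor: the function name `effective_latency` and the cycle counts in the usage lines are illustrative assumptions, and `miss_cycles` is assumed to be the full latency of the one access that misses.

```python
def effective_latency(hit_cycles, miss_cycles, accesses_per_line):
    """Average cycles per access when one line fill serves several accesses.

    Assumption: the first access to the line misses and costs miss_cycles;
    every later access to the same line hits and costs hit_cycles.
    """
    if accesses_per_line < 1:
        raise ValueError("accesses_per_line must be at least 1")
    total_cycles = miss_cycles + (accesses_per_line - 1) * hit_cycles
    return total_cycles / accesses_per_line


# Illustrative numbers only: a 2-cycle hit and a 200-cycle miss.
no_reuse = effective_latency(2, 200, 1)      # the miss is never amortized
high_reuse = effective_latency(2, 200, 100)  # the miss is spread over 100 accesses
```

With no reuse, the average access costs the full miss latency; as reuse grows, the average approaches the hit latency.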
Since caches are implemented in hardware, they use simple hardware mechanisms to exploit locality. First, they are typically organized in multi-byte cache lines in order to exploit spatial locality. Second, they use least-recently-used (LRU) or pseudo-LRU replacement policies to exploit temporal locality by keeping the most recently used cache lines in the cache. Unfortunately, a simple LRU replacement policy might not be effective for some applications, owing to a combination of the size and organization (e.g., degree of associativity) of the cache and the memory access patterns of the application programs running on the computer system.
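One well-known case in which LRU replacement performs poorly is a loop that repeatedly sweeps over slightly more lines than the cache can hold. The sketch below illustrates this with a fully associative cache modeled in software; the function name `lru_hits` and the trace sizes are assumptions chosen for illustration, and a real cache would also be constrained by its set associativity.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count hits for a fully associative cache with LRU replacement."""
    cache = OrderedDict()  # keys are line addresses, ordered oldest to newest
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits


fits = list(range(4)) * 10       # 10 sweeps over 4 lines
too_large = list(range(5)) * 10  # 10 sweeps over 5 lines

print(lru_hits(fits, capacity=4))       # 36: only the first sweep misses
print(lru_hits(too_large, capacity=4))  # 0: each line is evicted just before reuse
```

Adding a single line to the working set takes the hit count from 36 of 40 accesses to zero, because LRU always evicts the line that the loop will need next.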
Another technique, used in addition to caches to hide memory latency, is prefetching. Hardware or software prefetching techniques predict which cache lines the processors will access in the near future and would miss in the cache(s), and issue early memory fetches for those lines in anticipation of the processor access. This eliminates or reduces the memory latency stalls that the processor incurs while a cache miss is being serviced. Unfortunately, compiler algorithms or programmers might find it difficult to predict the memory access behavior of an application at cache line granularity in order to schedule software prefetches sufficiently in advance of an access. Hardware prefetch techniques usually implement simple memory access pattern detection mechanisms, so they tend to be ineffective for access patterns that those mechanisms do not recognize.
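The limitation of simple pattern detection can be illustrated with a toy stride detector. This is a sketch under stated assumptions rather than a description of any actual hardware prefetcher: the function name `stride_prefetch_coverage`, the detection rule (prefetch once the same non-zero stride is seen twice in a row), and the unbounded prefetch set are all simplifications introduced here.

```python
def stride_prefetch_coverage(trace):
    """Count accesses whose address was prefetched before it was used.

    Assumed rule: when the stride between consecutive accesses repeats,
    prefetch the next address along that stride.
    """
    prefetched = set()  # simplification: prefetched lines are never evicted
    last_addr = None
    last_stride = None
    covered = 0
    for addr in trace:
        if addr in prefetched:
            covered += 1
        if last_addr is not None:
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetched.add(addr + stride)
            last_stride = stride
        last_addr = addr
    return covered


sequential = list(range(20))              # constant stride of 1
irregular = [0, 7, 3, 12, 5, 18, 2, 9]    # no repeating stride

print(stride_prefetch_coverage(sequential))  # 17: all but the 3 warm-up accesses
print(stride_prefetch_coverage(irregular))   # 0: the detector never locks on
```

The detector covers nearly every access in the sequential trace and none in the irregular one, which is the kind of gap that motivates giving software more control.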
For some applications, it is possible to overcome the aforementioned limitations of cache replacement algorithms and of traditional hardware or software prefetch techniques, and thereby to reduce and/or tolerate memory latency, by giving application programs control over the placement of data in on-chip memory through cooperative application and operating system support. In some applications, although it may be difficult to understand the caching behavior at the cache line level, information about the critical data structures that need to be kept close to the processor may be readily available. In such cases, application programs can explicitly issue block prefetch requests into an on-chip memory to ensure that the affected memory locations are quickly accessible. Alternatively, applications may issue hints to the operating system requesting that such structures be allocated in on-chip memory. Applications can also explicitly free memory allocated in the on-chip memory banks, or issue hints to the operating system to do so.
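The three kinds of request described above (explicit block prefetch, allocation hint, and explicit free) can be sketched as a toy software model. Everything in this sketch is hypothetical: the class `OnChipMemoryManager`, its method names, and its capacity-only admission rule are assumptions made for illustration and do not correspond to an interface defined in this document.

```python
class OnChipMemoryManager:
    """Toy model of an on-chip memory region managed on behalf of applications.

    Hypothetical interface: names, signatures, and policy are assumptions.
    """

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.resident = {}  # data structure name -> size in bytes

    def used(self):
        return sum(self.resident.values())

    def prefetch_block(self, name, size):
        """Explicit request: place the block on chip or raise MemoryError."""
        if name in self.resident:
            return  # already resident, nothing to do
        if self.used() + size > self.capacity:
            raise MemoryError(f"no room in on-chip memory for {name!r}")
        self.resident[name] = size

    def hint_allocate(self, name, size):
        """Hint: the manager may decline. Returns True if the hint was honored."""
        try:
            self.prefetch_block(name, size)
        except MemoryError:
            return False
        return True

    def free(self, name):
        """Explicitly release a structure; freeing an absent name is a no-op."""
        self.resident.pop(name, None)


ocm = OnChipMemoryManager(capacity_bytes=1024)
ocm.prefetch_block("index", 768)              # explicit block prefetch succeeds
declined = ocm.hint_allocate("log", 512)      # hint is declined: not enough room
ocm.free("index")                             # application releases the space
honored = ocm.hint_allocate("log", 512)       # the same hint is now honored
```

The distinction modeled here is that an explicit request either succeeds or fails visibly, whereas a hint leaves the final placement decision to the operating system or runtime system.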
Therefore, there is a need for methods and arrangements for reducing memory latency incurred by processors by providing a measure of control over on-chip memory (OCM) management to software applications, implicitly and/or explicitly, via an operating system or runtime system.