Various mechanisms exist for processor optimization. Specifically, many processors are designed to use some sort of caching mechanism. Caching mechanisms exhibit problems due to memory access delays. It is often the case that increasing the capacity of a memory storage device increases the delays associated with it. Thus, accesses to larger memory result in longer delays. The memory access delays have an impact on processor execution performance. Most applications have the characteristic that only a small set of data is accessed frequently over the execution of the program. Basically, the frequently accessed data is brought “closer” to the processor, i.e., into a hardware cache.
Some special embedded processors, for instance, an IXP network processor available from Intel Corporation, do not have a hardware cache. These embedded processors may be found in cellular phones, MP3 players, and other devices. For these embedded processors, the cost of including a cache on die may be prohibitive. Network processors tend to process many disparate packets of information. Each packet may be processed individually. To gain more throughput/bandwidth, the processor die may be allocated among many processor elements, where each packet may be processed by a different embedded processor. It may be preferable to have additional processors rather than implement a hardware cache on these systems. It is also believed that there is less locality in network applications than in other applications. Thus, there is less “frequently used” data to be put into a cache.
For general applications, designers lean toward including a hardware cache rather than a software cache. In existing systems, software caching typically does not perform as well as hardware caching. Some researchers have attempted to utilize on-chip memory in conjunction with some software control. For instance, one article describes that “some digital signal processing (DSP) chips have a small, fast, on-chip memory that the programmer can use to improve access times. These on-chip memories are not caches; instead, they are located in a disjoint address space. This simplifies their implementation by eliminating any need for relating on-chip memory addresses to off-chip memory addresses, for associative lookup, and for automatic replacement. Instead, the designers make the programmer responsible for moving data between main memory and the on-chip memory in a timely and efficient way.” [See K. D. Cooper and T. J. Harvey, “Compiler-Controlled Memory,” In Proceedings of ASPLOS-VIII, San Jose, Calif., October 1998.] The methods discussed in this research present an alternative use for a small portion of on-chip memory as a holding place for spilled values, or a small compiler-controlled memory (CCM).
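The programmer-managed on-chip memory described in the quoted passage can be illustrated with a minimal C sketch. It is not taken from the cited article: the buffer `onchip`, its size `ONCHIP_WORDS`, and the function `sum_staged` are hypothetical names, and an ordinary static array stands in for the on-chip memory. The point is only that, with a disjoint address space and no automatic replacement, the program itself copies each chunk in before using it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed capacity of the fast on-chip memory, in 32-bit words. */
#define ONCHIP_WORDS 64u

/* Stand-in for the on-chip memory. On a real DSP this would live in
 * its own address space; here it is just a static array. */
static uint32_t onchip[ONCHIP_WORDS];

/* Sum n words that reside in main memory, staging them through the
 * on-chip buffer one chunk at a time. There is no lookup and no
 * automatic replacement: the code decides what is resident and when. */
static uint64_t sum_staged(const uint32_t *main_mem, size_t n)
{
    uint64_t total = 0;

    for (size_t base = 0; base < n; base += ONCHIP_WORDS) {
        size_t chunk = (n - base < ONCHIP_WORDS) ? (n - base) : ONCHIP_WORDS;

        /* Explicit copy-in from main memory to on-chip memory. */
        memcpy(onchip, main_mem + base, chunk * sizeof onchip[0]);

        /* All accesses in the inner loop go to the fast buffer. */
        for (size_t i = 0; i < chunk; i++) {
            total += onchip[i];
        }
    }
    return total;
}
```

Calling `sum_staged(data, count)` on an array in main memory returns the same total as a direct loop; the difference is only in where the inner-loop reads are served from.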
Existing processor architectures utilize a memory hierarchy that extends from small but fast memory, such as caches, near the processor all the way to large but slow memory, such as dynamic random access memory (DRAM) or disk drives. This design provides a large memory space while minimizing memory access latencies for frequently accessed data.
Specialized processor architectures have become increasingly important for applications with specific needs. Intel IXP processors, as discussed above, for example, may be embedded within routers to process packets. Because individual packets can be processed independently of others, one IXP processor contains many lightweight, multi-threaded microengine (ME) cores that may be dedicated to the task of processing packets. In Intel platforms using XScale® technology, there is an XScale® core to process control-plane code. XScale® is an embedded microprocessor architecture derived from StrongARM technology. The software cache tries to cache data that is frequently read and infrequently written. The memory hierarchy in the IXP includes a small but fast local memory in each ME, and scratchpad memory, static RAM (SRAM), and DRAM (with increasing access latencies) shared among all the MEs.
These MEs have been designed without caches to minimize the size of each core, and because it has been believed that the packets in network applications have no spatial or temporal memory locality that would benefit from caching. This assumption arises from the notion that network processors read a packet once, do some work on the packet, and upon completion, simply send the packet along.
Software-controlled caching has been proposed as a way to benefit from data locality without hardware caches by utilizing the small but fast local memory. Here, hardware cache functionality can be emulated with software routines. Software-controlled caching may be implemented with limited features to minimize software overheads. For example, cache coherency, in which copies of the cached data in different MEs have the same values, is a necessary condition to guarantee correct execution in any application. However, in existing systems, it is expensive and inefficient to support cache coherency in software-controlled caching.
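As an illustration of emulating cache functionality with software routines, the following is a minimal C sketch of a direct-mapped, write-through software cache. It is a sketch under stated assumptions, not the implementation used on any particular processor: the names `sw_cache`, `sw_cache_read`, and `sw_cache_write`, the line and cache sizes, and the `backing` array standing in for slow shared memory are all hypothetical. Each `sw_cache` instance plays the role of one ME's local memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_WORDS    4u    /* words per cache line (assumed) */
#define NUM_LINES     8u    /* lines held in fast local memory (assumed) */
#define BACKING_WORDS 256u  /* size of the simulated shared memory */

/* Stand-in for slow shared memory such as SRAM or DRAM. */
static uint32_t backing[BACKING_WORDS];

typedef struct {
    int      valid;
    uint32_t tag;
    uint32_t data[LINE_WORDS];
} sw_line;

/* One software cache per processing element. */
typedef struct {
    sw_line  lines[NUM_LINES];
    unsigned hits;
    unsigned misses;
} sw_cache;

static void sw_cache_init(sw_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Read the word at word address addr through the software cache. */
static uint32_t sw_cache_read(sw_cache *c, uint32_t addr)
{
    assert(addr < BACKING_WORDS);
    uint32_t block = addr / LINE_WORDS;
    sw_line *l = &c->lines[block % NUM_LINES];
    uint32_t tag = block / NUM_LINES;

    if (!l->valid || l->tag != tag) {
        /* Miss: fetch the whole line from the slow memory. */
        c->misses++;
        memcpy(l->data, &backing[block * LINE_WORDS], sizeof l->data);
        l->tag = tag;
        l->valid = 1;
    } else {
        c->hits++;
    }
    return l->data[addr % LINE_WORDS];
}

/* Write-through store. Only this cache's copy is updated; copies held
 * in other caches are NOT invalidated, so they can become stale. */
static void sw_cache_write(sw_cache *c, uint32_t addr, uint32_t value)
{
    assert(addr < BACKING_WORDS);
    uint32_t block = addr / LINE_WORDS;
    sw_line *l = &c->lines[block % NUM_LINES];

    backing[addr] = value;
    if (l->valid && l->tag == block / NUM_LINES) {
        l->data[addr % LINE_WORDS] = value;
    }
}
```

With two instances of `sw_cache` that have both read the same address, a `sw_cache_write` through one of them leaves the other holding the old value. Closing that gap would require extra invalidation or update traffic on every write, which is one way to see why coherency support is costly in a software-controlled cache and why such caches favor data that is frequently read and infrequently written.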