It has been recognized that the performance of embedded central processing units (CPUs) is impaired by the need for off-chip memory storage devices. Embedded CPUs typically take the form of a single-chip processor having some peripheral components on the processor chip, while the memory that stores the processor's instructions and data is typically located off the processor chip. Accordingly, both processor instructions and data must often be read across a common bus (e.g., a “Von Neumann” bus) from an off-chip memory storage device. These shared-bus accesses consume the bulk of the critical path for operations such as instruction decode and memory-to-register data transfers.
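The shared-bus cost described above can be illustrated with a simple accounting sketch. The function below is a hypothetical model, not a measurement of any particular CPU: the per-iteration instruction count and the two data cycles per word are assumptions chosen only to show that, on a single Von Neumann bus, instruction fetches are serialized with the data traffic they control.

```c
#include <assert.h>

/* Illustrative bus-cycle accounting for a naive n-word memory copy over a
   single shared (Von Neumann) bus.  Assumptions (for illustration only):
   every instruction fetch and every data access costs one bus cycle, and
   each copied word requires `insns_per_iter` instruction fetches (load,
   store, loop control) plus one data read and one data write, all
   serialized on the same bus. */
static unsigned bus_cycles_per_copy(unsigned words, unsigned insns_per_iter) {
    unsigned fetch_cycles = words * insns_per_iter; /* instruction traffic  */
    unsigned data_cycles  = words * 2u;             /* data read + write    */
    return fetch_cycles + data_cycles;
}
```

Under these assumptions, a 100-word copy with a 4-instruction loop body spends 400 of its 600 bus cycles on instruction fetches rather than on the data being moved, which is the overhead the design goals below aim to reduce.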
Moreover, off-chip memory storage devices add to the dollar cost of systems incorporating such CPUs. Furthermore, off-chip memory storage devices consume valuable printed circuit board (PCB) real estate, which is often at a premium in form factors such as mini-PCI (Peripheral Component Interconnect) and PCMCIA (Personal Computer Memory Card International Association) cards.
In addition, off-chip memory accesses greatly increase power consumption, which in turn shortens battery life in wireless applications.
In view of the foregoing observations, it has been recognized that an ideal design for processing memory transfer instructions should seek to: (a) minimize the number of off-chip instruction fetch accesses, to improve data transfer speed and reduce power consumption; and (b) reduce the size of the off-chip memory storage device, to minimize both real-estate area and production costs. These objectives are particularly important for wireless applications that use embedded CPUs.
One way in which the prior art has attempted to address the foregoing problems is by providing an on-chip cache memory. One drawback of this approach is that it significantly increases the cost of producing the embedded CPU (e.g., of producing an ASIC). Depending on the application, other drawbacks may include poor locality of reference (which defeats the cache), cache-coherence problems, and thrashing.
Another prior art solution has been the use of dual busses, namely a separate instruction bus and data bus (i.e., a “Harvard” bus architecture). One drawback of this approach is that a dual-bus system is too power-hungry and expensive for many embedded CPU applications. Furthermore, the pins needed to provide dual busses are often not available.
Other prior art approaches include the use of a “looping” execution method and an “unrolling” execution method to increase data transfer speed. These and other approaches also have significant drawbacks, as will be discussed below.
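The “looping” and “unrolling” methods named above can be sketched as follows. This is a minimal illustration under the assumption that both methods are ordinary software memory-copy routines; the function names are hypothetical and do not come from the prior art references themselves.

```c
#include <assert.h>
#include <stddef.h>

/* "Looping" transfer: one word moved per iteration.  On a shared
   (Von Neumann) bus, every iteration re-fetches the loop-control and
   branch instructions, and that instruction traffic competes with the
   data transfer itself. */
static void copy_looped(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* "Unrolled" transfer: four words moved per iteration, so loop-control
   instructions are fetched one quarter as often.  The trade-off is a
   larger code footprint, i.e., more instruction memory to fetch from. */
static void copy_unrolled4(int *dst, const int *src, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; i++)  /* handle the remainder when n is not a multiple of 4 */
        dst[i] = src[i];
}
```

The sketch makes the trade-off concrete: unrolling reduces per-word instruction fetch traffic at the cost of code size, which is itself stored in the off-chip memory the design goals above seek to shrink.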