FIG. 1(A) is a partial perspective view showing a system 100 including an embedded processor (or embedded controller) 110 and an external memory device 120 mounted on a suitable substrate 101 (e.g., a printed circuit board). Note that embedded processor 110 and external memory device 120 are discrete components (i.e., separately fabricated and packaged) such that communication between the devices is transmitted over external connections (e.g., copper traces 115 formed on substrate 101). System 100 may also include one or more additional devices 130 (e.g., sensor or actuator circuits) that are connected to embedded processor 110 by corresponding connections provided on substrate 101. Embedded processor 110 and external memory device 120 cooperate to perform a specific control function (i.e., as opposed to general purpose computing) within system 100. For example, embedded processor 110 may execute a program stored on external memory device 120 and, in response, generate control signals that control one or more components of system 100 (e.g., functions executed by devices 130).
FIG. 1(B) is a partial perspective view showing an alternative system 100A including an embedded processor 110A and an external memory device 120A. As in the previous example, embedded processor 110A and memory device 120A are discrete components. However, system 100A utilizes a “die-to-die” hybrid package arrangement in which memory device 120A is mounted directly onto processor 110A using known techniques. Note that this alternative arrangement also requires external connections between processor 110A and memory device 120A (e.g., via solder bump/contact pad connections), and otherwise operates essentially the same as system 100. Therefore, in the following discussion, references to system 100 are understood to address similar structures of system 100A.
FIG. 2 is a block diagram showing a portion of system 100 in additional detail. Embedded processor 110 includes a processor core 210, a program counter 220, an instruction cache 230, and several additional circuit structures (omitted for brevity) that are integrated in a System-On-Chip (SoC) arrangement. Processor core 210 executes a program that is at least partially stored on memory device 120. This program includes a sequence of instructions representing an algorithm that defines the specific control function performed by system 100. Program counter 220 stores an instruction address value that is used to call or “fetch” the next instruction in the program's instruction sequence for loading into processor core 210. Instruction cache 230 is used to temporarily store previously used instructions in order to facilitate faster processing. That is, the first time an instruction is called (i.e., its address appears in program counter 220), the instruction must be read from external memory device 120 and then loaded into processor core 210, which requires a relatively long time to perform. During this initial loading process, the instruction is also stored in a selected memory location of cache 230. When the same instruction is subsequently called (i.e., its address appears a second time in program counter 220), the instruction is read from cache 230 in a relatively short amount of time (i.e., assuming its associated memory location has not been overwritten by another instruction). Note that the number of instructions stored in cache 230 is determined by the size (i.e., number of memory locations) of cache 230, and therefore the size of cache 230 typically determines the likelihood that a particular instruction will be quickly read from cache 230 (as opposed to the relatively long process of reading the instruction from external memory device 120).
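By way of illustration only, the fetch behavior described above may be modeled with a short sketch. The sketch assumes a direct-mapped cache organization, which the description does not specify, and the names InstructionCache, fetch and run_loop are hypothetical. It shows how the number of memory locations affects whether a repeatedly called instruction is still present in the cache when it is called again.

```python
class InstructionCache:
    """Toy model of an instruction cache (assumed direct-mapped, illustrative only)."""

    def __init__(self, num_locations):
        self.num_locations = num_locations
        # Each memory location remembers the address of the instruction it holds.
        self.stored_address = [None] * num_locations
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        """Return True on a cache hit, False on a cache miss."""
        index = address % self.num_locations  # assumed mapping of address to location
        if self.stored_address[index] == address:
            self.hits += 1
            return True
        # Miss: the instruction would be read from external memory and is also
        # stored in the cache, possibly overwriting another instruction.
        self.stored_address[index] = address
        self.misses += 1
        return False


def run_loop(cache, loop_addresses, iterations):
    """Fetch the same instruction sequence repeatedly; return (hits, misses)."""
    for _ in range(iterations):
        for address in loop_addresses:
            cache.fetch(address)
    return cache.hits, cache.misses


loop = list(range(8))  # a hypothetical program loop of 8 instructions
small_result = run_loop(InstructionCache(num_locations=4), loop, iterations=10)
large_result = run_loop(InstructionCache(num_locations=8), loop, iterations=10)
print("4 locations (hits, misses):", small_result)
print("8 locations (hits, misses):", large_result)
```

In this model, the 4-location cache misses on every fetch because each instruction is overwritten before it is called again, whereas the 8-location cache misses only during the first pass through the loop.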
Like all other present-day processor devices, embedded processors have benefited from advances in semiconductor fabrication technology to provide increasingly greater performance and operating frequency (MHz). However, for cost reasons, many electronic systems incorporating embedded processors are forced to use inexpensive, relatively slow external memory devices to store associated program instructions (usually FLASH type memory devices, typically several MBytes in size). At one point the operating frequencies of embedded processors and inexpensive external memory devices were well matched. However, more recently, the operating frequencies of embedded processor cores have increased significantly (e.g., to approximately 400 MHz), while the operating (read/write) frequencies of inexpensive external memories have remained relatively slow (e.g., approximately 40 MHz). Referring to FIG. 2, each time an instruction is not stored in cache 230 (referred to herein as a “cache miss”) and must be read from external memory device 120 (a “miss fetch”), processor core 210 must stop and wait until the miss fetch process is completed (a “fetch return”). When the operating frequency of the embedded processor core is substantially faster than that of the external memory, the penalty time (i.e., the unused processing cycles) associated with each cache miss significantly reduces the effective operating speed of the embedded processor. For example, assuming a processor core executes one instruction per clock cycle (a.k.a. “clock”) and a cache miss penalty time of 100 clocks, then even if the cache miss rate is only 1% (i.e., one cache miss per 100 instructions), 200 clocks are required to complete 100 instructions (100 clocks of execution plus 100 clocks of penalty), which reduces the effective processor efficiency to 50%. Further, in reality, a 1% cache miss rate is optimistic, and realistic cache miss rates are typically much higher, thereby further reducing effective processor efficiency.
Consequently, even when an embedded processor includes a relatively large and fast instruction cache, the effective performance of a 400 MHz processor core is about the same as a 40 MHz core because the cache miss penalty time is so large.
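The cache miss penalty arithmetic discussed above may be sketched as follows, for illustration only. The sketch assumes the simple model stated in the example (one clock per instruction plus a fixed penalty per cache miss); the function names total_clocks, efficiency and effective_mhz are hypothetical, and the 9% miss rate is chosen here only to show where this model yields the 40 MHz equivalence, not taken from measured data.

```python
def total_clocks(instructions, misses, miss_penalty_clocks):
    """Clocks needed when each instruction takes one clock and each miss adds a penalty."""
    return instructions + misses * miss_penalty_clocks


def efficiency(instructions, misses, miss_penalty_clocks):
    """Fraction of clocks spent executing instructions rather than waiting."""
    return instructions / total_clocks(instructions, misses, miss_penalty_clocks)


def effective_mhz(core_mhz, instructions, misses, miss_penalty_clocks):
    """Core frequency scaled by the efficiency computed under the same model."""
    return core_mhz * instructions / total_clocks(instructions, misses, miss_penalty_clocks)


# Example from the text: 100 instructions, 1 miss (1%), 100-clock penalty.
clocks_at_1_percent = total_clocks(100, 1, 100)
efficiency_at_1_percent = efficiency(100, 1, 100)

# Assumed for illustration: 9 misses per 100 instructions on a 400 MHz core.
mhz_at_9_percent = effective_mhz(400, 100, 9, 100)

print(clocks_at_1_percent, efficiency_at_1_percent, mhz_at_9_percent)
```

Under these assumptions, the 1% case reproduces the 200 clocks and 50% efficiency of the example, and a miss rate of about 9% would reduce a 400 MHz core to the effective throughput of a 40 MHz core.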
Conventional approaches to solving the cache miss penalty problem described above typically involve increasing the size of the instruction cache, using memory overlays, or providing large amounts of memory on the embedded processor chip. However, increasing the size of the cache memory increases the embedded processor cost, and only partly solves the cache miss penalty problem. That is, a larger cache increases the overall size of the embedded processor, thereby reducing production yield (e.g., chips per wafer) and thus increasing the cost per embedded processor. Further, as set forth in the example above, cache misses will periodically occur no matter how large the cache, with each cache miss incurring a significant cache miss penalty (on the order of 100 clocks), so the effective performance of the embedded processor remains far below that implied by its maximum operating frequency. Therefore, the only sure way to completely avoid the cache miss penalty is to store all program instructions on-chip (i.e., eliminate the external memory device completely). However, this further increases the chip size (and hence the chip cost), and significantly increases operating power. In the highly competitive industries that utilize low cost embedded processors, such high cost, high power alternatives are rarely considered acceptable.
Hence there is a need for an embedded processor and associated method that address the cache miss penalty problem (defined above) without significantly increasing the cost and power consumption of the embedded processor.