FIG. 1 shows a block diagram of a prior art CPU-based system 100 implementing a cellular phone with video capability comprising the following modules: a CPU 110, a color LCD module 120, a camera 125, a keypad 130, a CPU memory 140, a modem 150 (implemented, for example, by a DSP), an RF transceiver 160, a flash memory 170, an EPROM 180, a SIM Card 190, and an audio codec 105. In operation, incoming video and/or graphics data is received through RF transceiver 160 and modem 150, and processed by CPU 110. CPU 110 transfers the video and/or graphics data to memory 140 for temporary storage during the processing. The processed data is output on LCD 120. Outgoing video and/or graphics data is received from camera 125 and processed by CPU 110. CPU 110 transfers the video and/or graphics data to memory 140 for temporary storage during the processing. The processed data is transmitted out through modem 150 and RF transceiver 160. Flash memory 170 and EPROM 180 store the CPU program and constant parameters.
In order to enable video capability, CPU 110 must be provided with software to handle video and graphics. Special instructions (often termed multi-media extension instructions) are designed for CPU 110, and video/graphics instructions are executed in series with other instructions. In addition, the clock frequency of CPU 110 must be increased compared to a CPU in a system without video capability so as to meet the real-time requirement. The increased clock frequency linearly increases power consumption.
An additional source of power loss relates to the data transfer across the input/output pins (interconnect) between the CPU 110 IC and the memory 140 IC. This kind of data traffic involves charge/discharge of a large capacitive load on the input/output buffers and therefore large power consumption. Quantitatively, the amount of power wasted over the input/output pins of memory IC 140 is given by:

$$P_I = P_{ID} + P_{IC}$$

where $P_{ID}$ is the power consumed over the data pins, and $P_{IC}$ is the power consumed over the control and address pins. Typically, the former is much larger than the latter, and

$$P_{ID} = C_{IO} \cdot V_{IO}^2 \cdot 0.5 \cdot B_{IO}$$

where
    $C_{IO}$ is the capacitive load of the external data bus, in Farads;
    $V_{IO}$ is the supply voltage of the I/O of the memory device, in Volts;
    $B_{IO}$ is the effective bandwidth of the transactions over the data bus, in bits/second.
For example, for a typical DRAM memory IC, the wasted power can be approximated as follows:
    $C_{IO} = 10 \times 10^{-12}$ Farad
    $V_{IO} = 2.5$ Volts
    $B_{IO} = 200 \times 10^{6}$ bits/sec, thus
    $P_{ID} = 10 \times 10^{-12} \times 2.5^2 \times 0.5 \times 200 \times 10^{6} = 6.25$ mW
and therefore $P_I > 6.25$ mW.
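The approximation above can be checked numerically. The following is a minimal sketch only: the function name `data_pin_power_watts` and its parameter names are illustrative assumptions, not part of any described system, and it evaluates just the data-pin term, ignoring the control and address pins.

```python
def data_pin_power_watts(c_io_farads, v_io_volts, b_io_bits_per_s):
    """Data-pin power: P_ID = C_IO * V_IO^2 * 0.5 * B_IO.

    Hypothetical helper for illustration; names are assumptions.
    """
    return c_io_farads * v_io_volts ** 2 * 0.5 * b_io_bits_per_s


# Values taken from the typical DRAM example in the text.
C_IO = 10e-12   # capacitive load of the external data bus, Farads
V_IO = 2.5      # I/O supply voltage, Volts
B_IO = 200e6    # effective data-bus bandwidth, bits/second

p_id = data_pin_power_watts(C_IO, V_IO, B_IO)
print(f"P_ID = {p_id * 1e3:.2f} mW")  # prints "P_ID = 6.25 mW"

# The formula is linear in bandwidth: doubling B_IO doubles P_ID.
p_id_double = data_pin_power_watts(C_IO, V_IO, 2 * B_IO)
print(f"P_ID at 2x bandwidth = {p_id_double * 1e3:.2f} mW")  # 12.50 mW
```

Because the expression is linear in the bandwidth term, any reduction in the traffic crossing the external data bus reduces this component of the power by the same proportion.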
In another prior art system 200 of a cellular phone with video capability illustrated in FIG. 2, a signal processing core 215 can be embedded on the same die with a CPU 210, with core 215 handling the video and graphic tasks. Because the video and graphic tasks are handled by core 215, the clock frequency of CPU 210 need not be increased beyond the frequency of a CPU for a cellular phone without video capability, and therefore the power that would have been wasted by the increased frequency is conserved. However, system 200 is nevertheless not very efficient from the power consumption standpoint because the IC including CPU 210 and core 215 still exchanges a lot of data with memory 140 and therefore consumes a lot of power (see above approximation) across the interconnect between the CPU 210 IC and memory 140 IC.
FIG. 3A illustrates another prior art system 300 of a cellular phone with video capability. A general purpose or application specific digital signal processor (DSP) 385 is placed external to a CPU 310. DSP 385 handles the video and graphics tasks while CPU 310 handles the other tasks. Therefore there is no requirement to increase the clock frequency of CPU 310 compared to a CPU in a cellular phone without video capability. However DSP 385 requires an additional memory 395, for example a DRAM or RAM 395 which can either be embedded within DSP 385 or placed as an off-the-shelf memory IC external to DSP 385 and connected to DSP 385.
FIG. 3B shows a similar system 320 with application specific CPU 380 and an SDRAM 390 replacing DSP 385 and (D)RAM 395.
Both systems 300 and 320 have an increased IC count compared to systems 100 and 200 and therefore an increased size and cost. Systems 300 and 320 are also not power efficient because data has to be moved between CPU 310 and memory 140, between CPU 310 and DSP 385 or application specific CPU 380, and between DSP 385 or application specific CPU 380 and memory 395 or 390, consuming a lot of power. The data transfers between DSP 385 or application specific CPU 380 and memory 395 or 390 are typically (although not necessarily) the highest traffic of the data transfers listed above for systems 300 and 320, and therefore the most wasteful in power, because video compression/decompression algorithms require multiple accesses to the data.
There are also related art systems which include a processor embedded in CPU memory.
U.S. Pat. No. 6,026,478 to Dowling describes a VLIW (very long instruction word) processor that is connected to an embedded DRAM VLIW extension processor, which also functions as the DRAM of the VLIW CPU. U.S. Pat. No. 6,026,478 partitions the allocation of tasks between the CPU and the embedded DRAM processor at the instruction level.
As a result of splitting the program at the instruction level, the amount of data (where “data” includes instructions) exchanged between the CPU and the embedded DRAM processor is decreased by a certain amount with respect to a regular CPU-DRAM paradigm. The disadvantage of the system is its required complexity. For example, the embedded memory processing unit needs to sense the stream of instructions that are executed at the same time by the CPU, and the embedded memory processing unit needs to share the instruction caching of the CPU. In addition, there are still significant data exchanges between the CPU and the DRAM in order to synchronize the execution at the instruction level.
U.S. Pat. No. 5,396,641 to Iobst et al. describes single instruction multiple data (SIMD) and multiple instruction multiple data (MIMD) processors integrated with DRAM. Each type of processor has an external DRAM interface (i.e., the DRAM can be accessed as a common DRAM). However, there are extra control lines for operating the embedded processor from an external “host”.
Moreover the invention disclosed in U.S. Pat. No. 5,396,641 does not support simultaneous internal processing and external data transfers. An internal computation cycle can only take place instead of a memory access cycle. This approach makes it impossible to use the embedded DRAM processor at the same time as the CPU uses the embedded DRAM as its main memory.
U.S. Pat. No. 5,678,021 to Pawate et al. discloses a smart memory that includes a data storage and a processing core for executing instructions stored in the data storage area. Externally, the smart memory is directly accessible as a standard memory device. However the smart memory does not support simultaneous internal processing and external data transfers. An internal computation cycle can only take place instead of a memory access cycle. This approach makes it impossible to use the processing core at the same time as the CPU uses the data storage as its main memory.
What is needed in the art is a logic embedded memory where the memory can be accessed simultaneously by an embedded ASSPU and by a main processing unit.