1. Technical Field
The present invention relates to low-power, battery-operated vector processors, and more particularly to methods and devices for reducing the power consumed by dynamic random access memories to benefit portable, battery-powered equipment.
2. Description of the Prior Art
In the past, vector processors found their greatest uses in very expensive, energy-hungry supercomputers like those designed by Cray Research. Vector processors represent a subclass of single instruction multiple data (SIMD) systems that can use arrays of specially structured central processing units (CPU's) to act on large data vectors, rather than on single data items. It has only recently become practical to use vector processors in battery-powered portable equipment.
Digital signal processing techniques are now being employed in cellular telephone and other radio applications. Unfortunately, high CPU clock speeds increase power consumption and shorten battery life to unacceptable levels. Using vector processors to do the digital signal processing has been challenging, first because the overall power consumption of their data memories is high, and second because getting enough memory bandwidth at the processor interface has been difficult and expensive to implement.
At the current stage of semiconductor integrated circuit technology, it is now possible to integrate a fully functional vector processor with its main memory. For example, Duncan G. Elliott, W. Martin Snelgrove, and Michael Stumm commented on such an architecture in "Computational RAM: A Memory-SIMD Hybrid and its Application to DSP," IEEE Proceedings of the Custom Integrated Circuits Conference, pp. 30.6.1-30.6.4, Boston, Mass., May 1992. Computational RAM (C-RAM), as the authors refer to it, is conventional RAM with SIMD processors added to the sense amplifiers. Bit-serial, externally programmed processors added only a small amount of area to a prototype chip of theirs. When such processors were incorporated in a 32M byte memory, the combination was capable of an aggregate performance of 13 billion 32-bit operations per second. Such a chip is extendible and completely software programmable. The cited paper describes the C-RAM architecture, a working 8K bit prototype, a full scale C-RAM designed in a 4M bit DRAM process, and various C-RAM applications.
Duncan G. Elliott reported on his Internet website, at "http://nyquist.ee.ualberta.ca/~elliott/cram", that he has a doctoral thesis in preparation at the University of Alberta. He describes his work as being related to C-RAM. Processors are incorporated into the design of semiconductor random access memory to build an inexpensive massively-parallel computer. Mr. Elliott states that if an application contains sufficient parallelism, it will typically run orders of magnitude faster in C-RAM than on the central processing unit. His work includes architecture, prototype chips, a compiler, and applications. C-RAM integrates SIMD processors into random access memory at the sense amplifiers along one edge of a two-dimensional array of memory cells. The so-called "novel" combination of processors with memory allows C-RAM to be used as computer main memory, as a video frame buffer, and in stand-alone signal processing. The use of high-density commodity dynamic memory is claimed to make C-RAM implementations economical. Bit-serial, externally programmed processing elements add only slightly to the cost of the chip (9-20%). A working C-RAM with 64 processing elements per chip has been fabricated, and the processing elements for a 2048-processing element, 4M bit chip have been designed. The performance of C-RAM for kernels and real applications was obtained by simulating their execution. For this purpose, a prototype compiler was written. Applications are drawn from the fields of signal and image processing, computer graphics, synthetic neural networks, CAD, data base and scientific computing.
Single instruction multiple data (SIMD) machine systems often have 1,024 to 16,384 processing units that all may execute the same instruction on different data in lock-step. So, a single but very wide instruction can manipulate a large data vector in parallel. Examples of SIMD machines are the CPP DAP Gamma and the MasPar MP-2. Vector processors are generally regarded as SIMD machines, and examples of such systems include the Convex C410, and the Hitachi S3600.
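The lock-step behavior described above can be illustrated with a minimal Python sketch. The function name simd_execute and the lane count are hypothetical choices made for illustration only; the list comprehension runs sequentially, so it models only the semantics of one instruction acting on many data items, not the parallel hardware of any machine named above.

```python
import operator


def simd_execute(instruction, vector_a, vector_b):
    """Apply one instruction to every lane of two equal-length vectors.

    Each list position stands for one processing unit. The loop runs
    sequentially here; it models only the lock-step semantics, not the
    parallel hardware.
    """
    if len(vector_a) != len(vector_b):
        raise ValueError("operand vectors must have the same length")
    return [instruction(a, b) for a, b in zip(vector_a, vector_b)]


NUM_LANES = 1024  # assumed lane count, the low end of the range cited above
a = list(range(NUM_LANES))
b = [2] * NUM_LANES

# A single "multiply" instruction is issued once and acts on all lanes.
doubled = simd_execute(operator.mul, a, b)
```

Every lane receives the same instruction and differs only in its data, which is the defining property of an SIMD system.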
When the bandwidth between memory and a vector processor unit (VPU) is too small, the VPU has to wait for operands and/or has to wait before it can store results. When the ratio of arithmetic to load/store operations is not high enough to compensate, performance suffers severely. Since it has been very expensive to design high bandwidth datapaths between memory and VPU's, compromises are often sought. Prior art systems that have the full required bandwidth are very rare, e.g., ones that can do two load operations and a store operation at the same time.
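The effect of insufficient memory bandwidth can be sketched with a simple cycle estimate. The model below is an assumption made for illustration, not a description of any particular machine: it supposes that arithmetic and memory traffic overlap perfectly, so the slower of the two sets the total time. All function and parameter names are hypothetical.

```python
def vpu_cycles(n_elements, arith_ops_per_element, mem_ops_per_element,
               arith_per_cycle, mem_words_per_cycle):
    """Estimate cycles for a vector operation.

    Assumes arithmetic and memory accesses overlap perfectly, so the
    total is set by whichever of the two takes longer.
    """
    compute_cycles = n_elements * arith_ops_per_element / arith_per_cycle
    memory_cycles = n_elements * mem_ops_per_element / mem_words_per_cycle
    return max(compute_cycles, memory_cycles)


# c[i] = a[i] + b[i] needs one add, two loads, and one store per element.
N = 1024

# Full bandwidth: two loads and one store can proceed every cycle.
full_bandwidth = vpu_cycles(N, 1, 3, arith_per_cycle=1, mem_words_per_cycle=3)

# Narrow datapath: only one memory word moves per cycle.
narrow_bandwidth = vpu_cycles(N, 1, 3, arith_per_cycle=1, mem_words_per_cycle=1)

slowdown = narrow_bandwidth / full_bandwidth
```

Under these assumptions the narrow datapath makes the vector add three times slower, because the ratio of arithmetic to load/store operations (one to three) is too low to hide the memory traffic.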
In 1996, Aad J. van der Steen and Jack J. Dongarra, both of Rice University, wrote that the majority of vector processors no longer employ caches because their vector units cannot use caches to advantage. Vector execution speeds are often slowed due to frequent cache overflows. They also reported that all present-day vector processors use vector registers, even though in the past many vector processors loaded their operands directly from memory and immediately stored the results in memory, e.g., the CDC Cyber 205 and the ETA-10.
VPU's usually include a number of vector functional units, or "pipes," for particular functions. Pipes are also included for memory access to guarantee the timely delivery of operands to the arithmetic pipes and the storing of results in memory. Several arithmetic functional units are usually included for integer/logical arithmetic, floating-point addition, multiplication and/or compound operations. Division can be approximated in a multiply pipe. A mask pipe is often included to select subsets of vector elements that are to be used in vector operands.
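The role of a mask pipe can be illustrated with a short sketch. The function masked_vector_add and its "previous" argument are hypothetical names chosen for illustration; the sketch assumes that unselected elements simply keep their earlier value, which is one common convention and not necessarily that of any machine discussed here.

```python
def masked_vector_add(a, b, mask, previous):
    """Add a and b only where mask is true; elsewhere keep the previous value."""
    if not (len(a) == len(b) == len(mask) == len(previous)):
        raise ValueError("all vectors must have the same length")
    return [x + y if m else p for x, y, m, p in zip(a, b, mask, previous)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# The mask selects the subset of elements whose value in a is even.
mask = [x % 2 == 0 for x in a]

result = masked_vector_add(a, b, mask, previous=[0, 0, 0, 0])
```

Only the second and fourth elements are selected by the mask, so only those positions receive a new sum.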
Dynamic random access memories (DRAM's) are now the main type of memory device used in computer systems, at least in part, because their one-transistor per memory cell construction permits a maximum of memory storage to be designed onto a chip. Each memory cell uses a capacitor to store a voltage that represents a digital bit value. Because the capacitors are very small, a refresh must be periodically performed to rewrite each bit. Otherwise, the information written in the memory is lost due to drifts and leakage that occur in such circuits. Most such DRAM's use circuits that unavoidably destroy the data in each memory cell when it is read out. Thus, a write-back cycle is needed to return the data to its original condition for other accesses.
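The need for refresh and write-back can be illustrated with a toy model. The class DramRow below is hypothetical and greatly simplified: it assumes that a stored "1" is a charge that leaks away at a fixed rate and that a cell with no remaining charge reads as "0". Real cells, sense amplifiers, and retention times are considerably more complex.

```python
class DramRow:
    """Toy model of one DRAM row with leaky cells and destructive reads."""

    def __init__(self, bits, retention_ticks):
        self.retention_ticks = retention_ticks  # assumed time until a 1 is lost
        self._write(bits)

    def _write(self, bits):
        # A written 1 is stored as a full charge, a 0 as no charge.
        self.charge = [self.retention_ticks if b else 0 for b in bits]

    def tick(self, n=1):
        # Leakage: every cell loses charge as time passes.
        self.charge = [max(0, c - n) for c in self.charge]

    def read(self):
        bits = [1 if c > 0 else 0 for c in self.charge]  # sense the row
        self.charge = [0] * len(bits)                    # the read destroys the data
        self._write(bits)                                # write-back restores it
        return bits

    def refresh(self):
        # A refresh is a read whose result is discarded after write-back.
        self.read()


data = [1, 0, 1, 1]

refreshed = DramRow(data, retention_ticks=10)
refreshed.tick(6)
refreshed.refresh()          # restores full charge before the data is lost
refreshed.tick(6)
kept = refreshed.read()

neglected = DramRow(data, retention_ticks=10)
neglected.tick(12)           # no refresh within the retention time
lost = neglected.read()
```

In this model the refreshed row still returns its original data after twelve ticks, while the row that was never refreshed returns all zeros.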
It has been common practice in DRAM design to organize the memory cells into equal numbers of rows and columns, forming a square area on the chip die. A 1 M-bit DRAM is therefore roughly organized as 1K-by-1K, depending on the height and width of each cell. Access to such memory involves selecting whole rows, where only a portion of the columns is manipulated in any one access. Row decoders are used to select which row in a memory core is to be accessed and column decoders are used to select the columns that match the system memory address. Sense amplifiers and latches are used to read and hold the data values in peripheral circuits, because the way the data are stored in the individual memory cells is incompatible with the external logic levels.
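The row and column selection described above can be sketched for a 1 M-bit device organized as 1K-by-1K. The sketch assumes that the upper ten address bits select the row and the lower ten select the column; the actual bit assignment is device specific, and the names ROW_BITS, COL_BITS, and decode_address are hypothetical.

```python
ROW_BITS = 10  # 1K rows (assumed split of a 20-bit address)
COL_BITS = 10  # 1K columns


def decode_address(address):
    """Split a flat bit address into (row, column) for a 1K-by-1K array."""
    if not 0 <= address < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("address outside the 1 M-bit array")
    row = address >> COL_BITS                  # upper bits drive the row decoder
    column = address & ((1 << COL_BITS) - 1)   # lower bits drive the column decoder
    return row, column


first = decode_address(0)
end_of_row = decode_address(1023)
next_row = decode_address(1024)
last = decode_address((1 << 20) - 1)
```

Consecutive addresses within one row differ only in their column bits, so a whole row can be selected once while the column decoder picks out the portion being accessed.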
A principal reason that DRAM designers have been interested in reducing the power consumption of devices is to keep the heat dissipation to reasonable levels. With more than a million bits per DRAM chip now common, whatever power is dissipated in each memory cell is multiplied by a million or more for the whole chip. For example, Katsutaka Kimura, et al., describe various power reduction techniques that are conventional in DRAM design in their article, "Power Reduction Techniques in Megabit DRAM's," IEEE Journal of Solid-State Circuits, Vol. SC-21, No. 3, pp. 381-388 (June 1986). They seem to settle on using CMOS technology with half-Vcc precharge as their preferred solution for DRAM's over 1 M-bit.
Another similar discussion is by Kiyoo Itoh, et al., in "Trends in Low-Power RAM Circuit Technologies," Proceedings of the IEEE, Vol. 83, No. 4, pp. 524-543 (April 1995). This article describes how lowering RAM power consumption can be helpful in portable battery powered equipment. The focus is on ways the charging capacitance, operating voltage, and DC static current can all be reduced to save on the overall power consumed by a RAM. A preferred method here for reducing power consumption is to use partial activation of multi-divided data-line and shared I/O circuits.
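The three factors named above can be related in a rough power estimate. The sketch below is an illustration under stated assumptions, not a formula quoted from the cited article: it treats dynamic power as the charge drawn from the supply per cycle (capacitance times voltage swing) multiplied by the supply voltage and the cycle rate, and adds a static term for DC current. All numeric values and names are hypothetical.

```python
def ram_power_watts(charged_capacitance_f, swing_v, supply_v, cycle_hz, dc_current_a):
    """Rough RAM power estimate: dynamic charging power plus static power."""
    dynamic = charged_capacitance_f * swing_v * supply_v * cycle_hz
    static = dc_current_a * supply_v
    return dynamic + static


# Hypothetical operating point, chosen only to make the arithmetic concrete.
SUPPLY_V = 3.3
SWING_V = SUPPLY_V / 2       # assumes a half-supply precharge, so lines swing by half
CYCLE_HZ = 10e6
DC_CURRENT_A = 1e-3
TOTAL_DATA_LINE_F = 1e-9     # capacitance charged if every data line is activated

all_lines = ram_power_watts(TOTAL_DATA_LINE_F, SWING_V, SUPPLY_V,
                            CYCLE_HZ, DC_CURRENT_A)

# Partial activation: the data lines are divided into 8 segments and only
# one segment is charged per access (an assumed division factor).
one_segment = ram_power_watts(TOTAL_DATA_LINE_F / 8, SWING_V, SUPPLY_V,
                              CYCLE_HZ, DC_CURRENT_A)
```

In this estimate partial activation cuts only the dynamic term, which is why reductions in operating voltage and DC static current are pursued alongside it.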
The popularity of portable computers and devices powered by batteries has been increasing. But batteries with very high energy storage capability continue to elude designers. So the answer to longer operational battery life is to draw less power for a given application. Thus, even in DRAM's where heat dissipation is not a problem, it is nevertheless important to reduce power consumption to extend operating time for portable systems because such a large portion of the overall system power is consumed by the DRAM's.