The present invention relates to digital signal processing (DSP) and, more particularly, to the integration of processors and memory.
Recent years have seen many advances in VLSI (Very Large Scale Integration) technology. Minimum feature sizes on integrated circuits (ICs) continue to shrink, permitting dramatic improvements in processing speeds, reduced power consumption and increased functional density. Due to higher functional integration, new processing architectures for microprocessors (MPUs) and digital signal processors (DSPs) achieve higher performance by employing such techniques as VLIW (Very Long Instruction Word) and SIMD (Single Instruction Multiple Data). Other improvements in integrated circuit fabrication technology have made much denser RAMs possible, and have brought forth new memory architectures that promise substantial improvements in memory access efficiency for certain applications.
Traditionally, memory chip architecture and fabrication techniques have been cost and volume driven, while processing architectures and fabrication techniques have been performance and speed driven. New and emerging applications of MPUs and DSPs tend to require massive, high-speed data arrays, which in turn require massive, high-speed, locally-connected memories. The traditional design goals for memories and processors have resulted in a few performance and configuration "gaps" between DSP and memory functionality:
Operating Frequency: Although clock frequencies for DSPs or MPUs are approaching 500 MHz, the maximum access rates for RAM memories are only approaching 150 MHz. Hence, a typical DSP or MPU may be capable of processing and execution speeds more than three times faster than the RAM to which it must connect.
Data Bus and Address Bus width: DSPs and MPUs, being performance-driven architectures, have moved rapidly towards very wide address and data buses. Memories, on the other hand, particularly Dynamic RAM architectures, tend to be rather "stingy" with package pins, and have moved towards such techniques as minimizing pinout by multiplexing the address bus, which limits their performance and tends to complicate interface circuitry.
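The two gaps listed above can be put into rough numbers. The following is a minimal sketch, assuming a simple synchronous model in which each memory access must span a whole number of processor clock cycles and each address transfer uses all available address pins. The function names and the 22-bit address width are hypothetical and chosen only for illustration; the 500 MHz and 150 MHz figures are those given above.

```python
import math


def wait_states(cpu_mhz: float, mem_mhz: float) -> int:
    """Processor cycles spent idle per memory access.

    Assumes (for illustration) that one memory access occupies a whole
    number of processor clock cycles.
    """
    cycles_per_access = math.ceil(cpu_mhz / mem_mhz)
    return cycles_per_access - 1


def address_cycles(address_bits: int, address_pins: int) -> int:
    """Transfers needed to deliver one full address over the given pins."""
    return math.ceil(address_bits / address_pins)


# Frequencies taken from the text: 500 MHz processor, 150 MHz RAM.
print(wait_states(500, 150))

# Hypothetical 22-bit address: multiplexed onto 11 pins versus a full-width bus.
print(address_cycles(22, 11), address_cycles(22, 22))
```

Under these assumptions the processor idles for three of every four cycles on a memory access, and a multiplexed address bus needs two transfers where a full-width bus needs one.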
To overcome these performance and configuration gaps between memory architecture and processing architecture, memory designers have devised a number of improvements to the external interfaces of RAM memories. Among these improvements are Rambus DRAM (RDRAM), Sync-Link and Synchronous Graphics RAM (SGRAM). However, even these improved DRAMs have some important, limiting constraints on their usage:
DRAM is typically only available in huge binary multiple increments (e.g., 4 Mbytes, 8 Mbytes, 16 Mbytes, etc.). If "HUGE_INCREMENT" plus 1 byte is required for a particular application, the designer is essentially "forced" into using twice the "HUGE_INCREMENT" amount of memory, and the remaining memory is wasted.
Cost is also a prevalent problem, and relates to memory granularity and architecture. For some applications, the size of general purpose RAM is not optimum. An example is when an application requires 4.2 MB of application specific memory and only 4 MB and 16 MB RAMs are available. A 4.2 MB application specific memory module could cost less than two 4 MB RAMs or one 16 MB RAM if it were to be produced in sufficiently large volume to cover the development and production costs.
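The granularity problem can be illustrated with a short calculation. This is a sketch only: the function names are hypothetical, and the part sizes are the ones used in the example above (4 MB and 16 MB parts, 4.2 MB required).

```python
def smallest_fit(required_mb: float, available_mb):
    """Smallest single standard part that holds the requirement, or None."""
    fits = [size for size in sorted(available_mb) if size >= required_mb]
    return fits[0] if fits else None


def wasted_mb(required_mb: float, provided_mb: float) -> float:
    """Memory paid for but not needed by the application."""
    return provided_mb - required_mb


required = 4.2  # MB, from the example in the text

# Option 1: one 16 MB part (the 4 MB part alone is too small).
single = smallest_fit(required, [4, 16])
print(single, wasted_mb(required, single))

# Option 2: two 4 MB parts, giving 8 MB in total.
print(2 * 4, wasted_mb(required, 2 * 4))
```

Either way a large share of the purchased memory goes unused: roughly 11.8 MB with a single 16 MB part, or roughly 3.8 MB with two 4 MB parts.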
In an attempt to address these problems directly, there has been some research on the integration of a processor and DRAM onto a single chip. Most of this work consists of integrating the two functions (DRAM and processor) by using the fabrication process of one function and adapting the design of the other function to fit, for example by integrating an MPU function onto a DRAM process by altering the multi-layer metal process of a processor to use the polysilicon-connected fabrication process of a DRAM. Unfortunately, this tends to adversely impact the processor's performance, since polysilicon connections are inherently more resistive than metal interconnection layers, resulting in "slower" circuits due to RC (resistive-capacitive) delay from the interaction between the polysilicon connections and on-chip parasitic capacitances.
Evidently, there is a need for a DSP or MPU with Embedded DRAM which is cost-effective and performs better than conventional DSP/DRAM or MPU/DRAM pairings.
The following documents, all of which are US patents, all of which are incorporated by reference herein, disclose various techniques having some relevance to the present invention.
U.S. Pat. No. 5,663,570 (September 1997) discloses a high-frequency wireless communication system on a single ultrathin silicon on sapphire chip. The devices are fabricated using conventional bulk silicon CMOS processing techniques. See also related U.S. Pat. No. 5,492,857 (February 1996).
U.S. Pat. No. 5,642,295 (June 1997) discloses systems utilizing a single chip microcontroller having non-volatile memory devices and power devices.
U.S. Pat. No. 5,634,108 (May 1997) discloses a single chip processing system utilizing general cache and microcode cache enabling simultaneous multiple functions.
U.S. Pat. No. 5,625,836 (April 1997) discloses SIMD/MIMD processing memory element (PME). Eight processors on a single chip have their own associated processing element, significant memory, and I/O, and are interconnected with a hypercube-based topology. Particular attention is directed to column 22 lines 54-55 of this patent, wherein it is stated (with reference to FIG. 2 of the patent) that "we combine both significant memory and I/O and processor into a single chip." As also described therein (column 20, lines 49-50), "our device is a 4 MEG CMOS DRAM believed to be the first general memory chip with extensive rom for logic." See also related U.S. Pat. No. 5,588,152 (December 1996) which discloses advanced parallel processor including advanced support hardware.
U.S. Pat. No. 5,506,437 (April 1996) discloses a microcomputer with high density RAM in separate isolation well on a single chip. See also related U.S. Pat. No. 5,491,359 (February 1996).
U.S. Pat. No. 5,473,573 (December 1995) discloses single chip controller-memory device and a memory architecture and methods suitable for implementing same.
U.S. Pat. No. 4,942,516 (July 1990) discloses single chip integrated circuit computer architecture.
U.S. Pat. No. 4,734,856 (March 1988) discloses autogeneric system.
Unless otherwise noted, or as may be evident from the context of their usage, any terms, abbreviations, acronyms or scientific symbols and notations used herein are to be given their ordinary meaning in the technical discipline to which the invention most nearly pertains. The following terms, abbreviations and acronyms may be used in the description contained herein:
A/D: Analog-to-Digital (converter).
ALU: Arithmetic Logic Unit.
ASIC: Application-Specific Integrated Circuit.
bit: binary digit.
byte: eight contiguous bits.
CAM: Content-Addressable Memory.
CMOS: Complementary Metal-Oxide Semiconductor.
CODEC: Encoder/De-Coder. In hardware, a combination of A/D and D/A converters. In software, an algorithm pair.
CPU: Central Processing Unit.
D/A: Digital-to-Analog (converter).
DRAM: Dynamic Random Access Memory
DSP: Digital Signal Processing (or Processor)
EEPROM: Also E2PROM. An electrically-erasable EPROM.
EPROM: Erasable Programmable Read-Only Memory.
Flash: Also known as Flash ROM. A form of EPROM based upon conventional UV EPROM technology but which is provided with a mechanism for electrically pre-charging selected sections of the capacitive storage array, thereby effectively "erasing" all capacitive storage cells to a known state.
FPGA: Field-Programmable Gate Array.
G: (or giga), 1,000,000,000.
Gbyte: gigabyte(s).
GPIO: General Purpose Input/Output.
HDL: Hardware Description Language.
IC: Integrated Circuit.
I/O: Input/Output.
IEEE: Institute of Electrical and Electronics Engineers
JPEG: Joint Photographic Experts Group
k: (or kilo), 1000.
KHz: KiloHertz (1,000 cycles per second).
MAC: Media Access Control.
Mask ROM: A form of ROM where the information pattern is "masked" onto memory at the time of manufacture.
MCM: Multi-Chip Module.
memory: hardware that stores information (data).
M: (or mega, or MEG), 1,000,000
MHz: MegaHertz (1,000,000 cycles per second).
MLT: Multi-Level Technology.
NVRAM: Non-volatile RAM.
PLL: Phase Locked Loop
PROM: Programmable Read-Only Memory.
PWM: Pulse Width Modulation.
PLD: Programmable Logic Device.
RAM: Random-Access Memory.
RISC: Reduced Instruction Set Computer (or Chip).
ROM: Read-Only Memory.
SIE: Serial Interface Engine.
software: Instructions for a computer or CPU.
SRAM: Static Random Access Memory.
UART: Universal Asynchronous Receiver/Transmitter.
USB: Universal Serial Bus.
UV EPROM: An EPROM. Data stored therein can be erased by exposure to Ultraviolet (UV) light.
VHDL: VHSIC (Very High Speed Integrated Circuit) HDL.
An object of the present invention is to provide an improved technique for interfacing a DSP or CPU processor and memory.
Another object of the invention is to provide an efficient integration of DSP or CPU processor and memory on a single integrated circuit (IC) chip.
According to the invention, a DSP or MPU processor and DRAM are integrated on a single chip by utilizing one or more of the following techniques:
Since the architectures of a DSP or MPU are optimally configured for performance and those of DRAM for density and costs, radical changes in MPU (or DSP) and DRAM architectures and circuits are inappropriate. Accordingly, the present invention uses an approach that makes small but efficient changes to the processor and DRAM architectures and circuits, specifically:
Masking off one portion of the chip while one function (processor or DRAM) is fabricated, then effectively "reversing" the mask to fabricate the other function (DRAM or processor), integrating each function using its "native" process. Of course, this could potentially require a great number of process steps, resulting in a significantly higher fabrication cost than for other chips of the same size, and would do nothing to address the functional/architectural performance gaps between processors and DRAM. To reduce the number of process steps, common fabrication processes are preferably performed on both functions at the same time.
The following techniques address architecture-based performance improvements:
Organizing the DRAM on the chip in a wide word configuration. Because no external pins are used to connect the DRAM to the processor, there is no need to conserve them. The processor is connected directly to the wide DRAM, thereby providing high bandwidth to the memory.
Eliminating address multiplexers and latches in the DRAM. Since the processor and DRAM co-reside on the same integrated circuit (IC) chip, the need for a multiplexed address bus is eliminated, and the processor's address signals can be connected in parallel directly to the DRAM's array addressing inputs.
Implementing DRAM "word" lines in metal instead of polysilicon. This is made possible by exploiting the additional metal lines available in the fabrication process of the processor. The lower resistance of the word lines speeds up the performance of the DRAM circuit.
The density of the less performance-critical logic circuits in the processor can be increased by using multiple polysilicon interconnect layers available from the DRAM fabrication process.
These techniques take maximum advantage of the best characteristics of both technologies, and enhance the performance of the combination while making only minor architectural changes, if any, to either function (processor or DRAM).
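The bandwidth benefit of a wide word organization can be estimated as word width multiplied by access rate. The sketch below assumes one full word is transferred per access; the 16-bit and 256-bit widths are hypothetical values chosen for illustration, while the 150 MHz access rate is the RAM figure mentioned in the background.

```python
def peak_bandwidth_mbytes_per_s(word_bits: int, access_mhz: float) -> float:
    """Peak transfer rate in Mbytes/s, assuming one word per access."""
    return (word_bits / 8) * access_mhz


# Hypothetical widths for illustration only (not taken from the text).
external_narrow = peak_bandwidth_mbytes_per_s(word_bits=16, access_mhz=150)
on_chip_wide = peak_bandwidth_mbytes_per_s(word_bits=256, access_mhz=150)

print(external_narrow, on_chip_wide, on_chip_wide / external_narrow)
```

At the same access rate, peak bandwidth scales directly with word width, which is why removing the pin-count constraint matters even if the DRAM array itself is no faster.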
In digital signal processing (DSP) applications, two of the most widely used processing algorithms are the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT). Both algorithms involve generating a series of sum-of-product terms resulting from the multiplication of two matrices. A design that optimizes these functions requires two operands to compute each product and produces a result (sum-of-product) every cycle. Therefore, for optimal system performance, the processor/memory system of the present invention is configured to be capable of fetching two operands and storing a result during every clock cycle.
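The sum-of-products structure can be seen in a direct (non-fast) DFT, sketched below. This is an illustrative software model, not the hardware described here; the function name and the on-the-fly computation of the coefficients are assumptions made for brevity. Each pass through the inner loop uses two operands (an input sample and a transform coefficient) and updates one accumulated result.

```python
import cmath


def dft(samples):
    """Direct DFT: each output is a sum of products of two operands."""
    n = len(samples)
    outputs = []
    for k in range(n):
        acc = 0j  # running sum-of-products (the result)
        for t in range(n):
            coeff = cmath.exp(-2j * cmath.pi * k * t / n)  # operand 2
            acc += samples[t] * coeff  # operand 1 times operand 2, accumulated
        outputs.append(acc)
    return outputs


# A constant input has all of its energy in the zero-frequency term.
spectrum = dft([1, 1, 1, 1])
print([round(abs(value), 6) for value in spectrum])
```

An n-point transform computed this way performs n times n multiply-accumulate steps, so the rate at which operand pairs can be fetched and results stored bounds the overall throughput.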
According to a preferred embodiment of the present invention, such a system is constructed by dividing the DRAM memory into at least three independent blocks: one for each of the two operands, and one for the result. By keeping the memory blocks (or "banks") separate from one another, all three memory operations (two fetches and one store) can occur in parallel (simultaneously).
According to an aspect of this preferred embodiment, the DRAM can be divided into four independent blocks (banks), three for the aforementioned operands and result, and a fourth being used as program memory, thereby permitting all three data accesses and an instruction fetch to be performed in parallel, without interfering with one another.
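The effect of banking can be modeled with a simple counting argument. The sketch below assumes each bank serves at most one access per cycle, so accesses that land in the same bank are serialized; the function name, stream names and bank numbering are hypothetical.

```python
from collections import Counter


def cycles_per_step(bank_of):
    """Cycles needed for one processing step.

    `bank_of` maps each access stream to a bank id. Assumes each bank
    serves one access per cycle, so the busiest bank sets the pace.
    """
    return max(Counter(bank_of.values()).values())


streams = ["operand_a", "operand_b", "result", "instruction"]

one_bank = {name: 0 for name in streams}
three_banks = {"operand_a": 0, "operand_b": 1, "result": 2, "instruction": 0}
four_banks = {name: bank for bank, name in enumerate(streams)}

print(cycles_per_step(one_bank))     # all four accesses share one bank
print(cycles_per_step(three_banks))  # instruction fetch contends with operand_a
print(cycles_per_step(four_banks))   # every access has its own bank
```

In this model a single shared memory needs four cycles per step, three banks with a shared instruction fetch need two, and four independent banks reach one step per cycle.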
An integrated circuit (IC) employing the techniques of the present invention may be included in a system or subsystem having electrical functionality. Example systems may include general purpose computers; telecommunications devices (e.g., phones, faxes, etc.); networks; consumer devices; audio and visual receiving, recording and display devices; vehicles; etc. It is within the scope of the invention that such systems would benefit substantially from the technique(s) of the present invention.