Multiple-port memory architectures are commonplace due to the current trend toward multi-processing and distributed processing. In general, such designs require at least a limited amount of shared memory for purposes of communication between processes. In such situations, the memory access latency can increase in proportion to the number of sharing processes.
Practical memory implementations become progressively slower as the size of the array increases, due to changes in the required technology. From fast static random access memory (SRAM), to slower but denser dynamic RAM (DRAM), to far slower magnetic or optical disc storage, the major price paid for increasing storage size is speed.
As memory densities increase, the relative system memory bandwidth deteriorates, even though the physical bandwidth may remain constant or show moderate improvement. The problem lies with practical packaging limitations, which allow only limited access to the memory elements. This hardware bottleneck prevents the use of higher density memory in applications where the resulting decrease in package or pin count would adversely affect system bandwidth. For example, four 1K-by-8-bit memory chips used to implement a 32-bit memory application could not be replaced by a single 4K-by-8-bit part without incurring a performance drop. In this case, the increased density of the part is secondary to the need for the 32 data lines.
It is an unfortunate fact that although increases in memory density have equaled or surpassed corresponding increases in logic density, the speed of practical memory arrays has not increased in proportion to the gains made in practical CPU execution cycle times. Since von Neumann type computers are, in general, memory intensive, due to their need for continuous instruction fetches from memory, the actual computer cycle time will approach that of the memory unless a memory design is implemented that provides a degree of leverage between CPU and memory hardware. The two most widely used techniques for increasing memory performance are interleaving and caching.
Leverage is usually generated through increased parallelism within the memory array itself. Interleaving is a method which allows more than one memory cycle to be active at any time by dividing the array into modules which are independently triggered by access requests. If organized to accommodate contiguous address accesses, memory of any speed can appear to run at the CPU cycle speed, as long as the memory is used in a nearly sequential manner. A latency factor is present, however, that delays data from the time of request to the time of use by up to several CPU cycle times, and since an interleaved architecture still performs a memory cycle for each access request, multiple-port implementations suffer significant performance degradation.
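The timing behavior of an interleaved array can be modeled with a short sketch. The module count and cycle times below are arbitrary illustrative assumptions, not a description of any particular hardware:

```python
# Minimal model of memory interleaving: consecutive addresses map to
# independent modules, so sequential accesses overlap their cycle times.

def interleaved_access_time(addresses, num_modules, module_cycle, cpu_cycle):
    """Total time to service the address stream.

    Each module is busy for `module_cycle` after starting an access; the
    CPU issues one request per `cpu_cycle`, stalling until the target
    module (address % num_modules) is free.
    """
    busy_until = [0.0] * num_modules      # when each module becomes free
    time = 0.0
    for addr in addresses:
        m = addr % num_modules
        start = max(time, busy_until[m])  # wait if the module is still busy
        busy_until[m] = start + module_cycle
        time = start + cpu_cycle          # next request issue time
    return time

# A sequential stream hides the slow module cycle behind parallelism:
seq = list(range(16))
t_interleaved = interleaved_access_time(seq, num_modules=4,
                                        module_cycle=400, cpu_cycle=100)
t_single = interleaved_access_time(seq, num_modules=1,
                                   module_cycle=400, cpu_cycle=100)
```

With four modules and a module cycle four times the CPU cycle, the sequential stream is serviced at one request per CPU cycle, whereas the single-module case is limited by the module cycle time itself.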
Caches are also employed as a type of leverage device, usually in conjunction with a prefetch mechanism. The cache itself is a small to moderate sized high speed memory array which provides buffering between the CPU and the main memory. An initial access by the CPU to main memory causes the cache to generate a copy of main memory localized around the accessed location. Further accesses by the CPU will then have a high probability of being served from the cache rather than main memory, and thus the apparent speed of the memory is improved. Additionally, the utilization of the main memory's bandwidth may increase, since the interface between memory and cache runs asynchronously to the CPU's data requirements. Caches do not depend to the same degree as interleaving on the sequential nature of data accesses; their performance does, however, depend on localized access of data. In addition, modification of data becomes a significant problem, particularly in a multi-processing/multiple-cache architecture. The degree of cache flushing utilized in such an environment affects both the overall bandwidth of the system and, quite possibly, the operation of the software, since updated events may not become global immediately. In addition, it can be shown that when data is accessed sequentially in large sets, such as video screen refreshes, the throughput rate of the cache architecture approaches that of the main memory (due to the lack of repeat accesses within a localized area).
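The dependence of cache performance on repeated, localized access can be seen in a small direct-mapped cache model. The cache sizes and access streams below are hypothetical, chosen only to contrast a localized working set with a one-pass sequential sweep such as a screen refresh:

```python
# Direct-mapped cache sketch: each address maps to exactly one cache
# line; a hit occurs only when the line already holds that block.

def cache_hit_rate(addresses, num_lines, words_per_line):
    tags = [None] * num_lines          # block tag stored per cache line
    hits = 0
    for addr in addresses:
        block = addr // words_per_line
        line = block % num_lines
        if tags[line] == block:
            hits += 1                  # word already resident in the cache
        else:
            tags[line] = block         # miss: fetch the block from memory
    return hits / len(addresses)

# Repeated accesses within a small working set: high hit rate.
local = [a % 64 for a in range(1000)]
# One pass over a large data set: every block is touched exactly once.
sweep = list(range(100_000))
```

The localized stream misses only on the first pass, while the sweep's hit rate is capped by the spatial locality within a single line, approaching main-memory throughput as the text describes.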
While both interleaving and cache techniques are well accepted for general computer applications, many applications use memory in a highly predictable manner, allowing further leverage of memory bandwidth. Imaging applications provide considerable opportunity for memory bandwidth enhancement due to the large data sets typically operated on as an integral unit. If an image is stored as a contiguous segment of memory, only the beginning picture element (pixel) needs to be explicitly defined in order to access the total image. In such an application, algorithmic determination of future memory locations can allow multiple data words to be retrieved from the array in a single access and queued into high speed shift registers for rapid transmission. This concept has been integrated into currently available memory parts, which are referred to as video RAMs (VRAMs).
The creation of memory architectures based on the VRAM principle allows the possibility of leveraging memory bandwidth beyond that of the cache or interleave implementations. Since multiple words can be manipulated by a single memory cycle, the memory bandwidth of a video RAM must be divided into two specifications: the memory cycle rate, which applies to random accesses into and out of the array itself, and the practical shift register input/output rate, which, while theoretically limited by the memory cycle rate, usually encounters a lower limit determined by the shift register hardware. An example is a typical 4 megabit VRAM with a 256 bit serial shift register. The memory cycle rate can support up to 1024 megabit/s, while an NMOS or CMOS serial shift register can support only 40-50 megabit/s. In this example, the part reaches a bandwidth limit well before the maximum memory cycle rate is achieved.
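The figures in this example can be checked arithmetically. The 4,000,000 cycle/s memory rate and 50 megabit/s shift clock below are assumed values chosen to match the nominal rates quoted above:

```python
# Worked form of the VRAM bandwidth example: the array-side bandwidth is
# the shift register width times the memory cycle rate, but the part is
# capped by the serial clock of the shift register hardware.

bits_per_transfer = 256                  # serial shift register width
memory_cycles_per_s = 4_000_000          # assumed cycle rate (250 ns cycle)

# Bandwidth the array could sustain through the register per second:
array_bandwidth = bits_per_transfer * memory_cycles_per_s    # bit/s

# Practical serial output limit of an NMOS/CMOS shift register:
shift_register_limit = 50_000_000        # 50 megabit/s (upper figure)

# The part saturates at the smaller of the two specifications:
effective_bandwidth = min(array_bandwidth, shift_register_limit)
```

Here the array side works out to 1024 megabit/s, roughly twenty times the serial limit, so the shift register hardware is the binding constraint.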
It is important to obtain memory structures that use existing VRAM concepts and which, for graphics applications, can achieve actual bandwidths in excess of those of the memory parts used to implement the memory array. Bandwidth enhancement allows the use of a global memory in graphics systems, which typically use distributed memory to decouple the bandwidth intensive screen refresh task from the image creation and rasterization tasks. A global memory, in turn, reduces the communication overhead between tasks, which for graphics applications can involve large data transfers (e.g., character fonts, display lists, rasterized images).
A typical graphics architecture separates the creation of an image into the following tasks: (1) the content of an image is composed of a combination of scanned image data and/or high level description languages; (2) the resulting image description is subjected to data compression techniques in order to reduce the amount of memory required for storage; (3) in order to display or plot the image, it is necessary to decompress the image description, and then interpret the high level description into a new description, composed of low level graphics primitives which are supported by the system hardware; (4) the graphics primitives are used to recreate the image in a bit mapped frame buffer; and (5) the completed frame buffer is used as input to the cathode ray tube (CRT) or plotter which reproduces the original image.
For maximum efficiency, the above steps are typically overlapped, allowing the display of one image while another image is being bit mapped and yet another is decompressed. To accomplish this overlapping or pipelining, distributed processing is used, with each processor operating on data generated by a previous processor. The use of distributed processing allows each hardware step to be custom designed for its intended task, mixing bit slice designs with microprocessors as needed.
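The overlap can be pictured as a chain of stages, each consuming the previous stage's output. The sketch below is purely illustrative: the stage names follow the steps above, but their bodies are placeholders standing in for the actual decompression, rasterization, and display hardware:

```python
# Pipelined image flow: composing the stages means several images are
# in flight at once, each stage working on a different image.

def decompress(descriptions):
    for d in descriptions:              # stage 3: expand each description
        yield f"decompressed({d})"

def rasterize(primitives):
    for p in primitives:                # stage 4: build the bit map
        yield f"bitmap({p})"

def display(frames):
    # stage 5: drive the CRT or plotter from each completed frame buffer
    return [f"shown({f})" for f in frames]

images = ["img0", "img1", "img2"]
out = display(rasterize(decompress(images)))
```

Because the stages are generators, `img1` can be decompressed while `img0` is being rasterized, mirroring the overlapping described in the text.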
Unfortunately, the use of distributed processing usually results in a distributed memory structure, so that each processing element has adequate memory bandwidth to support its associated tasks. This structure results in an overall performance degradation due to the need to transfer output data between physical memory segments.
If the above graphics system were to be implemented using a common multiple-port memory, one with sufficient bandwidth to support the combined requirements of all of the distributed processes, the overhead necessary for intertask communications would be drastically reduced. Instead of transferring large data sets between memory segments, only memory pointers need be communicated, each identifying the beginning of a contiguous block of data.
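The pointer-passing idea can be sketched as follows. The shared array, offsets, and task names are hypothetical stand-ins for a common multiple-port memory and its producer and consumer tasks:

```python
# In a shared (global) memory, a producer hands a consumer only a small
# (offset, length) descriptor rather than copying the data set itself.

shared = bytearray(1 << 16)        # stand-in for the common global memory

def produce(offset, data):
    """Writer task: place a data set into shared memory, return a handle."""
    shared[offset:offset + len(data)] = data
    return (offset, len(data))     # only this small handle crosses tasks

def consume(handle):
    """Reader task: recover the data set from the descriptor alone."""
    offset, length = handle
    return bytes(shared[offset:offset + length])

handle = produce(4096, b"rasterized image block")
recovered = consume(handle)
```

However large the block, the intertask traffic is just the two-word handle, which is the overhead reduction the common-memory approach provides.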
It is an object of the present invention to enhance the ability of a digital processing system to transfer large data sets between memory elements at reduced overhead.
It is another object of the present invention to provide a digital processing system that allows data in the shift register to be transferred at a rate well above that of the memory array.
It is still a further object of the present invention to provide a digital processing system that enhances the ability of graphic systems to capture data, to store data, and to reproduce the original image data.
These and other objects and advantages of the present invention will become apparent from a reading of the attached specification and appended claims.