The present invention relates to computer memory architecture and, more particularly, to high speed, multi-ported, direct data flow memory architecture.
Many computer I/O systems utilize a buffered data flow protocol. With such an architecture, data is moved into the system through one interface, is buffered temporarily, and is then moved out through another interface. Often the data path is separate from the command/control path, but it need not be.
To increase overall system bandwidth, it is highly desirable to move the data over a given system bus or data pathway only once. Multi-ported memory architectures are used to buffer data, while satisfying this single pathway requirement. Data is input through one port and output through another.
The most common memory of this type is dual-port, but higher numbers of ports are also used. The memories may be implemented to function with component-level buses or with system-level buses, such as PCI. The latter has become a preferred component interconnection means due to the large numbers of personal computers using the bus, and the peripheral devices supporting it.
Multiple port memory architectures are generally implemented today in one of two ways. The first method uses memory composed of true multi-ported memory cells and is found in certain static random access memories (SRAMs). The second method uses a single-ported memory array (typically DRAM of some sort) with a multiplexing scheme that permits the ports to access the memory one at a time. If the speed of the memory is significantly higher than that of the ports, and a speed-matching (synchronizing) mechanism is provided, then the memory may appear to be simultaneously accessed by multiple ports. This method is used by most PC bridge chip sets to multi-port a single bank of memory to a processor, a system I/O bus and a video bus.
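The second method can be illustrated with a minimal sketch of a port arbiter in front of a single-ported memory. The round-robin policy and the function name `round_robin_grant` are assumptions made for illustration only; the text above does not say which arbitration policy any particular chip set uses.

```python
from typing import List, Optional


def round_robin_grant(requests: List[bool], last_granted: int) -> Optional[int]:
    """Pick the next requesting port after `last_granted` (hypothetical policy).

    requests[i] is True when port i wants the single-ported memory.
    Returns the granted port index, or None when no port is requesting.
    """
    n = len(requests)
    # Start searching just after the previous winner so no port is starved.
    for offset in range(1, n + 1):
        port = (last_granted + offset) % n
        if requests[port]:
            return port
    return None


# Example: ports 0 and 1 both request; port 0 won last time, so port 1 is granted.
grant = round_robin_grant([True, True, False], last_granted=0)
```

Only one port is granted per decision, which is why the memory must run significantly faster than the ports for the accesses to appear simultaneous.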
A true dual-ported SRAM structure has certain advantages, among which are:
a) random accesses may occur on each port simultaneously without causing any access delay at the other port;
b) the initial access to a sequential block of memory takes the same time as do all subsequent accesses, so there is no initial access overhead time penalty; and
c) dual port SRAM designs are inherently simple, since multiplexing is not necessary.
Unfortunately, true dual-ported SRAM systems also have certain disadvantages, among which are:
a) the memory is limited to only two ports;
b) the density of the memory is relatively low, typically requiring many chips per megabyte; and
c) the cost of the memory is very high, typically an order of magnitude more than the cost of DRAM.
The advantages and disadvantages of a multi-ported DRAM architecture, on the other hand, are opposite those of the SRAM. When an application requires more than 100 Kbytes of buffering, the cost of multi-port SRAM becomes prohibitive.
If more than two ports are required, a multiplexing scheme must be implemented, raising the overall cost of the system. In these cases, the most cost effective method of implementing a multi-ported RAM is by using a multiplexed DRAM architecture.
In typical operation, a device reads or writes data in blocks to and from the buffer memory. The size of these blocks varies according to the application. A memory controller usually performs memory accesses in fixed sizes, with the width and depth of the access set to a predetermined burst size. On a read from memory, when an external device uses more than one memory burst of data, the memory controller performs additional accesses as necessary. When the external device uses less than one burst of data, the additional data fetched from memory is discarded.
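The read behavior described above can be sketched as a short calculation of how many fixed-size bursts a device read needs and how much fetched data is discarded. The function name `read_bursts` and the assumption that bursts are aligned to burst-size boundaries are illustrative and are not taken from the text.

```python
from typing import Tuple


def read_bursts(start_byte: int, length: int, burst_bytes: int) -> Tuple[int, int]:
    """Return (bursts performed, bytes fetched but discarded) for one device read.

    Assumes (hypothetically) that every memory burst starts on a multiple of
    burst_bytes, so a read that straddles a boundary needs an extra burst.
    """
    if length <= 0 or burst_bytes <= 0:
        raise ValueError("length and burst_bytes must be positive")
    first = start_byte // burst_bytes                 # burst holding the first byte
    last = (start_byte + length - 1) // burst_bytes   # burst holding the last byte
    bursts = last - first + 1
    discarded = bursts * burst_bytes - length
    return bursts, discarded


# A 16-byte read against a 64-byte burst fetches one burst and discards 48 bytes.
small_read = read_bursts(0, 16, 64)
```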
On a write to memory of one or more full bursts, the controller performs the write operation. If the write is only to a portion of the fixed size block in memory, the controller reads the data block from memory, modifies the correct portions, and then writes the block back. Additional memory bandwidth may be required if complex caching is not performed.
DRAMs typically require several addressing and start-up cycles to begin an access to a consecutive block of data. This overhead is fixed, regardless of the amount of data read or written. As a result, the effective data bandwidth increases with larger data transfer bursts, because the overhead is averaged over a greater amount of data actually moved. For a single memory port, the longer the burst at a given clock speed, the greater the available bandwidth of the port. When implementing a multi-port memory, however, the metric of interest is not the raw bandwidth of a single port. Instead, it is the net bandwidth across a pair of ports when one port writes data to memory and the other port reads data from memory. In this situation, a long burst on one port adds an access delay to the other port, which results in additional overhead on that other port. Of course, if the data bursts are always very long, this overhead may be small compared with the data volume, resulting in acceptable performance.
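The single-port amortization effect can be sketched with a simple model in which a burst of n words costs a fixed overhead plus n data cycles. The default values (6 overhead cycles, 66 MHz clock, 32-bit width) are borrowed from the numerical example given elsewhere in this section; the model itself and the name `effective_bandwidth_mb_s` are assumptions and are not claimed to reproduce Table I.

```python
def effective_bandwidth_mb_s(burst_words: int,
                             overhead_cycles: int = 6,
                             clock_hz: float = 66e6,
                             word_bytes: int = 4) -> float:
    """Single-port bandwidth in MB/sec when each burst pays a fixed overhead.

    Simplified model (an assumption): one word moves per clock once the
    overhead cycles have elapsed, and nothing else delays the port.
    """
    if burst_words <= 0:
        raise ValueError("burst_words must be positive")
    peak = clock_hz * word_bytes / 1e6               # MB/sec with zero overhead
    return peak * burst_words / (overhead_cycles + burst_words)


# Longer bursts spread the fixed overhead over more data.
for words in (8, 32, 128):
    print(words, round(effective_bandwidth_mb_s(words), 1))
```

The model covers one port in isolation; it does not capture the delay that a long burst imposes on the other port.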
If the application does not require large data bursts, or uses a mix of large and small bursts, a large memory access size makes highly inefficient use of the bandwidth, resulting in unacceptable performance. This inefficiency arises from discarding large amounts of unused data on reads, and from reading and then writing large amounts of unmodified data on writes. Reducing the memory access size is advantageous; but, without additional mechanisms for intelligently mapping the variable sized device data transfers to the smaller fixed memory accesses performed by the controller, poor performance results.
As a numerical example, consider a memory with a 6-clock cycle access overhead (i.e., the time from initial request to first data is six clock cycles at the memory clock speed). If two ports request access simultaneously, then one must wait for the other to complete. Assuming a 66 MHz clock, 32-bit memory width, one clock cycle of port arbitration and the first port winning access, Table I indicates the data bandwidth of the ports as a function of the memory burst size for one burst.
As can be seen, the actual dual port throughput, which is the amount of data that can be moved into port 1 and out of port 2, is lower than the peak rates of either port. These rates do not include any overhead for the PCI buses to which they are connected. It can also be seen that the throughput levels off and does not approach the 132 MB/sec that a 32-bit PCI bus can sustain. Increasing burst size alone cannot deliver high multi-port throughput.
The throughput numbers shown above also assume that all of the data read from the memory is used. If only a small fraction of the fixed size memory burst data is actually used, then the larger fixed-size bursts waste memory bandwidth, and the effective throughput decreases further. If the data required and actually used were 512 bytes (128 32-bit words) and the memory burst size were set to 256 32-bit words, for example, then the effective throughput would be one half of that shown. Unless every PCI transfer is at least as large as the memory burst, bandwidth is wasted and effective throughput decreases.
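The one-half figure in this example follows from the ratio of words used to words fetched, as the following sketch shows. The function name `used_fraction` is hypothetical, and the assumption that a transfer larger than one burst is served by whole additional bursts is added for illustration.

```python
def used_fraction(used_words: int, burst_words: int) -> float:
    """Fraction of fetched memory data that the port actually uses.

    Assumes (for illustration) that memory is always read in whole bursts,
    so the number of bursts is the transfer size rounded up.
    """
    if used_words <= 0 or burst_words <= 0:
        raise ValueError("arguments must be positive")
    bursts = -(-used_words // burst_words)           # ceiling division
    return used_words / (bursts * burst_words)


# 128 words used out of a 256-word burst: half of the fetched data is wasted.
half = used_fraction(128, 256)
```

Multiplying a raw throughput figure by this fraction gives the effective throughput under the stated assumption.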
Another problem arises when PCI-delayed read transactions are performed. The PCI specification requires a target to issue a retry instruction if the latency for a read access is significant (i.e., exceeds a maximum preset number of bus clock cycles). With larger and larger memory bursts on the other ports of the memory, the latency for a given port to gain entry to the DRAM grows, forcing the target to issue retries. In these bus transactions, a data read is first posted at the interface, and the target interface issues a disconnect instruction. The target then prefetches the read data and waits for the master to retry the data access. When it does, the data is provided.
Whenever the number of PCI clock cycles required for a retry plus those required to restart the access is less than the number required to access the first data of a burst, there is some advantage to these split transactions. However, if the PCI accesses are performed in small bursts, then this condition is not met, and throughput suffers further. As an example, consider a PCI burst of 128 bytes (32 words of 4 bytes each). Assuming one other intervening transfer, the bus activity is as follows:
Device 1
1 clock cycle of arbitration,
1 clock cycle for address,
1 clock cycle for turn-around,
1 clock cycle to receive the retry,
1 clock cycle bus idle;
Device 2
1 clock cycle of arbitration,
1 clock cycle for address,
1 clock cycle for turn-around,
1 clock cycle to receive the retry,
1 clock cycle bus idle;
Device 1
1 clock cycle of arbitration,
1 clock cycle for address,
1 clock cycle for turn-around,
32 clock cycles to receive the data,
1 clock cycle bus idle.
The total time is 46 clock cycles which, at 33 MHz, corresponds to 92.7 MB/sec. This assumes that the memory access takes less than 10 PCI clock cycles, including the worst case access time of the other port into the memory. If the memory were to take one additional clock cycle to access the first data, then another retry would be issued, and another 6 clock cycles would be added to the transfer time, reducing throughput further. It can thus be seen that the ports are coupled. Larger memory bursts on one port introduce larger latencies and more retries on the other ports, which result in lower throughput.
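The cycle count and throughput of this example can be checked with a short calculation. The function names are hypothetical, and the clock value of 33.33 MHz (the nominal PCI clock rate) is an assumption chosen because it reproduces the 92.7 MB/sec figure.

```python
PCI_CLOCK_HZ = 33.33e6  # assumption: nominal PCI clock, quoted as 33 MHz in the text

# One attempt that ends in a retry: arbitration, address, turn-around,
# retry, bus idle (one clock cycle each, per the listing above).
RETRIED_ATTEMPT_CYCLES = 1 + 1 + 1 + 1 + 1


def delayed_read_cycles(data_words: int, retried_attempts: int) -> int:
    """Total bus cycles for a read that completes after some retried attempts."""
    # Completing attempt: arbitration, address, turn-around, data, bus idle.
    completing = 1 + 1 + 1 + data_words + 1
    return retried_attempts * RETRIED_ATTEMPT_CYCLES + completing


def throughput_mb_s(data_bytes: int, cycles: int, clock_hz: float = PCI_CLOCK_HZ) -> float:
    """Data bytes moved per second over the whole sequence, in MB/sec."""
    return data_bytes * clock_hz / cycles / 1e6


# Device 1 is retried once and Device 2 once; Device 1 then receives 32 words.
cycles = delayed_read_cycles(data_words=32, retried_attempts=2)
rate = throughput_mb_s(128, cycles)
```

Under these assumptions `cycles` evaluates to 46 and `rate` to roughly 92.7 MB/sec; each further retried attempt adds cycles without moving any data.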
It would be advantageous to provide a multi-port memory architecture optimized for both large and small bus data packets.
It would also be advantageous to provide a memory width and burst depth determination algorithm based on the number of ports, host data bus width and burst length, and host bus clock rate that would guide the design of a memory system implementing the architecture.
It would further be advantageous to provide an adaptive read pre-fetch algorithm for use with a multi-port memory architecture that would increase the efficiency of memory use and the obtainable memory throughput.
It would still further be advantageous to provide a memory data masking method that eliminates the need for read-modify-write cycles when altering portions of the wide memory words used in this architecture, so that the other advantages listed are obtained without the speed penalties of read-modify-write cycles.
In accordance with the present invention, there is provided a high speed, multi-ported, direct data flow memory architecture that permits memory width and basic transfer speed greater than system bus width and transfer speed to allow shallow burst depth and reduce other-port latencies, while maintaining high multi-port throughput. The inventive system has a data storage device (SDRAM) and a multiplexer connected to the SDRAM. The multiplexer provides matching between the memory data width and the port data width. Two or more interfaces or ports are provided with data sourcing controllers respectively connected to the interfaces. A communications bus connects the SDRAM to the data sourcing controllers for facilitating data communications. A FIFO buffer memory is located between the multiplexer and the data sourcing controllers. The FIFO buffers act as temporary data storage, and match the rate of data transfer of the port to the higher rate of the memory. The need for retries is eliminated by making the total amount of time required by the memory controller to satisfy the other ports' requests less than the time-out value that would require a retry to be generated. Read-ahead algorithms are provided that adapt the larger system bus burst sizes to the smaller memory burst sizes, without creating the need for additional retries, and with the ability to cancel unneeded advance requests for data. The total memory bandwidth is greater than that of the sum of the ports, so that the small memory burst size inefficiency does not reduce throughput below acceptable levels. Write data is selectively masked to eliminate the need for read-modify-write cycles. Reads and writes can begin and end on arbitrary byte addresses, regardless of memory or bus widths.
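The selective masking of write data can be illustrated with a minimal sketch that derives a per-word byte mask for a write beginning and ending on arbitrary byte addresses. The 16-byte memory word, the function name `write_masks`, and the convention that a set bit means "write this byte" are all assumptions for illustration; the text does not specify the mask encoding or how it is applied at the memory pins.

```python
from typing import List, Tuple


def write_masks(start: int, length: int, word_bytes: int = 16) -> List[Tuple[int, int]]:
    """Return (word_index, byte_mask) pairs covering a write of `length` bytes.

    Bit i of byte_mask is set when byte i of that memory word is written
    (hypothetical convention). Unmasked bytes are left untouched, so no
    read-modify-write cycle is needed for partial words.
    """
    masks = []
    end = start + length
    addr = start
    while addr < end:
        word = addr // word_bytes
        lo = addr % word_bytes                        # first byte written in this word
        hi = min(word_bytes, end - word * word_bytes) # one past the last byte written
        masks.append((word, ((1 << (hi - lo)) - 1) << lo))
        addr = (word + 1) * word_bytes                # continue at the next word boundary
    return masks


# A 4-byte write at byte address 14 touches the top 2 bytes of word 0
# and the bottom 2 bytes of word 1.
partial = write_masks(14, 4)
```

A full-word write yields a mask with every bit set, while the first and last words of an unaligned write receive partial masks.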