1. Field of the Invention
This invention relates to a device and method for performing high-speed block data transfers between modules connected to an input/output (I/O) bus in a computer system.
2. Description of the Prior Art
Several high-performance computer applications transfer a large volume of data between local memories on modules connected by a common, multimaster I/O bus (i.e., a global bus). In such an application, each module can act as a bus master; that is, it can temporarily take over the bus and directly read or write information in any other module, which acts as a slave. Examples of such applications include the following:
1. Multiport LAN bridge or router. Each module connects to an external local area network (LAN). Packets arriving at each module are temporarily stored in a local memory on the module. Eventually, packets are forwarded to local memories on other modules, which then forward them to the destination LAN.
2. Multimedia, multiclient file server. One or more modules connect to physical disk drives or other mass storage devices. Other modules connect directly to clients or to shared media (such as LANs) for communicating with clients. File transfers require data to be moved between local memory in the storage-device modules and local memory in the client modules.
3. High-performance distributed-processing workstation. A workstation may contain several processors, each tailored for a specific task. For example, a workstation may have one processor for running an operating system or user "shell", another for performing three-dimensional graphics transformations, and yet another for managing a graphics display. One way to structure such a workstation is to provide a local memory for each processor and to move data from one local memory to another as required for each processor to use the data.
Each of these applications simply requires blocks of data to be moved from one module's local memory to another's. In each case, the global bus provides the connection between modules. Either the source module becomes bus master and writes the block of data to the destination memory, or the destination module becomes bus master and reads the block from the source, as described below.
In the prior art, when a bus master requests a read operation, address, data, and control information flow in two directions:
1) The bus master sends a "read request" signal and an address to the slave.
2) The slave reads the data from its memory at the specified address.
3) The slave sends the data back to the bus master, along with an acknowledgment signal.
In the prior art, a write operation likewise involves two-way communication between the master and slave:
1) The bus master sends a "write request" signal, an address, and data to the slave.
2) The slave writes the data into its memory at the specified address.
3) The slave sends an acknowledgment signal back to the bus master.
In general, high-performance applications require block transfers on the global bus to be made as quickly as possible, in order to minimize the following effects:
1. Data transfer delay. Most applications require data to be transferred as quickly as possible, because the recipient of the data has nothing to do but wait until it receives the data. Examples are file transfers and packet transfers.
2. Processing overhead. Data transfers on the global bus may delay unrelated processor operations on the sending and/or receiving modules, because the processor may need access to the bus in order to fetch and execute instructions. An example is any module whose processor does not have a local instruction/data memory or cache.
3. Bus contention. Even if modules have local memories or caches, they may be delayed at times if they need to use the global bus to access an I/O port or other bus-connected resource at the same time that the bus is being used for a block transfer. In addition, other pending block transfers cannot begin until the current one completes. For example, a module that is performing local processing may be blocked while trying to read a global flag or send a message on the global bus.
Many different computer I/O bus structures are known in the art. They can be roughly grouped into two categories:
Synchronous bus. A common clock signal, generated at a central point, is distributed to all modules connected to the bus. All control signals and responses are timed with respect to the clock signal. Likewise, data and address setup and hold times are specified with respect to the clock. Synchronous buses include Multibus, the ISA bus (part of the Industry Standard Architecture for the IBM PC/AT computer), and the recently proposed EISA (Extended Industry Standard Architecture) bus.
Asynchronous bus. This bus has no common clock signal; bus timing is specified relative to the edges of control signals generated by the modules. The Unibus of the PDP-11 computer is an example from the minicomputer era. More recently, Motorola adopted an asynchronous control approach in the 68000 microprocessor, which was then formalized in the VME bus.
A synchronous bus has simpler control logic and is the natural choice for single-processor (single-bus-master) systems in which the bus clock is simply the processor clock or a derivative of it. On the other hand, it is conventionally believed in the art that an asynchronous bus potentially gives better performance in systems with multiple bus masters.
The conventional argument for better performance with asynchronous buses is as follows. First, assume that the bus must support a wide variety of bus master types and operation speeds (since processor technology keeps changing). Then, with a synchronous bus, each bus master and slave must synchronize with the bus clock, and a module-to-module transaction on the bus requires two synchronizations. Each synchronization takes 50-100 ns on average. Part of the synchronization time is the average delay until the next local-processor or bus clock edge (50 ns with a 10 MHz clock), and part of it is the metastability settling time for the synchronizing flip-flops (25 ns is needed with the very best flip-flops).
With an asynchronous bus, no synchronization to the bus or to a fixed-rate clock is required. Instead, each module on the bus is prepared to deal with asynchronous control signals at any speed up to a predefined maximum. When two processors communicate, one processor generates control signals synchronously with its local clock at its fastest available rate, and the other synchronizes with its own local clock. Only one synchronization occurs, and it takes place at the speed of the destination processor.
With a synchronous bus, total bus bandwidth is calculated as the product of the bus clock frequency and the number of data bits per transfer (word length), divided by the number of clock periods per transfer. Thus, there are three ways to increase the bandwidth of a synchronous bus:
1. Increase the clock frequency.
2. Increase the word length of the data bus.
3. Decrease the number of clock periods per transfer.
For example, the ISA bus has a 6 MHz clock, a 16-bit data bus, and uses 3 clock periods for a typical transfer. Thus, its bandwidth for typical transfers is 32 Mbits/sec. It is desirable to have much higher bandwidths than this.
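The bandwidth relationship stated above can be checked with a short calculation. The following Python sketch is illustrative only and is not part of the original disclosure; the function name `bus_bandwidth_bits_per_sec` and the three "what if" variations are assumptions chosen to show how each of the three listed approaches affects the result. Only the ISA figures (6 MHz, 16 bits, 3 clock periods) come from the text.

```python
def bus_bandwidth_bits_per_sec(clock_hz, word_bits, clocks_per_transfer):
    """Bandwidth of a synchronous bus, per the formula in the text:
    (clock frequency x word length) / clock periods per transfer."""
    if clocks_per_transfer <= 0:
        raise ValueError("clocks_per_transfer must be positive")
    return clock_hz * word_bits / clocks_per_transfer


# ISA figures quoted in the text: 6 MHz clock, 16-bit data bus, 3 clocks per transfer.
isa = bus_bandwidth_bits_per_sec(6_000_000, 16, 3)

# Hypothetical variations (not from the source), one per approach in the list:
faster_clock = bus_bandwidth_bits_per_sec(12_000_000, 16, 3)  # 1. double the clock
wider_word = bus_bandwidth_bits_per_sec(6_000_000, 32, 3)     # 2. double the word length
fewer_clocks = bus_bandwidth_bits_per_sec(6_000_000, 16, 1)   # 3. one clock per transfer

if __name__ == "__main__":
    for label, value in [
        ("ISA (from text)", isa),
        ("2x clock (hypothetical)", faster_clock),
        ("2x word length (hypothetical)", wider_word),
        ("1 clock/transfer (hypothetical)", fewer_clocks),
    ]:
        print(f"{label}: {value / 1e6:.0f} Mbits/sec")
```

With the ISA figures the function returns 32,000,000 bits/sec, matching the 32 Mbits/sec stated in the text. In this sketch, doubling the clock or the word length each doubles the result, while cutting the transfer from three clock periods to one triples it.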