This invention relates generally to the field of computers and of transmitting commands to and from a computer memory on a bus, and more particularly, relates to giving priority to a threshold number of small commands and then allowing a large direct memory access command to be executed on the bus.
The basic hardware structure of a computer is a processor and a memory inside the computer and a number of input/output (I/O) ports with which to communicate with the world outside the computer. The I/O ports are usually attached to at least one I/O adapter which communicates with other computers and devices. Computers use specific protocols for communications between their internal operating system programs and the I/O adapters to transfer information to various I/O devices such as external disk storage, communications, network capabilities, etc. The I/O protocols are specific command and response messages exchanged on an I/O bus interconnecting the computer's host processor and its memory, together called the host system, to I/O adapters or I/O processors. An I/O processor can be considered a complex I/O adapter having more functions, usually in support of operating system programs, and will be considered within the broader class of I/O adapters.
In I/O protocols, device driver programs in the computer's operating system create command messages that are transmitted across an I/O bus to the I/O adapter. The I/O adapter interprets the command and performs the requested operation. This operation may transfer data between an I/O device connected to the I/O adapter and the computer memory across the I/O bus. Typically, data are transferred using known direct memory access (DMA) mechanisms that are I/O bus functions. When the I/O adapter has completed the requested operation, it responds back to the computer memory. The operating system and device driver programs interpret that response and conclude the overall I/O operation.
An example of an I/O bus is the Peripheral Component Interconnect (PCI) bus architecture which includes a host system with a host processor complex and a main memory connected to a plurality of I/O adapters via a PCI bus. The conventional PCI bus is a 32-bit bus and operates at 33 MHz with a peak throughput of 132 megabytes per second. One way to think about the bandwidth is to imagine a 32-lane highway with a 33 mile per hour speed limit and the throughput as a measure of the total traffic or data passing through that highway in a given time period.
PCI-X is an updated PCI I/O bus specification released in late 1999 that breaks the "one gigabyte per second" barrier in sustainable bandwidth for use in high-bandwidth applications such as Gigabit Ethernet, Fibre Channel, Ultra3 SCSI and high-performance graphics. PCI-X supports 32-bit and 64-bit operations at frequencies up to 133 MHz to allow the performance capability of over 1 Gbyte/sec data throughput. Carrying the highway analogy forward, the PCI-X bus can be considered a 64-lane highway with a speed limit of 133 mph, capable of carrying roughly eight times the traffic in a given time period compared with the conventional PCI bus.
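The peak figures quoted above follow from multiplying the bus width in bytes by the clock rate. The short sketch below reproduces that arithmetic; the function name is illustrative only, and the calculation ignores protocol overhead, so it gives theoretical peaks rather than sustained rates.

```python
def peak_throughput_mb_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak in megabytes per second: bytes per clock times clocks per microsecond."""
    return (bus_width_bits / 8) * clock_mhz


# Conventional PCI: 32 bits wide at 33 MHz.
pci = peak_throughput_mb_per_s(32, 33)

# PCI-X at its widest and fastest setting: 64 bits wide at 133 MHz.
pci_x = peak_throughput_mb_per_s(64, 133)

# Ratio of the two theoretical peaks.
ratio = pci_x / pci

print(f"PCI peak:   {pci:.0f} MB/s")
print(f"PCI-X peak: {pci_x:.0f} MB/s")
print(f"ratio:      {ratio:.2f}")
```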
The PCI-X bus specification, however, provides a number of challenges for bus designers. Unlike conventional PCI bus architecture which does not define or distinguish the specific communications about the content or type of information exchanged between a host system and an I/O adapter, all operations on a PCI-X bus have a length associated with them. Thus, typically for a PCI bus, it is very common to allow the data to flow into a buffer and when a threshold mark is reached, to start emptying the buffer to the other bus. While the same approach may work in PCI-X, it is grossly inefficient.
In the PCI/PCI-X specification, an I/O adapter typically includes a set of memory locations, collectively called a register set or a command buffer and a response buffer, which are seen by the host processor as additional memory locations in its own memory space, i.e., the host system software "maps" these PCI/PCI-X I/O adapter memory locations into the totality of the host system memory regions that are accessible using processor memory load and store operations. Thus, the typical host processor performs memory store operations to PCI/PCI-X I/O adapter memory locations to transmit a command on the PCI/PCI-X bus to a command buffer and performs memory load operations from I/O adapter memory to retrieve a response of status information on the PCI/PCI-X bus from the I/O adapter. Unlike processor store or load operations directed to actual host system memory, processor store or load operations to PCI/PCI-X I/O adapter memory locations usually require more time and are considered very time-expensive with respect to the host processor.
In response to the command, the I/O adapter typically performs the requested operation and then generates a response message to inform the host system of the result and any errors that have occurred. This response message is typically stored in the I/O adapter's response message buffer, and these response messages are typically small when compared to transferring large amounts of data across the I/O bus. The size of the response messages varies, but typically they are less than 128 bytes and can be as small as four to eight bytes, depending upon the configuration of the operating system and memory. The host system then retrieves the response message and extracts protocol information from the retrieved response message to determine the I/O adapter's response to the command. More particularly, the PCI/PCI-X host system reads the response message from an address in a memory of the I/O adapter to retrieve the response message. One consequence of such a PCI/PCI-X system is that the host system processor experiences latency because it must store the command to the I/O adapter memory and then load response data from the I/O adapter memory.
The execution of I/O commands by an I/O adapter typically requires a time duration that is many thousands, or even millions, of host processor instruction cycles. Thus, while the I/O adapter is performing a command, the device driver and computer operating system normally perform other work and are not dedicated strictly to waiting for the I/O adapter to complete the command and forward the response message. Rather, the typical device driver and operating system rely upon an asynchronous event indication, such as a processor interrupt, to signal that the I/O adapter has completed the command and that the response message is available for the operating system and device driver to interpret.
The relative timing and frequency of the signals to interrupt the processor have significant effects on the overall utilization of the host processor, utilization of the I/O adapter and its data throughput capabilities, and overall system performance. Such utilization is also affected by I/O command latency, or the duration of an I/O operation as seen by the programs that depend upon that I/O operation to complete their functions. In a large high performance processor system, an I/O memory read across a conventional PCI/PCI-X bus may require a great many processor cycles, which seriously degrades the execution speed of a program depending upon that I/O memory read. More particularly, a high performance processor attempting to do a single memory read of a four-byte response from a PCI/PCI-X device may experience a latency to complete that memory read of several hundred or even several thousand processor cycles.
The PCI/PCI-X local bus specification utilizes a mechanism that potentially alleviates some of these inefficiencies resulting from I/O latencies. This mechanism sets target latencies which limit the time in which the master, i.e., host system, the bus arbitrator, and the target, i.e., I/O adapter, must wait for responses. In practice, the PCI/PCI-X bus has a minimum latency based on its clock frequency, which is currently up to 133 MHz, so there are still guaranteed minimum latencies of several microseconds. Furthermore, the maximum target latencies that the PCI/PCI-X standard would expect are typically on the order of many to several hundred microseconds. Potentially, for a slow I/O adapter that maximum latency could even realistically be upwards of a millisecond or even several milliseconds. The consequence to a high performance processor running with, for example, a seven nanosecond cycle time, is that, even at minimum expected latencies on a PCI/PCI-X bus, the processor is facing several hundred to several thousand cycles of time delay.
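The conversion from bus latency to stalled processor cycles is a simple division, sketched below. The seven nanosecond cycle time comes from the text; the specific latencies of 2, 10 and 100 microseconds are hypothetical sample points chosen to span "several microseconds" up to the lower end of the maximum target range, not figures from any specification.

```python
def stalled_cycles(latency_ns: float, cpu_cycle_ns: float) -> float:
    """Number of processor cycles that elapse while one bus operation is outstanding."""
    return latency_ns / cpu_cycle_ns


CPU_CYCLE_NS = 7.0  # from the text: a seven nanosecond processor cycle time

# Hypothetical sample latencies in microseconds (assumed values for illustration).
sample_latencies_us = [2, 10, 100]

cycles = {us: stalled_cycles(us * 1000, CPU_CYCLE_NS) for us in sample_latencies_us}

for us, n in cycles.items():
    print(f"{us:>4} us latency is about {n:,.0f} processor cycles")
```

Under these assumptions, even the shortest sample latency costs the processor a few hundred cycles, which is consistent with the range stated above.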
To optimize host processor utilization, conventional systems typically attempt to minimize the number of processor instruction cycles required to recognize the completion event and communicate this event to the I/O adapter device driver. To optimize I/O adapter throughput, conventional systems also attempt to minimize the time between the completion of one I/O command and the start of the next I/O command. To optimize overall system performance, in relation to programs that require I/O, conventional systems minimize the latency of an I/O operation, measured from the time the command is created until the time the response has been interpreted and the results are available to the program that caused or required the I/O, such as, for example, an "OPEN FILE" function that requires a disk read operation to get information about the location of the requested file.
To accomplish these objectives, conventional I/O protocols also employ both command and response queues located in the computer main memory, I/O adapter memory or registers, or a combination of both. Command queues enable the device driver to create new commands while the I/O adapter executes one such command. Response queues enable the I/O adapter to signal the completion of previous commands and proceed to new commands without waiting for the device driver or operating system to recognize and interpret the completion of these previous commands.
Similarly, computer systems generally include a processor interrupt mechanism which the I/O adapter uses to signal completion of a command and notify the host processor that a response message has been placed on the response queue. The interrupt mechanism provides a signal line from the I/O adapter to the processor that, when asserted, asynchronously interrupts the host processor and switches processor execution from its current program to an operating system or device driver program designed to interpret the interrupt event. While this interrupt mechanism can help optimize the latency associated with the completion of an I/O command and interpretation of the response message, switching the host processor execution from its current program to an interrupt program requires a processor context switch that requires many instruction cycles.
A context switch saves the current program's critical information such as selected processor registers and state information and loads the interrupt program's critical information. When the interrupt program completes its immediate work and is ready for the processor to resume the interrupted program, there is a second context switch to restore the critical information of the interrupted program which allows the processor to resume the interrupted program. Each context switch consumes valuable processor time. Because conventional systems interrupt the processor every time an I/O event has completed, context switches are relatively frequent and result in processor inefficiency.
Most host system PCI/PCI-X buses seek to increase the physical connections and possible I/O devices to the PCI/PCI-X bus to ensure higher utilization of the PCI/PCI-X bus while minimizing the cost of these connections, but it is impractical to provide many interrupt signals from every connection on the PCI/PCI-X bus. Thus, in practice many host systems either limit the number of PCI/PCI-X bus connections that can provide more than one interrupt signal, or connect all or some subset of the interrupt signals to a single interrupt signal to the host system. Still, multifunction I/O adapters require increased host processor expense to interrogate individual I/O adapter functions to determine the source(s) of a PCI/PCI-X interrupt from the physical connection.
Memory read operations to retrieve data from an I/O adapter or PCI/PCI-X bus hardware require many host processor cycles to retrieve the data because the host processor waits for the loading operation to complete. Memory write operations which write commands from the host processor to I/O adapters and PCI/PCI-X bus hardware are not initially expensive in terms of host processor cycles, but the write command may not complete immediately and must either be verified via a read operation from the same PCI/PCI-X memory location or a series of processor read operations to verify the hardware between the central system processor and the I/O adapter. Memory write operations that require verification are commonly referred to as "verified" write operations. Memory write operations that do not require verification and that may be re-issued without adverse system effects are referred to as "non-verified" write operations. Thus, to optimize the overall system performance and minimize processor utilization, it is necessary to balance expensive read operations from I/O adapters and also expensive "verified" write operations to I/O adapters.
Thus, the problem of any bus management scheme is how to manage the conflicting goals of maximum bandwidth, fairness of arbitration, and latency of data returns. Allowing data to be returned in the order it is received, while "fair", can penalize small data transfers with large latencies as the small transfer can get "stuck" behind a number of large, e.g., four kilobyte, transfers. If the four kilobyte transfers finish before allowing the small operations to proceed, which is "fair", the small operations wait a long time. Given that many small operations tend to be control type operations, e.g., fetching the next task to perform, and that latency which occurs while waiting on large transfers impacts how soon the I/O adapter can start its next task, such an arbitration scheme, albeit "fair", unnecessarily risks idling an I/O adapter. On the other hand, if a small operation is allowed to be presented to the bus as soon as it is available, larger operations would be continually preempted, resulting in poor bus utilization, as large operations utilize the bus more efficiently.
The PCI-X specification requires a host bridge to split complete reads to system memory that will not be met within the initial latency time, i.e., the requested data is not immediately available in the I/O adapter and so while the data is being returned, the PCI-X bus will fulfill other responses. This requirement turns a host bridge into a completer, or master, in order to initiate a transaction to the PCI-X device upon receiving the data from the host system memory. This differs significantly from earlier PCI specifications where the host bridge would simply wait for the device to repeat its request and terminate the transaction by deasserting FRAME/IRDY when its internal byte count has been satisfied. The problem for a host bridge in returning read data to the devices occurs when multiple devices exist under the host bridge. If a subset of the devices generates primarily short read requests while the remaining devices generate large read requests, either bandwidth or latency is impacted.
If a host bridge simply returns the read data in the order the requests were received off the bus, devices doing short transfers, which tend to be timing critical, could suffer extremely long latency delays as there could be multiple reads of four kilobytes or more in the FIFO ahead of the short read response. Alternatively, if latency is given priority and data is always returned as soon as it is received to minimize the latency, bandwidth on the bus will suffer as short data transfers are not as efficient as long data transfers.
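Both halves of this tradeoff can be illustrated with a toy model. The sketch below assumes a 64-bit bus moving eight bytes per clock and a hypothetical fixed overhead of four clocks per transaction; both constants, and all function names, are assumptions made for illustration and are not taken from the PCI-X specification.

```python
from math import ceil

BYTES_PER_CYCLE = 8   # assumption: 64-bit bus, one data phase per clock
OVERHEAD_CYCLES = 4   # hypothetical fixed per-transaction cost (arbitration, address phase)


def cost(size_bytes: int) -> int:
    """Bus clocks consumed by one response of the given size."""
    return ceil(size_bytes / BYTES_PER_CYCLE) + OVERHEAD_CYCLES


def efficiency(size_bytes: int) -> float:
    """Fraction of the consumed clocks that actually carry data."""
    data_cycles = ceil(size_bytes / BYTES_PER_CYCLE)
    return data_cycles / (data_cycles + OVERHEAD_CYCLES)


def completion_cycles(sizes: list[int]) -> list[int]:
    """Clock at which each response finishes when sent in list order."""
    finished, clock = [], 0
    for size in sizes:
        clock += cost(size)
        finished.append(clock)
    return finished


# One 8-byte read response queued behind three 4 KB responses (strict FIFO).
fifo_done = completion_cycles([4096, 4096, 4096, 8])

# The same responses with the short one returned first.
small_first_done = completion_cycles([8, 4096, 4096, 4096])

print("FIFO: short response done at clock", fifo_done[-1])
print("Short first: short response done at clock", small_first_done[0])
print("Efficiency of 8-byte transfer:", efficiency(8))
print("Efficiency of 4 KB transfer:  ", round(efficiency(4096), 3))
```

In this model the short response waits more than a thousand clocks under strict FIFO ordering, while the efficiency figures show why a bus carrying only short transfers would use its bandwidth poorly.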
The balance between maximizing the efficiency of a bus by allowing large data throughput and maximizing the efficiency of I/O devices waiting for a response or data from a host processor can be achieved by a method of selecting a data response to a command on a data transfer bus, with the method comprising the steps of selecting a data response to one of a plurality of commands, the selected data response waiting for the bus and being less than or equal to a threshold size; and then executing the selected data response on the bus. In a further embodiment, a number N of selected data responses may be allowed to accumulate and then the method will execute all N selected data responses on the bus before executing other data responses greater than the threshold size under normal arbitration on the bus. The bus may be a PCI/PCI-X bus. The maximum threshold size of the selected response may be what can be transferred in on the order of tens of bus cycles or fewer. In any event, the maximum threshold size is configurable to adapt to the bus speeds and processor speeds and particular applications, just as the number N of the selected data responses to be transferred is configurable.
The invention may further be realized by a method of selecting a data response to a command on a data transfer bus, comprising: selecting a data response to one of a plurality of commands, the selected data response waiting for the bus and being less than or equal to a programmable threshold size; allowing N number of selected data responses to accumulate; executing all of N selected data responses on the bus; and then executing other of the data responses greater than the threshold size under normal arbitration on the bus.
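One possible reading of the method just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `Response` and `drain` are hypothetical, "accumulate" is modeled as sending up to N waiting small responses back to back, and "normal arbitration" for the larger responses is stood in for by plain arrival order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    tag: str    # label used only to make the resulting order readable
    size: int   # size of the data response in bytes


def drain(pending: list[Response], threshold: int, n: int) -> list[Response]:
    """Return the order in which the pending responses would be placed on the bus.

    Responses no larger than `threshold` are given priority, at most `n` in a
    row; then one larger response is allowed through (arrival order is used
    here as a stand-in for normal bus arbitration).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    small = deque(r for r in pending if r.size <= threshold)
    large = deque(r for r in pending if r.size > threshold)
    order: list[Response] = []
    while small or large:
        # Send up to n small responses back to back.
        for _ in range(min(n, len(small))):
            order.append(small.popleft())
        # Then let one response above the threshold use the bus.
        if large:
            order.append(large.popleft())
    return order


pending = [
    Response("A", 4096), Response("b", 8), Response("C", 4096),
    Response("d", 16), Response("e", 4), Response("f", 128),
    Response("G", 2048),
]

# Programmable parameters: threshold of 128 bytes, N = 2 (both assumed values).
order = drain(pending, threshold=128, n=2)
print(" ".join(r.tag for r in order))
```

Both `threshold` and `n` are ordinary parameters here, mirroring the statement that the threshold size and the number N are configurable. Capping the run of small responses at N is what keeps the larger transfers from being preempted indefinitely.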
Objects and advantages of the invention may also be realized by an apparatus for transferring data, comprising: a host computer processor connected to a host memory on a host system bus; a host bridge connected to the host system bus, in which the host bridge comprises command queues to store commands, and buffers to store data associated with the commands, the data to be transferred according to the commands, and control logic to control the transfer of the data; and an I/O device connected to the host bridge on an I/O bus; wherein the data to be transferred either to/from the host computer processor from/to the I/O device is stored in the buffers and the control logic reviews the size of the data to be transferred and selects those of the commands associated with data less than a threshold size for transfer of the data. The data may be transferred on the host system bus. Alternatively, the data may be transferred on the I/O bus. The I/O bus may be a PCI-X bus.
The invention may further be considered an apparatus for transferring data on a bus in an information handling system, comprising: means to store at least one command associated with the transfer of data; means to store the command and the data to be transferred; means to evaluate the size of the data to be transferred; means to give priority to at least one command in which the data to be transferred is less than a threshold size and to execute the prioritized command and transfer the data on the bus before executing any other command to transfer data having a size greater than the threshold.
Further scope of applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.