This invention relates generally to computer systems and more specifically to a method for improving the performance of external device access.
As it is known in the art, multi-processor computer systems are designed to accommodate a number of central processing units, coupled via a common system bus or switch to a memory and a number of external Input/Output devices. The purpose of providing multiple central processing units is to increase the performance of operations by sharing tasks between the processors. Such an arrangement allows the computer to simultaneously support a number of different applications while supporting I/O devices that are communicating over a network and displaying images on attached display devices.
To enhance performance, all of the devices coupled to the bus must communicate efficiently. Idle cycles on the system bus represent time periods in which an application is not being supported, and therefore represent reduced performance.
A number of situations arise in multi-processor computer system design in which the bus, although not idle, is not being used efficiently by the processors coupled to the bus. Some of these situations arise due to the differing nature of the devices that are coupled to the bus. For example, central processing units typically include cache logic for temporary storage of data from the memory. A coherency protocol is implemented to ensure that each central processor unit only retrieves the most up-to-date version of data from the cache. Therefore, central processing units are commonly referred to as 'cacheable' devices.
However, external Input/Output (I/O) devices are non-cacheable devices. They typically do not implement the same cache coherency protocol that is used by the CPUs, although measures must also be taken to ensure that they only retrieve valid data for their operations. Typically, I/O devices retrieve data from memory, or from a cacheable device, via a Direct Memory Access (DMA) operation, in which data is retrieved in a large block. Typically, I/O devices also store data to memory via DMA; when the block of data to be stored is smaller than a cache block, the bridge in the coherent domain reads the block, modifies portions of the data, and then writes it back to memory via a DMA as a large block. One mechanism used to ensure coherency is to place a 'lock' on the data block that is used by the I/O device. When a lock is placed on a data block, other cacheable devices in the system do not have access to that data block for the duration of the lock period. If the I/O device is only updating a portion of the block, then restricting the other cacheable devices from using that block results in unnecessary delay that reduces performance. Thus it would be desirable to provide a method for allowing communication between CPUs and I/O devices at increased performance levels.
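The partial-block store described above can be illustrated with a minimal sketch. It is not taken from the specification: the 64-byte block size and the function name `merge_partial_write` are assumptions chosen only for illustration, and no locking or coherency handling is modeled.

```python
BLOCK_SIZE = 64  # assumed cache block size in bytes (not stated in the text)


def merge_partial_write(block: bytes, offset: int, data: bytes) -> bytes:
    """Overlay `data` onto a copy of `block` at `offset` (read, modify, write back)."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("block must be exactly BLOCK_SIZE bytes")
    if offset < 0 or offset + len(data) > BLOCK_SIZE:
        raise ValueError("write does not fit inside the block")
    merged = bytearray(block)                  # read: start from the current block
    merged[offset:offset + len(data)] = data   # modify: change only the written bytes
    return bytes(merged)                       # write back: the whole block is stored
```

Only the bytes covered by the I/O write change, yet the full block is read and written, which is why locking the entire block for the duration can delay other devices unnecessarily.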
Similarly, situations may arise in which one I/O device seeks to communicate with other I/O devices coupled to the system. For example, a graphics device or a network device may require data that is stored on a disk. If that device is coupled to the same I/O bus as the original device, then the transfer may be performed by straightforward transfer between the devices over the I/O bus.
However, in large multi-processor systems there is typically more than one I/O bus coupled to the system to accommodate more I/O devices. When an I/O device wants to communicate with an I/O device on another bus, the communication must be accomplished via a system bus transfer. Typically, in such a situation, the I/O device issues a DMA transaction to the system, which stores the data in system memory temporarily. Then one of the CPUs issues an I/O write to transfer the contents of the system memory to the I/O device on the second I/O bus. Such an arrangement utilizes system bus bandwidth and CPU compute cycles in an undesirable manner.
A further performance problem arises as a result of system interrupts. Interrupts are a mechanism used by the system to indicate to the CPU that an event has occurred that requires attention or repair. Typically, interrupts are used to indicate to the CPU that a transaction has completed, that a service has been requested or, on rare occasions, that a hard or soft error has occurred at the I/O device. In addition, interrupts can be used to mark an occurrence of an event, such as the end of a time interval. When the interrupt event occurs, an interrupt signal is forwarded to the CPU. At the end of an instruction sequence, if the interrupt signal is asserted, the CPU will halt execution of further instructions and service the interrupt.
Usually there are a number of interrupt event conditions, and each of the conditions is saved as one bit of an interrupt vector that is stored in an interrupt register. The occurrence of an interrupt event causes a signal to be asserted, and the signal assertion is logged in the appropriate location of the interrupt register. The interrupt signal is monitored by the CPU to determine which interrupts have occurred and their priority relative to the active process executing on the CPU.
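As an illustration of an interrupt register holding one bit per condition, the following sketch models the register as an integer. The function names and the bit numbering are assumptions made for this example; the text does not specify a register width or a priority scheme, so none is modeled.

```python
from typing import List


def log_interrupt(register: int, condition: int) -> int:
    """Return the register with the bit for `condition` set."""
    return register | (1 << condition)


def clear_interrupt(register: int, condition: int) -> int:
    """Return the register with the bit for `condition` cleared."""
    return register & ~(1 << condition)


def pending_conditions(register: int) -> List[int]:
    """List the condition numbers whose bits are currently set, lowest first."""
    return [bit for bit in range(register.bit_length()) if register & (1 << bit)]
```

A CPU examining such a register would scan the set bits to decide which conditions need service.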
If the interrupt is associated with the CPU, the interrupt register is readily available for examination and determination of the proper interrupt handling process. However, if the interrupt is associated with an I/O device the interrupt register is stored at the I/O device. The I/O device issues an interrupt signal to the I/O interface, which stores an interrupt status bit for each device. The CPU must periodically examine the interrupt status register of the I/O interfaces to determine which device had an interrupt. The CPU then fetches the interrupt vector from the indicated I/O device and handles the interrupt. This process for determining interrupt conditions suffers performance disadvantages because valuable compute cycles are wasted while the CPU fetches the interrupt vector.
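The polling sequence described above can be sketched as follows. The class and function names (`IODevice`, `IOInterface`, `poll`) and the count of remote reads are hypothetical, introduced only to show that the CPU makes one access to the interface status register plus one further access per interrupting device.

```python
class IODevice:
    def __init__(self) -> None:
        self.interrupt_vector = 0  # interrupt register held at the device


class IOInterface:
    def __init__(self, devices) -> None:
        self.devices = list(devices)
        self.status = 0  # one interrupt status bit per attached device

    def raise_interrupt(self, index: int, condition: int) -> None:
        # The device logs the condition locally and signals the interface.
        self.devices[index].interrupt_vector |= 1 << condition
        self.status |= 1 << index


def poll(interface: IOInterface):
    """CPU-side poll: return (device index, vector) pairs and the number of remote reads."""
    remote_reads = 1  # reading the interface status register
    found = []
    for index, device in enumerate(interface.devices):
        if interface.status & (1 << index):
            found.append((index, device.interrupt_vector))  # fetch vector from device
            remote_reads += 1
    return found, remote_reads
```

Each vector fetch in this model stands for CPU cycles spent waiting on an I/O access, which is the cost the paragraph above identifies.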
Accordingly, it can be seen that a number of situations may arise during the operation of a multi-processor computer system that decrease the efficiency of the system bus. Therefore it would be desirable to provide a method or apparatus that would increase multi-processor performance through improved utilization of system bus bandwidth.
According to one aspect of the invention, a method for communicating between at least one non-cacheable device and a multi-processor computer system is described. The multi-processor computer system includes a first memory and a plurality of cacheable devices coupled by a bus or switch, where the cacheable devices are capable of temporarily storing and modifying data from the memory. The non-cacheable devices are also coupled to the bus. To provide communication between a non-cacheable device and the multi-processor computer system, the following steps are performed: the non-cacheable device issues, on the bus, a request for write access to data from the memory. The cacheable devices monitor the bus, and each checks the request to determine whether it has stored the latest version of the requested data. If one of the cacheable devices determines that it is storing the latest version of the requested data, that cacheable device issues an indicating signal to the devices coupled to the bus. In response to the indicating signal, the non-cacheable device issues a read/modify command. If none of the cacheable devices determines that it is storing the latest version of the data, the non-cacheable device simply completes its access to memory.
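The steps above can be sketched in simplified form. This is a toy model under stated assumptions, not the claimed implementation: the names `CacheableDevice`, `snoop` and `io_write` are hypothetical, memory is modeled as a dictionary of blocks, and invalidating the owning cache's copy after the read/modify is an assumed policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CacheableDevice:
    name: str
    # address -> latest version of a block held only in this cache
    latest: Dict[int, bytes] = field(default_factory=dict)

    def snoop(self, address: int) -> bool:
        """Check a bus request; True models asserting the indicating signal."""
        return address in self.latest


def io_write(memory: Dict[int, bytes], caches: List[CacheableDevice],
             address: int, offset: int, data: bytes) -> str:
    """Model a write request from a non-cacheable device; return the path taken."""
    if offset < 0 or offset + len(data) > len(memory[address]):
        raise ValueError("write does not fit inside the block")
    owners = [cache for cache in caches if cache.snoop(address)]
    if owners:
        # Indicating signal seen: read the latest block, modify it, write it back.
        block = bytearray(owners[0].latest.pop(address))  # assumed invalidation
        path = "read/modify"
    else:
        # No cache holds a newer copy: complete the access against memory.
        block = bytearray(memory[address])
        path = "direct"
    block[offset:offset + len(data)] = data
    memory[address] = bytes(block)
    return path
```

In this model the snoop check reuses the caches' own lookup, so no separate lock on the block is needed to obtain the latest data.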
With such an arrangement, existing cache coherency logic may be used to increase efficiency of write operations by non-cacheable devices. The integrity of the data block is ensured by the atomicity of the I/O operation and an appropriate invalidation protocol. As a result, performance is increased for partial cache line writes by I/O devices.