1. Technical Field
The present invention relates in general to data processing and, in at least one aspect, to input/output (I/O) communication by a data processing system.
2. Description of the Related Art
In a conventional data processing system, input/output (I/O) communication is typically facilitated by a memory-mapped I/O adapter that is coupled to the processing unit(s) of the data processing system by one or more internal buses. For example, FIG. 1 illustrates a prior art Symmetric Multiprocessor (SMP) data processing system 8 including a Peripheral Component Interconnect (PCI) I/O adapter 50 that supports I/O communication with a remote computer 60 via an Ethernet communication link 52.
As illustrated, prior art SMP data processing system 8 includes multiple processing units 10 coupled for communication by an SMP system bus 11. SMP system bus may include, for example, an 8-byte wide address bus and a 16-byte wide data bus and may operate at 500 MHz. Each processing unit 10 includes a processor core 14 and a cache hierarchy 16, and communicates with an associated memory controller (MC) 18 for an external system memory 12 via a high speed (e.g., 533 MHz) private memory bus 20. Processing units 10 are typically fabricated utilizing advanced, custom integrated circuit (IC) technology and may operate at processor clock frequencies of 2 GHz or more.
Communication between processing units 10 is fully cache coherent. That is, the cache hierarchy 16 within each processing unit 10 employs the conventional Modified, Exclusive, Shared, Invalid (MESI) protocol or a variant thereof to track how current each cached memory granule accessed by that processing unit 10 is with respect to corresponding memory granules within other processing units 10 and/or system memory 12.
Coupled to SMP system bus 11 is mezzanine I/O bus controller 30, and optionally, one or more additional mezzanine bus controllers 32. Mezzanine I/O bus controller 30 (and each other mezzanine bus controller 32) interfaces a respective mezzanine bus 40 to SMP system bus 11 for communication. In a typical implementation, mezzanine bus 40 is much narrower, and operates at a lower frequency than SMP system bus 11. For example, mezzanine bus 40 may be 8 bytes wide (with multiplexed address and data) and may operate at 200 MHz.
As shown, mezzanine bus 40 supports the attachment of a number of 10 channel controllers (IOCCs), including Microchannel Architecture (MCA) IOCC 42, PCI Express (3GIO) IOCC 44, and PCI IOCC 46. Each of IOCCs 42–46 is coupled to a respective bus 47–49 that provides slots to support the connection of a fixed maximum number of devices. In the case of PCI IOCC 46, the attached devices includes a PCI I/O adapter 50 that supports communication with network 54 and remote computer 60 via an I/O communication link 52.
It should be noted that I/O data and “local” data within data processing system 8 belong to different coherency domains. That is, while cache hierarchies 16 of processing units 10 employ the conventional MESI protocol or a variant thereof to maintain coherency, data granules cached within mezzanine I/O bus controller 30 for transfer to remote computer 60 are usually stored in either Shared state, or if a data granule is subsequently modified within data processing system 8, Invalid state. In most systems, no Exclusive, Modified or similar exclusive states are supported within data processing system 8 for I/O data. In addition, all incoming I/O data transfers are store-through operations, rather than read-before-write (e.g., read-with-intent-to-modify (RWITM) and DCLAIM) operations, as are employed by processing units 10 to modify data.
With the general hardware implementation described above, a typical method by which SMP data processing system 8 transmits data over I/O communication link 52 can be described as a three-part operation in which an application process, the operating environment software (e.g., the OS and associated device drivers), and the I/O adapter (and other hardware) each perform a part.
At any given time, the processing units 10 of SMP data processing system 8 typically execute a large number of application processes concurrently. In the most simple case, when one of these processes needs to transmit data from system memory 12 to remote computer 60 via I/O channel 52, the process first must contend with other processes to obtain a lock for I/O adapter 50. Depending upon the reliability of the intended transmission protocol and other factors, the process may also have to obtain one or more locks for the data granule(s) to be transmitted in order to ensure that the data granules are not modified by another process prior to transmission.
Once the process has obtained a lock for I/O adapter 50 (and possibly lock(s) for the data granules to be transmitted), the process makes one or more calls to the operating system (OS) via the OS socket interface. These socket interface calls include requests for the operating system to initialize a socket, bind a socket to a port address, indicate readiness to accept a connection, send and/or receive data, and close a socket. In these socket calls, the calling process generally specifies the protocol to be utilized (e.g., TCP, UDP, etc.), a method of addressing, a base effective address (EA) of the data granules to be transmitted, data size, and a foreign address indicating a destination memory location within remote computer 60.
Turning now to the operating environment software, the OS, following boot, performs various operations to create resources for I/O communication, including allocating an I/O address space separate from the virtual (or effective) address space employed internally by processing units 10 and creating a Translation Control Entry (TCE) table 24 in system memory 12. TCE table 24 supports Direct Memory Access (DMA) services utilized to perform I/O communication by providing TCEs that translate between I/O addresses generated by I/O devices and RAs within system memory 12.
Following creation of these and other resources, the OS responds to the socket interface calls of various processes by providing services supporting I/O communication. For example, the OS first translates the EA contained in a socket interface call into a real address (RA) and then determines a page of PCI I/O address space to map to the RA, for example, by hashing the RA. In addition, the OS dynamically updates TCE table 24 in system memory 12 to support DMA services utilized to perform the requested I/O communication. Of course, if no TCE within TCE table 24 is currently available for use, the OS must either victimize a TCE from TCE table 24 and inform the affected process that its DMA has been terminated, or alternatively, request the process to release the needed TCE.
In most data processing systems, the OS then creates a Command Control Block (CCB) 22 in memory 12 that specifies the parameters of the data transfer by I/O adapter 50. For example, CCB 22 may contain one or more PCI address space addresses specifying locations within system memory 12, a data size associated with each such address, and a foreign address of a CCB within remote computer 60. Following establishment of a TCE and CCB 22 for the data transfer, the OS returns the base address of CCB 22 to the calling process. Depending upon the protocol employed, the OS may also provide additional data processing services (e.g., by encapsulating the data with headers, providing flow control, etc.).
In response to receipt of the base address of CCB 22, the process initiates data transfer from system memory 12 to remote computer 60 by writing a register within PCI I/O adapter 50 with the base address of CCB 22. In response to this invocation, PCI I/O adapter 50 performs a DMA read of CCB 22 utilizing the base address written in its register by the calling process. (In some simple systems, address translation is not required for the DMA read of CCB 22 since CCB 22 resides in a non-translated address region; however, in higher end server class systems, address translation is typically performed for the DMA read of CCB 22). Adapter 50 then reads CCB 22 and issues a DMA read operation targeting the base PCI address space address (which was read from CCB 22) of the first data granule to be transmitted to remote computer 60.
In response to receipt of the DMA read operation from PCI adapter 50, PCI IOCC 46 accesses its internal TCE cache to locate a translation for the specified target address. In response to a TCE cache miss, PCI IOCC 46 performs a read of TCE table 24 to obtain the relevant TCE. Once PCI IOCC 46 obtains the needed TCE, PCI IOCC 46 translates the PCI address space address specified within the DMA read operation into a RA by reference to the TCE, performs a DMA read of system memory 12, and returns the requested I/O data to PCI I/O adapter 50. After possible further processing by PCI I/O adapter 50 (e.g., to satisfy the requirements of the link-layer protocol), PCI I/O adapter 50 transmits the data granule over I/O communication link 52 and network 54 to remote computer 60 together with a foreign address of a CCB within remote computer 60 that controls storage of the data granule in the system memory of remote computer 60.
The foregoing process of DMA read operations and data transmission continues until PCI I/O adapter 50 has transmitted all data specified within CCB 22. PCI I/O adapter 50 thereafter asserts an interrupt to signify that the data transfer is complete. As understood by those skilled in the art, the assertion of an interrupt by PCI I/O adapter 50 triggers a context switch and execution of a first-level interrupt handler (FLIH) by one of processing units 10. The FLIH then reads a system interrupt control register (e.g., within mezzanine I/O bus controller 30) to determine that the interrupt originated from PCI IOCC 46, reads the interrupt control register of PCI IOCC 46 to determine that the interrupt was generated by PCI I/O adapter 50, and then calls the second-level interrupt handler (SLIH) of PCI I/O adapter 50 to read the interrupt control register of PCI I/O adapter 50 to determine which of possibly multiple DMAs completed. The FLIH then sets a polling flag to indicate to the calling process that the I/O data transfer is complete.