1. Technical Field
The present invention relates generally to data processing systems and in particular to input/output (I/O) mechanisms of a data processing system. Still more particularly, the present invention relates to a method and system for providing fully pipelined I/O Direct Memory Access (DMA) write operations via utilization of a DMA Exclusive cache state.
2. Description of the Related Art
A standard data processing system comprises one or more central processing units (CPUs), one or more levels of caches, one or more memories, and input/output (I/O) mechanisms, all interconnected via an interconnect. Traditionally, the interconnects utilized consisted primarily of a system bus and an I/O bus. In newer processing systems, however, particularly those with large numbers of CPUs and distributed memory, a switch is often utilized as the interconnecting mechanism.
In addition to the major components, data processing systems today are often equipped with an I/O controller, which controls I/O operations for the various I/O devices. More than one I/O controller may be utilized, each supporting particular I/O devices via an I/O channel, and the I/O controllers may be coupled to the interconnect via an I/O bus. Further, new processing systems typically comprise a plurality of paths (buses) for routing transactions between the I/O controller and the memory or distributed memory. Each path includes a series of latches and may have a different transmission latency depending on the distance to or from the memory and the number of latches traversed. Data is transmitted along these paths in a packet-like manner, and each data packet may have a different access latency. Thus, in operation, data A written to a first memory or memory location may have a different access latency than data B written to a second memory or memory location if data A travels on a different path than data B.
Computer systems typically provide at least one system bus and a system memory area that is predominantly used by one or more processors for computation and data manipulation. I/O is sometimes performed by the processor. However, utilization of the CPU to perform input/output (I/O) transfers for these peripheral devices and subsystems places a burden on the CPU and negatively affects the CPU's efficiency. Thus, Direct Memory Access (DMA) controllers have been provided in computer systems for off-loading transaction work from the CPU to a dedicated controller, in order to increase the availability of the CPU to perform computational and other tasks.
Each DMA operation is a specialized processor operation that transfers data between memory and I/O devices. The DMA controller operates as a master on the I/O bus and is frequently a part of the I/O controller. When the I/O controller completes the DMA task, the I/O controller signals (i.e., sends an interrupt to) the processor that the specified task is complete.
The DMA controllers free the processor from I/O tasks and usually perform transfers more efficiently. DMA I/O transfers can also be performed by the devices themselves. This type of device is referred to as a "bus master" because it is capable of acquiring a bus and transferring data directly to and from memory or devices located on the bus.
The application software or device driver performs data communication with the device by writing or reading the data to or from memory and signaling the device or DMA controller to perform the transfer. A DMA transfer can also be performed from one device to another device using two discrete DMA transfers, one writing to memory, i.e., a DMA Write, and the second reading from memory, i.e., a DMA Read. With a DMA Write, the data is transferred from the input device and written to system memory, either by a DMA controller or by the input device itself if it is a bus master.
The I/O channels provide input and output commands to and from peripheral components, respectively. Standard logical operation of current processing systems requires that operations to memory be completed in the order in which they are received (i.e., sequential program order). Thus, the I/O channels operate as First In First Out (FIFO) devices because the I/O writes to system memory from a device must be "ordered" to the system memory. That is, for example, when an I/O DMA Write command of a 128 Byte cache line A is sequentially followed by an I/O DMA Write command of a 4 Byte cache line B, the write of cache line A has to be completed (i.e., data A written) before the write of cache line B can begin execution. The write data B request is placed in the FIFO queue at the I/O controller and waits on the receipt of a completion signal from the write data A operation. The processor begins execution of the write data B command only after receipt of that completion signal.
FIG. 2A illustrates a sample timing diagram by which the writes of data A and data B are completed according to the prior art. As shown, DMA Write A 201 is issued at time 0 (measured in clock cycles) and a corresponding snoop response 203 is generated and received several cycles later. When the clean snoop response 203 is received, often after several retries of DMA Write A 201, the acquisition and transmission of data A to the memory block is undertaken over the next few cycles. Then, the actual writing (storage) of data A 205 is completed over several cycles. Following the completion of the write data A 205, an acknowledgment 207 is sent to the processor to indicate the completion of the write data A operation. Once the acknowledgment 207 is received, the DMA Write B data 209 commences and takes several cycles to complete (see snoop response 211 and B data to storage 213). Data B is then stored in memory. Since no operation is issued to the I/O bus while the DMA Write data A operation is completing, the bus remains idle for several cycles and write data B 209 is held in the FIFO queue.
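The serialized timing described for FIG. 2A can be sketched as a small model. This is an illustration only: the cycle counts and the names `DmaWrite` and `serialized_schedule` are hypothetical and are not taken from the figure, which gives no exact cycle values.

```python
from dataclasses import dataclass


@dataclass
class DmaWrite:
    name: str
    snoop_cycles: int  # cycles until a clean snoop response arrives (assumed value)
    store_cycles: int  # cycles to transmit and store the data (assumed value)
    ack_cycles: int    # cycles for the completion acknowledgment (assumed value)


def serialized_schedule(writes):
    """Return (name, issue_cycle, ack_cycle) tuples for writes issued strictly in order."""
    now = 0
    schedule = []
    for w in writes:
        issue = now
        done = issue + w.snoop_cycles + w.store_cycles + w.ack_cycles
        schedule.append((w.name, issue, done))
        # The next write stays in the FIFO queue until this acknowledgment arrives.
        now = done
    return schedule


# Data A is the large write, data B the small one.
writes = [DmaWrite("A", 4, 8, 2), DmaWrite("B", 4, 2, 2)]
schedule = serialized_schedule(writes)
for name, issue, done in schedule:
    print(f"DMA Write {name}: issued at cycle {issue}, acknowledged at cycle {done}")
```

With these assumed values, write B cannot issue until cycle 14, even though its own work takes only 8 cycles; the bus carries no new operation while write A completes.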
Once the write A command is issued, the processor waits for the return of a tag or interrupt generated by the successful completion of the write data A operation. The return of the tag or interrupt indicates that data A storage to memory is completed, and the CPU can then issue the write data B command.
The logical structure of processing systems requires that I/O operations be ordered in the I/O channel. Thus, the I/O channel must write the data to memory "in-order" and also must wait until the successful completion of the previous operation before issuing the next operation. This waiting/polling is required because, as in the above example, if write B is issued prior to the completion of write A in current systems, write B would be completed before write A because of the smaller size of data B. This would then cause corruption of data and the corrupted data would propagate throughout the execution of the application resulting in incorrect results being generated and/or possibly a stall in the processor's execution.
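The ordering hazard described above can be illustrated with a short sketch. The latencies are assumed values chosen only so that the smaller write B takes fewer cycles than write A; `completion_order` is a hypothetical helper, not part of any described system.

```python
def completion_order(writes):
    """writes: list of (name, issue_cycle, latency_cycles).

    Returns the write names sorted by the cycle at which each one completes.
    """
    finished = sorted(writes, key=lambda w: w[1] + w[2])
    return [name for name, _, _ in finished]


# Write B is issued one cycle after write A, without waiting for A to complete.
unordered = [("A", 0, 14), ("B", 1, 8)]
print(completion_order(unordered))
```

Here write B completes at cycle 9 and write A at cycle 14, so memory is updated in the opposite order from the one in which the writes were received.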
The long latency in completing some write operations, particularly those for large data such as data A, coupled with the requirement that the next operation cannot begin until after the completion of the previous write operation, significantly reduces overall processor efficiency. The present architectural and operational guidelines for processing systems, which require that order be maintained when completing operations, are proving to be a significant hurdle in the development of more efficient I/O mechanisms. Currently, system developers are looking for ways to streamline the write process for I/O operations. Pipelining, for example, one of the key implementation techniques utilized to make CPUs faster, has not been successfully extended to I/O transactions because of the requirement that the previous data operation be completed prior to the next operation beginning. Current DMA transactions operate as single threaded transactions (or in a serialized manner), and there is currently no known way to extend the benefits of pipelining to DMA operations. One method suggested to reduce the latency is to move the I/O controllers closer to the I/O device, thereby reducing the transmission time for acquisition of the data on the bus. However, because most of the latency in I/O transactions is tied to the wait-for-completion requirement and not the actual transmission of the data, such methods do not solve the problem of long latencies for I/O DMA operations.
The present invention recognizes that it would be desirable to provide a method, system and I/O processor coherency protocol that enables pipelining of I/O DMA Write operations. A method, system, and processor logic that enables a cache state that allows pipelining of serially provided DMA Writes in an I/O subsystem would be a welcome improvement. These and other benefits are provided by the invention described herein.
Disclosed is a method and data processing system that provides pipelining of Input/Output (I/O) DMA Write transactions. An I/O processor's operational protocol is provided with a pair of instructions/commands that are utilized to complete a DMA Write operation. The instructions are DMA_Write_No_Data and DMA_Write_With_Data. DMA_Write_No_Data is an address-only operation on the system bus that is utilized to acquire ownership of a cache line that is to be written. The ownership of the cache line is marked by a weak DMA ownership state, which indicates that the cache line is being held for writing to the memory, but that the cache line cannot force a retry of snooped operations. When all preceding DMA Write operations complete or each corresponding DMA_Write_No_Data operation has acquired the cache line exclusively for the DMA operation, then the weak DMA ownership state is changed to a DMA Exclusive state. The DMA Exclusive state causes a retry of snooped operations until the write transaction to memory is completed. In this way, DMA Writes that are provided sequentially may be issued so that their respective operations occur in a parallel manner on the system bus and their corresponding DMA_Write_No_Data operations may be completed in any order, but cannot be made DMA Exclusive unless the above conditions are satisfied.
Further, once a DMA Exclusive state is acquired, a DMA_Write_With_Data may be issued for each of the sequential DMA Write operations in the DMA Exclusive state. The DMA_Write_With_Data operations may then be completed out-of-order with respect to each other. However, the system processor is sent the completion messages of each DMA_Write_With_Data operation in the sequential order in which the DMA Write operations were received, thus adhering to the I/O processor's requirements for ordered operations, while providing fully-pipelined (parallel) execution of the DMA transactions.
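The two-phase flow described above can be sketched as a small software model. It is one reading of the summary, not the disclosed hardware logic: the class `Pipeline`, its method names, and the `PENDING` and `DONE` bookkeeping states are hypothetical, and the promotion rule assumes that a line in weak ownership becomes DMA Exclusive once every earlier write in the sequence is either DMA Exclusive or already written.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"          # DMA_Write_No_Data has not yet acquired the line
    D1 = "weak DMA ownership"
    D2 = "DMA Exclusive"
    DONE = "written to memory"


class Pipeline:
    def __init__(self, names):
        self.order = list(names)  # order in which the DMA Writes were received
        self.state = {n: State.PENDING for n in self.order}
        self.reported = []        # completion messages, in delivery order

    def acquire(self, name):
        """DMA_Write_No_Data succeeded: the line is held in weak ownership."""
        self.state[name] = State.D1
        self._promote()

    def _promote(self):
        """Promote weak ownership to DMA Exclusive when all earlier writes allow it."""
        for i, n in enumerate(self.order):
            earlier_ok = all(
                self.state[p] in (State.D2, State.DONE) for p in self.order[:i]
            )
            if self.state[n] is State.D1 and earlier_ok:
                self.state[n] = State.D2

    def write_data(self, name):
        """DMA_Write_With_Data completed; only allowed from the DMA Exclusive state."""
        if self.state[name] is not State.D2:
            raise RuntimeError(f"{name} is not DMA Exclusive")
        self.state[name] = State.DONE
        self._report()

    def _report(self):
        """Deliver completion messages strictly in the original sequence order."""
        for n in self.order[len(self.reported):]:
            if self.state[n] is not State.DONE:
                break
            self.reported.append(n)


p = Pipeline(["A", "B", "C"])
p.acquire("B")         # acquired out of order; stays in weak ownership
p.acquire("C")
p.acquire("A")         # A becomes exclusive, which lets B and C follow
p.write_data("C")      # data writes may complete out of order
p.write_data("B")
early = list(p.reported)
p.write_data("A")
print(early, p.reported)
```

In this run no completion message is delivered while write A is outstanding, and once A is written the messages for A, B, and C are delivered in the order the writes were received.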
According to a preferred embodiment, weak DMA ownership is indicated by an affiliated cache state (D1). Likewise, DMA Exclusive is also indicated by an affiliated cache state (D2). A cache line transitions from D1 to D2 once DMA Exclusive ownership is acquired by the requesting process. After the cache line is written to memory, the D2 state transitions to either the MESI Invalid or the MESI Exclusive state, depending on the system's operational requirements.
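The state transitions of the preferred embodiment can be written out as a lookup table. The state names D1, D2, I (Invalid) and E (Exclusive) come from the text; the event names and the helpers `next_state` and `forces_retry` are hypothetical labels introduced only for this sketch.

```python
# Event names are assumptions; the text names only the states themselves.
TRANSITIONS = {
    ("D1", "exclusive_acquired"): "D2",  # weak ownership becomes DMA Exclusive
    ("D2", "written_release"): "I",      # line written, ends in MESI Invalid
    ("D2", "written_retain"): "E",       # line written, ends in MESI Exclusive
}


def next_state(state, event):
    """Return the next cache state, or raise if the table has no such transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event}") from None


def forces_retry(state):
    """Only DMA Exclusive (D2) forces a retry of snooped operations."""
    return state == "D2"


line = "D1"
print(line, forces_retry(line))
line = next_state(line, "exclusive_acquired")
print(line, forces_retry(line))
line = next_state(line, "written_release")
print(line, forces_retry(line))
```

Whether a written line ends in Invalid or Exclusive is left to the system's operational requirements, so the sketch models that choice as two separate events rather than picking one.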
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.