1. Field of the Invention
The present invention is generally related to an input/output (I/O) controller in a computer system, and in particular to a method for efficiently allocating direct memory access (DMA)-read slots in the I/O controller.
2. Description of Related Art
Multi-core processors increase pressure on the memory subsystem: they issue more simultaneous requests to the memory controller, which necessitates deeper queues and in turn leads to higher latencies. I/O controller initiated data-push is one option for reducing latency in the I/O path of future systems. In such systems, dedicated hardware may be present in the processing units and the I/O controller to push payload data into I/O devices. The latency created by going back and forth over an external bus such as HyperTransport or GX may thus be reduced.
At the same time, the effects of the data origin of direct memory access (DMA)-read data fetches in the processor interconnect are becoming increasingly important. However, in current systems the I/O devices cannot take the architecture of the processor interconnect, in terms of resources and latencies, into account, because they lack the necessary information on the data origin of DMA-read data fetches. The I/O devices also need to be independent of the processor architecture in order to be usable on different architectures and to support standardized external bus protocols such as Peripheral Component Interconnect (PCI).
Currently, I/O controllers use the same techniques as processing units for handling DMA-read data fetches in snooping-based and directory-based cache coherent systems, because they need to interface with the processor interconnect. However, the I/O devices' requirements for executing DMA-read requests differ substantially from those of the processing units. For example, the processing units require in-order execution of read requests and make extensive use of inexact pre-fetching. Although these requirements increase cache hit rates, they often cause considerable overhead on the processor interconnect and increase latency for DMA-read requests. I/O related DMA-read requests, on the other hand, can initiate exact pre-fetching, and they do not strictly need to be executed in order. Where possible, re-ordering can be used to optimize latency and external bus bandwidth.
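The latency effect of re-ordering can be illustrated with a small sketch. It is not taken from any known controller: the function name `delivery_times`, the single shared external bus, the fixed `transfer_time`, and the sample ready times are all assumptions chosen for illustration.

```python
def delivery_times(ready_times, transfer_time=1, in_order=True):
    """Return, per request, the time its data finishes crossing the external bus.

    ready_times[i] is the (hypothetical) time at which the cache line for
    request i has been fetched and buffered. The bus carries one cache line
    at a time and needs transfer_time per line (assumed model).
    """
    indices = range(len(ready_times))
    if in_order:
        order = list(indices)
    else:
        # Re-ordered: serve whichever cache line became ready first.
        order = sorted(indices, key=lambda i: ready_times[i])

    done = [0] * len(ready_times)
    bus_free = 0
    for i in order:
        start = max(bus_free, ready_times[i])  # wait for both the bus and the data
        bus_free = start + transfer_time
        done[i] = bus_free
    return done


# Request 0 is slow to fetch (for example, the data sits in a remote cache);
# requests 1 and 2 are fast.
ready = [10, 2, 2]
print(delivery_times(ready, in_order=True))   # [11, 12, 13]
print(delivery_times(ready, in_order=False))  # [11, 3, 4]
```

In this toy model the slow request finishes at the same time either way, but the two fast requests no longer wait behind it, so the external bus is used while the slow fetch is still outstanding.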
The I/O controller serves as a bridge between the processor interconnect and the external bus. Today, many I/O controllers use a 1-to-1 request-to-slot mapping for handling DMA-read requests. The I/O controller features a number of DMA-read slots, each of which is responsible for handling one cache line (CL) on the internal processor interconnect. The processor interconnect may, for example, be a snooping-based interconnect or a directory-based interconnect. I/O devices interface with the DMA-read slots of the I/O controller either through a low-level protocol on an external bus, such as GX or HyperTransport, or through an intermediary such as a PCI-Host Bridge.
When DMA-read data is requested, the I/O device or the intermediary issues a request and is granted a credit for fetching the data of one cache line associated with a DMA-read slot. The I/O device or the intermediary has no knowledge of where the requested data is located in the processor interconnect, and thus cannot optimize the use of the external bus by re-ordering the execution of the requests. Conversely, the processor interconnect has no knowledge of whether further requests for consecutive cache lines will follow in the near future; taking this into account could help reduce the access latencies caused by repeated coherency policy enforcement.
The known I/O controller uses a 1-to-1 mapping scheme between the DMA-read requests and the DMA-read slots. The DMA-read slots in the I/O controller are connected to the processor interconnect and to an arbitration unit. When a DMA-read request is submitted to the I/O controller, a DMA-read slot is directly connected to the issuer of the request, i.e., the I/O device or an intermediary device, and sends a request to the processor interconnect to fetch the data. The requested data is fetched using the processor interconnect and buffered in the DMA-read slot. When multiple DMA-read slots have data ready for transfer, the arbitration unit determines the order in which the data is transferred on the external bus. The arbitration unit may either be directly connected with an I/O device through a low-level protocol such as a GX or HyperTransport bus protocol, or be connected with an intermediary device such as a PCI-Host Bridge.
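The 1-to-1 scheme described above can be sketched as a small software model. This is only an illustration under stated assumptions, not a description of any actual controller: the class names `DmaReadSlot` and `KnownIoController`, the `memory` callable standing in for the processor interconnect, and the round-robin arbitration policy are all hypothetical, since the text does not specify how the arbitration unit chooses among slots.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class DmaReadSlot:
    """One DMA-read slot: bound to one request, buffers one cache line."""
    slot_id: int
    requester: Optional[str] = None  # I/O device or intermediary bound to the slot
    address: Optional[int] = None    # address of the requested cache line
    data: Optional[bytes] = None     # buffered cache line once fetched

    @property
    def free(self) -> bool:
        return self.requester is None


class KnownIoController:
    """Toy model of the 1-to-1 request-to-slot mapping (hypothetical names)."""

    def __init__(self, num_slots: int, memory: Callable[[int], bytes]):
        self.slots = [DmaReadSlot(i) for i in range(num_slots)]
        self.memory = memory  # stand-in for a fetch over the processor interconnect
        self._next = 0        # round-robin pointer (assumed arbitration policy)

    def submit(self, requester: str, address: int) -> Optional[int]:
        """Bind the request to one free slot; None means no credit is available."""
        for slot in self.slots:
            if slot.free:
                slot.requester, slot.address = requester, address
                return slot.slot_id
        return None

    def fetch_all(self) -> None:
        """Each bound slot fetches its own cache line, independently of the others."""
        for slot in self.slots:
            if not slot.free and slot.data is None:
                slot.data = self.memory(slot.address)

    def arbitrate(self) -> Optional[Tuple[str, int, bytes]]:
        """Pick the next slot with buffered data, hand it to the bus, free the slot."""
        n = len(self.slots)
        for offset in range(n):
            slot = self.slots[(self._next + offset) % n]
            if slot.data is not None:
                result = (slot.requester, slot.address, slot.data)
                slot.requester = slot.address = slot.data = None
                self._next = (slot.slot_id + 1) % n
                return result
        return None
```

Note that nothing in this model lets a slot learn where its data originates, or lets the controller learn that two requests target consecutive cache lines; each slot is handled in isolation.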
For each request, the system provides one credit that can fetch one cache line. In the architecture of the known systems, the I/O devices do not have any knowledge about the would-be origin of the requested data, i.e., whether the data is in the memory or in the cache of a processing unit or in a victim cache. Similarly, the requestor does not provide any information to the I/O controller about whether its requests would require consecutive cache lines.
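Because one credit fetches exactly one cache line, the number of separate requests a transfer needs follows from simple arithmetic. The sketch below assumes a 128-byte cache line; the text does not state a line size, and the function name `credits_needed` is hypothetical.

```python
CACHE_LINE_BYTES = 128  # assumed cache-line size; not specified in the text


def credits_needed(start_address: int, length: int,
                   line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Number of one-cache-line credits needed to read `length` bytes."""
    if length <= 0:
        return 0
    first_line = start_address // line_bytes
    last_line = (start_address + length - 1) // line_bytes
    return last_line - first_line + 1


print(credits_needed(0, 4096))  # 32
print(credits_needed(100, 64))  # 2 (the range straddles a cache-line boundary)
```

Under this assumption, a 4096-byte read of consecutive cache lines becomes 32 independent requests, and the I/O controller receives no indication that they belong together.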