1. Technical Field
The invention relates to fetching data and instructions from a hierarchical memory in which portions of the data are stored in a main memory and are transferred to a faster intermediate memory between a requester and the slower main memory, and more particularly to retrieving selected data items from the slower main memory into a cache or buffer, that is, the intermediate memory, before the requester requests the particular item of selected and prefetched data. A further aspect of the invention is an interface architecture that couples two or more buses to one another through a bridge including functions for controlling bridge operations and prefetching data.
2. Description of Related Art
It is frequently necessary to transfer large amounts of data across a data bus by a read action. Many times the protocol of the data bus or data channel limits the maximum size of the data chunk transferred to a size that is less than the amount of data needed by the requesting agent. Other latencies are introduced by, for example, the processes of requesting the data, locating the data, and making the data available for movement across the data bus or data channel, as well as fairness for servicing multiple data requests.
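The effect of a capped transfer size can be illustrated with a minimal sketch. The function names, the fixed per-read overhead, and the figures used are illustrative assumptions rather than values taken from any bus specification; the model simply charges a fixed overhead for every read transaction plus the time to move the bytes.

```python
def reads_needed(total_bytes: int, max_transfer_bytes: int) -> int:
    """Number of read transactions when each one is capped in size."""
    if total_bytes < 0 or max_transfer_bytes <= 0:
        raise ValueError("sizes must be non-negative / positive")
    # Integer ceiling division: a partial final chunk still costs one read.
    return (total_bytes + max_transfer_bytes - 1) // max_transfer_bytes


def transfer_time_us(total_bytes: int, max_transfer_bytes: int,
                     per_read_overhead_us: float, bytes_per_us: float) -> float:
    """Assumed model: fixed overhead per read plus raw data movement time."""
    n_reads = reads_needed(total_bytes, max_transfer_bytes)
    return n_reads * per_read_overhead_us + total_bytes / bytes_per_us


if __name__ == "__main__":
    # Hypothetical figures: 4096 bytes wanted, 1 us overhead per read.
    for cap in (512, 4096):
        print(cap, reads_needed(4096, cap),
              transfer_time_us(4096, cap, 1.0, 512.0))
```

Under these assumed figures, a 512-byte cap needs eight reads where a 4096-byte cap needs one, so the per-read overhead is paid eight times for the same amount of data.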
Input/output (I/O) processors typically read data from main memory in multiple byte blocks. Accessing these multiple byte blocks of data from memory is usually slower than the speed of the requester, causing the requester to wait for the data. This is the case where a plurality of remote agents request data from the same memory through the same memory controller; the requests are intercepted by the memory controller and sent by the memory controller to the memory as requests for packets of data. The requesting and packetization, as well as the queuing of the requests and packets, introduce latency.
Beyond the speed of execution of individual steps in a memory operation (arising from, for example, device level issues), a significant component of latency is the number of memory fetches to get a data chunk from main memory to a data requester. For example, memory reads and fetches may occur through a Fibre Channel interface across a peripheral component interconnect (PCI) or peripheral component interconnect—extended (PCI-X) type bus.
The PCI system is an interconnection system between a microprocessor and attached devices in which expansion slots are spaced closely for high speed operation. A newer version of the PCI interconnect is the PCI-X interconnect. This is a computer bus technology (the “data pipes” between parts of a computer) that increases the rate at which data can move within a computer from 66 megahertz (MHz) to 266 MHz, for example through a PCI double data rate (PCI-DDR) connection. Specifically, PCI-X interfaces increase the performance of high bandwidth devices such as Gigabit Ethernet cards, Fibre Channel, Ultra3 Small Computer System Interface, and processors that are interconnected as a cluster.
Fibre Channel is a point-to-point, switched, and loop interface between servers and clustered storage devices, and, depending on the type, is faster than Small Computer System Interface (SCSI). It is designed to interoperate with SCSI, the Internet Protocol (IP) and other protocols. Standards for Fibre Channel are specified by the Fibre Channel Physical and Signaling standard, and the American National Standards Institute (ANSI) X3.230-1994, and International Standards Organization (ISO) 14165-1 standards.
The Fibre Channel adapter reads the main memory where an associated bridge serves the read request. A bridge is a hardware device that is used to connect different protocols or subsystems so that they can exchange data. Bridges can work with networks, devices, and subsystems that use different wiring or network protocols, joining two or more local area network (LAN) segments to form what appears to be a single network. Bridges are also used to connect I/O chassis to increase a computer's I/O capability.
The bridge acts like an initiator on one side (typically the SCSI side) and a target on the opposite side. The targets are selected by mapping the appropriate SCSI values into the target field and correlating a Fibre Channel logical unit number (LUN) value to a Bus:Target:LUN value. A LUN is a unique identifier used on a SCSI bus to differentiate between a plurality of separate devices (each of which is a logical unit). Each LUN identifies a specific logical unit, which may be an end user, a file, or an application program. The bridge hardware resides on a PCI or PCI-X card.
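The correlation of a Fibre Channel LUN value to a Bus:Target:LUN value can be pictured as a lookup table held by the bridge. The sketch below is only an illustration: the `LunMap` and `ScsiAddress` names and the table contents are hypothetical and do not describe any particular bridge product.

```python
from typing import Dict, NamedTuple


class ScsiAddress(NamedTuple):
    """A Bus:Target:LUN triple on the SCSI side of the bridge."""
    bus: int
    target: int
    lun: int

    def __str__(self) -> str:
        return f"{self.bus}:{self.target}:{self.lun}"


class LunMap:
    """Hypothetical table correlating Fibre Channel LUNs to SCSI addresses."""

    def __init__(self) -> None:
        self._table: Dict[int, ScsiAddress] = {}

    def add(self, fc_lun: int, bus: int, target: int, lun: int) -> None:
        if fc_lun in self._table:
            raise ValueError(f"Fibre Channel LUN {fc_lun} is already mapped")
        self._table[fc_lun] = ScsiAddress(bus, target, lun)

    def resolve(self, fc_lun: int) -> ScsiAddress:
        # Raises KeyError for a Fibre Channel LUN that has no mapping.
        return self._table[fc_lun]


if __name__ == "__main__":
    lun_map = LunMap()
    lun_map.add(fc_lun=0, bus=0, target=3, lun=0)  # illustrative entries
    lun_map.add(fc_lun=1, bus=0, target=3, lun=1)
    print(lun_map.resolve(1))
```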
A critical latency issue arises because of bandwidth limitations in the PCI bus. Main memory has to be read inefficiently in many small chunks rather than efficiently in fewer, larger chunks, which results in many small reads of contiguous memory. Because memory may be serving multiple I/O adapters, it is important that it be used efficiently: overall throughput suffers if a read is done but the data is not used.
Moreover, many software programs do not exhibit classical locality of reference behavior and/or the data sets they operate upon are larger than the cache size. As a result, cache misses increase and cache hits decrease. This illustrates one problem with traditional cache memories. Prior art cache memories are dependent on the temporal and spatial locality of data. As a result, the locality based cache memory paradigm often fails to work effectively in memory-access patterns that are lacking in conventional spatial or temporal locality. This, in turn, significantly reduces the performance of the requester. This problem is observed in large-scale scientific and technical computing where memory access is not strictly local but tends to be made in sequence to arrayed data with little data reused. This problem is also observed in many large business systems such as credit card processing or supply chain management, where memory requests are sequential.
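The dependence of a conventional cache on data reuse can be shown with a small simulation. This is a minimal sketch assuming a least-recently-used (LRU) replacement policy and treating every access as a reference to one distinct cache line; the capacity and the access patterns are illustrative choices, not measurements.

```python
from collections import OrderedDict
from typing import Sequence


def lru_hit_rate(accesses: Sequence[int], capacity: int) -> float:
    """Fraction of accesses served from an LRU cache of `capacity` lines."""
    cache: "OrderedDict[int, None]" = OrderedDict()
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses) if accesses else 0.0


if __name__ == "__main__":
    streaming = list(range(1000))              # sequential, no line reused
    reusing = [i % 16 for i in range(1000)]    # small working set, heavy reuse
    print(lru_hit_rate(streaming, 64))
    print(lru_hit_rate(reusing, 64))
```

In this model the streaming pattern never hits, because no line is referenced twice, while the small working set misses only on its first 16 references.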
One suggested solution is software prefetching. To reduce the cache “miss” rates, some computer systems utilize prefetch algorithms. When the requester reads data, the data associated with the successive addresses is also fetched and stored in cache. For example, if the requester requests addresses A0–A7, addresses A8–A15 will also be fetched from memory. The prefetch algorithm increases the “hit” rate of the subsequent read request from the requester. Software prefetching has been used to transfer data from main memory to a cache memory in advance of a memory call. However, when list access is made to a data array, and in the case of programs written in an object-oriented language, the software frequently fails to properly insert the prefetch instruction. This is true even if the memory-access pattern is sequential.
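The A0–A7 and A8–A15 example can be sketched as a next-block prefetch. The class name, the block size of eight addresses, and the unbounded cache are assumptions made only to keep the illustration short.

```python
BLOCK = 8  # addresses per block, matching the A0-A7 / A8-A15 example


class NextBlockPrefetchCache:
    """Unbounded cache that optionally fetches the following block on a read."""

    def __init__(self, prefetch: bool = True) -> None:
        self.blocks = set()
        self.prefetch = prefetch
        self.hits = 0
        self.misses = 0

    def read(self, addr: int) -> None:
        block = addr // BLOCK
        if block in self.blocks:
            self.hits += 1
        else:
            self.misses += 1
            self.blocks.add(block)       # demand fetch of the missing block
        if self.prefetch:
            self.blocks.add(block + 1)   # also fetch the successive block


if __name__ == "__main__":
    for enabled in (False, True):
        cache = NextBlockPrefetchCache(prefetch=enabled)
        for addr in range(32):           # sequential reads of A0..A31
            cache.read(addr)
        print(enabled, cache.hits, cache.misses)
```

For 32 sequential reads, this model records four misses without prefetching (one per block) and a single miss with prefetching, since each block after the first is already present when it is first referenced.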
Another alternative is hardware prefetch. Hardware prefetch includes one or both of: (i) making a hardware prefetch of a data stream which has already been prefetched once, or (ii) making a hardware prefetch if the difference between the address of the past memory access and the present memory access falls into a prescribed range.
In the case of a hardware prefetch of a data stream which has already been prefetched once, the hardware prefetch is ineffective for data streams which have yet to be prefetched. In the case of a hardware prefetch where the difference between the address of the past memory access and the present memory access falls into a prescribed range, the address of data to be prefetched is generated by adding the interval of the address to the present access address. However, this hardware prefetch often fails to eliminate the latency in data transfer from the main memory to the cache memory.
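The second scheme, in which the prefetch address is formed by adding the address interval to the present access address, can be sketched as follows. The function name and the `max_stride` threshold standing in for the "prescribed range" are illustrative assumptions.

```python
from typing import List, Optional, Sequence


def stride_prefetch_addresses(accesses: Sequence[int],
                              max_stride: int) -> List[Optional[int]]:
    """For each access, return the address to prefetch, or None.

    A prefetch is issued only when the difference between the previous and
    the present access address is non-zero and within `max_stride`.
    """
    targets: List[Optional[int]] = []
    previous: Optional[int] = None
    for addr in accesses:
        target = None
        if previous is not None:
            stride = addr - previous
            if stride != 0 and abs(stride) <= max_stride:
                target = addr + stride   # present address plus the interval
        targets.append(target)
        previous = addr
    return targets


if __name__ == "__main__":
    print(stride_prefetch_addresses([100, 108, 116, 500, 508], max_stride=64))
```

In this sketch the first access, and the first access after a jump outside the range, trigger no prefetch, so the start of every new stream still pays the full main memory latency.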
As described above, because instructions are scheduled in a requester with a built-in cache memory based on an assumption that the latency of the cache memory is short, processing performance falls significantly if a cache miss occurs. Cache misses often occur in sequential memory-access patterns.
Thus, a clear need exists for an intelligent bus or bus bridge with memory and logic, where the intelligence eliminates the many small reads of contiguous memory by reading a larger chunk of contiguous memory in a single read and storing the data read in cache memory associated with the intelligent bus or bridge as prefetched data.
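The kind of behavior called for here can be pictured with a simple read-ahead buffer. This is a minimal sketch under stated assumptions, not a description of any particular bridge: the `ReadAheadBridge` name, the single-buffer design, and the sizes are hypothetical, and main memory is modeled as a Python `bytes` object.

```python
class ReadAheadBridge:
    """Serves small reads from one buffer filled by larger main memory reads."""

    def __init__(self, memory: bytes, fetch_size: int) -> None:
        if fetch_size <= 0:
            raise ValueError("fetch_size must be positive")
        self.memory = memory
        self.fetch_size = fetch_size
        self.buf_start = 0
        self.buf = b""
        self.memory_reads = 0   # counts reads that reach main memory

    def read(self, addr: int, length: int) -> bytes:
        if addr < 0 or length < 0:
            raise ValueError("addr and length must be non-negative")
        out = bytearray()
        while length > 0:
            in_buffer = self.buf_start <= addr < self.buf_start + len(self.buf)
            if not in_buffer:
                # Miss: one large contiguous read replaces many small ones.
                self.buf_start = addr
                self.buf = self.memory[addr:addr + self.fetch_size]
                self.memory_reads += 1
                if not self.buf:
                    break       # request runs past the end of memory
            offset = addr - self.buf_start
            piece = self.buf[offset:offset + length]
            out += piece
            addr += len(piece)
            length -= len(piece)
        return bytes(out)


if __name__ == "__main__":
    memory = bytes(range(256)) * 16                  # 4096 bytes
    bridge = ReadAheadBridge(memory, fetch_size=1024)
    data = b"".join(bridge.read(a, 128) for a in range(0, 4096, 128))
    print(data == memory, bridge.memory_reads)
```

With these illustrative sizes, thirty-two 128-byte requests are satisfied by four reads of main memory; setting `fetch_size` to 128 instead makes every request reach main memory.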
A still further need exists to reduce both the latency associated with slower device physics and with the more numerous and slower process steps in the main memory, and the latency associated with the additional process steps in accessing the main memory from the bridge.
A still further need exists for a method and an apparatus in the data bus or channel, for example, a bridge device or subsystem, to interact with the data bus or data channel, and at the source of the data, to prefetch the data and to make the prefetched data ready for transfer as a function of past requests for data.