The present invention covers a pipelined floating-point load instruction which may be implemented within the bus control unit of a microprocessor. The microprocessor utilized with the present invention is the Intel 860™ microprocessor, frequently referred to as the N10™ processor. (Intel is a registered trademark of Intel Corporation.)
The N10 processor is a 32/64-bit IEEE-compatible floating-point processor, a 32-bit RISC integer processor and a 64-bit 3-dimensional graphics processor. Using a numerics processor optimized for both vector and scalar operations, it represents the industry's first integrated high-performance vector processor, incorporating over one million transistors and providing about half the performance of the Cray-1, all on a single chip.
All processors have some form of load instruction that can access information from either an external memory or an internal data cache. Access to externally stored data is usually made via an external data bus controlled by the internal logic of the processor. The rationale for using a data cache is to provide efficient access to frequently used information, thereby accelerating processing speed. In processors that utilize a data cache, normal load instructions operate most efficiently if the data is resident in the on-chip cache. That is, if the data is not in the cache, there is a performance penalty when accessing the data.
Typically, when external data is referenced using a normal load instruction it is stored in the cache. The reason for this is that, under normal conditions, data which has just been referenced is very likely to be referenced again in the near future. The data access penalty is minimized by keeping the most frequently accessed information in the internal data cache while reserving external memory for seldom referenced or seldom reused information. It is the principle of locality which makes the data cache a useful tool, since programs tend to reference the same data repeatedly within a short span of time.
A problem arises, however, when a processor is required to deal with very large data structures or, in any event, data structures that are much bigger than the data cache can normally hold. As an illustration of the difficulty that can arise, a processor is often required to perform a variety of floating-point operations, such as matrix inversion, multiplication, etc., which require manipulation of huge data matrices. In prior art processors, when the data is not in the on-chip data cache, the processor must freeze execution and request the data from external memory. During the time that execution is frozen, the processor is prevented from issuing any new addresses to memory. In other words, the processor must wait for the data for the first operation to arrive from external memory before continuing its operations. This type of access to external memory can take six clock cycles or longer. Thus, a substantial delay is introduced into the processing speed of the system when frequent access to external memory is mandated by the size of the data structures involved.
Another problem related to the handling of large data structures arises when the externally accessed data is brought into the processor. As external data is delivered to the processor, it is written into the cache, usually replacing previously resident data. However, it should be remembered that some external data (most commonly in the case of large data structures) is infrequently referenced information, i.e., it is not expected to be reused, while the replaced data in the cache is information that is very likely to be referenced repeatedly in the near future. Therefore, the processor is discarding data that needs to be reused in favor of data that will in all likelihood only be referenced once. As a consequence, an inordinate amount of time is spent reloading the replaced cache data. This increased access time is another reason why prior art processors run at a much slower rate than is achieved by the present invention.
As will be seen, the present invention implements a pipeline structure which is capable of processing memory operations at a much faster rate (essentially at the full bus bandwidth) without any delay of waiting for the processor to generate the next address. By using this pipelined structure, the processor associated with the present invention can continue issuing addresses without having to wait for the arrival of the data from external memory. This capability enhances the presently described microprocessor when compared to prior art processors.
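The benefit of continuing to issue addresses while earlier loads are still outstanding can be illustrated with a simple timing model. The following is a minimal sketch under stated assumptions, not a description of the N10 bus control unit: the function name `total_cycles`, the parameters `depth` (number of loads that may be outstanding at once) and `interval` (cycles between successive address issues), and the fixed six-cycle latency are all hypothetical values chosen for illustration.

```python
def total_cycles(n_loads, latency, depth, interval=1):
    """Cycles until the last of n_loads external loads has returned its data.

    Hypothetical model: every load takes `latency` cycles from address
    issue to data arrival, a new address can be issued every `interval`
    cycles, and at most `depth` loads may be outstanding at once.
    depth=1 models a processor that freezes on every miss.
    """
    issue = []  # cycle at which each load's address is issued
    done = []   # cycle at which each load's data arrives
    for i in range(n_loads):
        # Earliest issue time allowed by the address-issue rate.
        t = 0 if i == 0 else issue[i - 1] + interval
        # If the pipeline is full, wait for the oldest outstanding load.
        if i >= depth:
            t = max(t, done[i - depth])
        issue.append(t)
        done.append(t + latency)
    return done[-1] if done else 0


if __name__ == "__main__":
    n, latency = 100, 6  # six cycles is an illustrative miss latency
    for depth in (1, 3, 6):
        print(f"depth={depth}: {total_cycles(n, latency, depth)} cycles")
```

With a depth of one, each load pays the full latency in turn. Once enough loads can be outstanding to cover the latency, the total approaches one load per issue interval plus a single latency at the start.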
To achieve this performance, the present invention provides a pipelined floating-point load instruction to rapidly access data stored in external memory. This pipelined floating-point load instruction, which is more easily referred to as "PFLoad" or "PFld", may be used by a programmer to access data which is stored either in the on-chip data cache or in an external memory system. The instruction is optimized for the situation in which the data is not already residing within the processor's internal data cache. This situation is referred to as a "cache miss" or, phrased alternatively, a "PFLoad miss". The opposite case, in which the data to be loaded is already stored within the data cache (called a "cache hit"), is also handled by the present invention.
Additionally, the PFLoad instruction of the present invention does not replace data already resident within the data cache, but rather directs the newly accessed data to a storage location within the floating-point unit of the processor. The PFLoad instruction will be discussed in conjunction with its current implementation in the bus control unit of the N10 processor.
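The effect of a load that does not replace resident cache data can be illustrated with a small simulation. The sketch below is a hypothetical model, not the N10 data cache: the `LRUCache` class, its least-recently-used replacement policy, the cache capacity, the address values, and the `allocate` flag are all assumptions made for illustration. A small, frequently reused working set is accessed repeatedly while a long stream of once-only data passes through, first with loads that allocate in the cache and then with loads that bypass it.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative cache with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> present, oldest first

    def access(self, addr, allocate=True):
        """Return True on a hit. On a miss, allocate only if requested."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        if allocate:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[addr] = True
        return False


def hot_set_misses(stream_allocates, capacity=8, hot=4,
                   stream_len=64, rounds=10):
    """Count misses on the reused working set across all rounds."""
    cache = LRUCache(capacity)
    misses = 0
    next_stream_addr = 1000  # streamed data never overlaps the hot set
    for _ in range(rounds):
        for addr in range(hot):  # frequently reused data, normal loads
            if not cache.access(addr):
                misses += 1
        for _ in range(stream_len):  # large structure, touched only once
            cache.access(next_stream_addr, allocate=stream_allocates)
            next_stream_addr += 1
    return misses


if __name__ == "__main__":
    print("streaming loads allocate:", hot_set_misses(True))
    print("streaming loads bypass:  ", hot_set_misses(False))
```

In this model, allocating loads evict the working set on every pass, so it must be reloaded each round, whereas bypassing loads leave it resident after the first round.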