1. Field
The present disclosure relates to computer processors (also commonly referred to as CPUs).
2. State of the Art
A computer processor (and the program which it executes) needs places to put data for later reference. A computer processor design will typically have many such places, each with its own trade-off of capacity, speed of access, and cost. Usually these are arranged in a hierarchical manner referred to as the memory system of the processor, with small, fast, costly places used for short-lived small data and large, slow, cheap places used for what does not fit in the small, fast, costly places. The memory system typically includes the following components, arranged in order of decreasing speed of access:
- a register file or other form of fast operand storage;
- one or more levels of cache memory (one or more levels of the cache memory can be integrated with the processor (on-chip cache) or separate from the processor (off-chip cache));
- main memory (or physical memory), which is typically implemented by DRAM memory and/or NVRAM memory and/or ROM memory;
- controller card memory; and
- on-line mass storage (typically implemented by one or more hard disk drives).
In many computer processors, the main memory of the memory system can take several hundred machine cycles to access. The cache memory, which is much smaller and more expensive but faster to access than the main memory, is used to keep copies of data that reside in the main memory. If a reference finds the desired data in the cache (a cache hit), it can be accessed in a few machine cycles; if it does not (a cache miss), the access can take several hundred machine cycles. Because a program typically has nothing else to do while waiting to access data in memory, using a cache and making sure that desired data is copied into the cache can provide significant improvements in performance.
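The benefit described above can be quantified with the standard average-memory-access-time calculation. The following is a minimal sketch; the specific cycle counts and hit rate are illustrative assumptions chosen to match the "few cycles" versus "several hundred cycles" figures above, not values from this disclosure.

```python
# Illustrative average memory access time (AMAT) calculation.
# All numbers below are assumed round figures for illustration.
CACHE_HIT_CYCLES = 3        # a cache hit: "a few machine cycles"
MISS_PENALTY_CYCLES = 300   # a cache miss: "several hundred machine cycles"
MISS_RATE = 0.05            # assumed fraction of references that miss

# With a cache: most references hit; only misses pay the full penalty.
amat_with_cache = CACHE_HIT_CYCLES + MISS_RATE * MISS_PENALTY_CYCLES

# Without a cache: every reference goes all the way to main memory.
amat_without_cache = MISS_PENALTY_CYCLES

# On these assumptions the cache cuts the average access cost from
# hundreds of cycles to under twenty.
print(amat_with_cache, amat_without_cache)
```

Even a modest miss rate leaves the average access time dominated by the hit latency, which is why keeping desired data in the cache matters so much.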
In computer processors, operations have an inherent hardware-determined time required for their execution, which is referred to as execution latency. For most operations (such as an Add operation), the execution latency is fixed in terms of machine cycles. For some operations, the execution latency may vary from execution to execution depending on details of the argument operands or the state of the machine.
The issue cycle of an operation (the machine cycle when the operation begins execution) precedes the retire cycle (the machine cycle when the execution of the operation has completed and its results are available, and/or any machine consequences must become visible). In the retire cycle, the results can be written back to operand storage (e.g., the register file) or otherwise made available to functional units of the processor. The number of machine cycles between the desired issue and retire cycles is the schedule latency of the operation. Note that schedule latency is defined in terms of the order of execution desired by the program, whether or not the desired schedule can actually be achieved by a particular operation execution. That is, the execution latency may not equal the schedule latency.
For operations of fixed execution latency, it is convenient to simply define the schedule latency to be equal to the execution latency. If such an operation is placed in an instruction issued in some machine cycle, then the results of the operation will be available naturally during the retire cycle, a number of machine cycles later corresponding to the execution latency of the operation, and consumers of those results can then be issued. This scheduling strategy is called static scheduling with exposed pipeline, and is common in stream and signal processors.
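Static scheduling of this kind can be sketched in a few lines: because each operation's latency is fixed, the compiler can compute every retire cycle at compile time and place each consumer no earlier than its producers' retire cycles. The operation names and latencies below are illustrative assumptions, not part of this disclosure.

```python
# A minimal sketch of static scheduling with an exposed pipeline.
# Latencies (in machine cycles) are assumed, illustrative values.
LATENCY = {"load": 3, "add": 1, "mul": 2}

# Each entry: (result name, operation, names of producing operations).
program = [
    ("a", "load", []),
    ("b", "load", []),
    ("c", "mul", ["a", "b"]),   # consumes a and b
    ("d", "add", ["c", "a"]),   # consumes c and a
]

retire_cycle = {}
for name, op, deps in program:
    # Earliest legal issue cycle: all producers must have retired.
    issue = max((retire_cycle[d] for d in deps), default=0)
    # With a fixed execution latency, schedule latency equals
    # execution latency, so the retire cycle is known statically.
    retire_cycle[name] = issue + LATENCY[op]

print(retire_cycle)  # {'a': 3, 'b': 3, 'c': 5, 'd': 6}
```

The entire schedule is determined before execution begins; the hardware needs no machinery to track readiness, which is what makes the exposed-pipeline approach economical.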
It can be difficult to statically schedule operations whose execution latency varies from execution to execution. Commonly such operations have a known minimum execution latency if all goes well, but if certain run-time events occur then the operation is delayed and cannot complete until later. Thus a load operation may complete three machine cycles after issue if the desired data are found in the top level cache, but may take hundreds of machine cycles if the data must be fetched from DRAM memory. This problem is known as a load stall, and such load stalls were the major driver for the development of out-of-order superscalar architectures. Such superscalars issue loads as soon as the address is known, as far in advance of the code that will use the loaded value as possible; the read then takes as long as it takes. While waiting for the data, a superscalar machine schedules and executes a dynamic number of other operations that are ready to execute and don't depend on the awaited value. Such a machine doesn't have a fixed number of delay slots, but has in essence a run time determined variable number of slots, as many as are needed for the data to load. Thus, a superscalar machine does not stall unless it completely runs out of operations that don't depend on the loaded value. A superscalar can have hundreds of operations in flight waiting to complete, and many operations that are waiting for their data. The cost is extreme complexity and a chip that is spendthrift in power and area.
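The dynamic behavior described above can be illustrated with a toy simulation: one load misses and takes hundreds of cycles, and the machine issues whatever independent operations are ready in the meantime, stalling only the consumer of the loaded value. The operations, latencies, and the simple issue model below are illustrative assumptions, not a description of any particular superscalar design.

```python
# A toy sketch of dynamic (out-of-order) issue around a load miss.
# Each op: name -> (assumed latency in cycles, producer dependencies).
OPS = {
    "ld":  (300, []),       # load miss: hundreds of cycles (assumed)
    "i1":  (1, []),         # work independent of the load
    "i2":  (1, []),
    "i3":  (1, ["i1"]),
    "use": (1, ["ld"]),     # consumer of the loaded value
}

retired = {}    # name -> cycle in which the op retired
in_flight = {}  # name -> cycle in which the op will retire
issued = set()
cycle = 0
while len(retired) < len(OPS):
    # Retire anything whose execution completes this cycle.
    for name, rc in list(in_flight.items()):
        if rc <= cycle:
            retired[name] = rc
            del in_flight[name]
    # Issue every op whose producers have retired: no fixed delay
    # slots, just as many cycles as the data actually takes to arrive.
    for name, (lat, deps) in OPS.items():
        if name not in issued and all(d in retired for d in deps):
            in_flight[name] = cycle + lat
            issued.add(name)
    cycle += 1

# Independent ops i1-i3 retire in cycles 1-2, long before the load;
# only "use" waits for the full miss latency.
print(retired)
```

Only the true consumer of the loaded value waits; everything else drains out of the window while the miss is outstanding, which is the behavior the bookkeeping hardware of a superscalar exists to provide.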
Computer processors that employ static scheduling with an exposed pipeline are much simpler and much more economical of power and area than superscalar architectures. However, any actual stalls are much more painful, because there may be operations that are ready to execute (and that a superscalar would execute) but that cannot be issued, because the lock-step nature of an in-order machine forces them to wait for an irrelevant load to complete. Because of this difficulty, static scheduling has come to be used only for embedded applications in which the variability of memory reference latencies is bounded and small. General-purpose applications, where the variability is large, have come to use dynamically scheduled architectures which mask the variability by executing operations out of program order as soon as their arguments become available.