1. Field of the Invention
The present invention relates to an apparatus and method for loading data values, and in particular to a technique for loading data values from a memory system so that they are subsequently available for use by a data processing unit.
2. Description of the Prior Art
A known data processing apparatus, for example a processor core, may comprise a data processing unit operable to execute instructions, and a register file having a plurality of registers for storing data values accessible by the data processing unit when executing those instructions. It will be appreciated that the data values may take a variety of forms, and by way of example may take the form of 32-bit data words, with each register of the register file being able to hold one such data word.
A memory system accessible by the data processing apparatus will typically be provided for storing data values, with data values required by the data processing unit being loaded from the memory system into the register file from where they can be accessed by the data processing unit. Subsequent to manipulation by the data processing unit, data values are typically stored from the register file back to the memory system, thereby freeing up space within the register file for subsequent data values to be loaded.
It will be appreciated that when executing a typical program on the data processing unit, a significant number of such load and store operations will need to be performed. The time taken to load data values from the memory system into the register file can have a very significant impact on the performance of the data processing apparatus. In an attempt to reduce the time taken to load data from the memory system, also referred to herein as the load latency, it is known to arrange the memory system in a hierarchical manner, such that the memory system consists of a number of layers of memory. In such an arrangement, there will typically be at least one layer which can hold only a relatively small number of data values, but which can be accessed relatively quickly by the data processing apparatus, with the memory system also including at least one layer, also referred to herein as bulk memory, which is significantly larger and hence can store more data values, but which can only be accessed relatively slowly.
In a typical arrangement, when the data processing unit of the data processing apparatus issues a memory access request to the memory system, that request will first be received by a layer of the memory system which is relatively small but can be accessed quickly. For the purposes of the following description, that layer will be referred to as the layer one level of the memory system, and typically is implemented by a cache. If the requested data value is present in that layer one cache, then it can be returned to the data processing apparatus relatively quickly. However, in the event that the data value is not present in the layer one cache, then the access request will need to be propagated to one or more lower levels of the memory system in order to identify and retrieve the required data value, with the resultant increase in time taken to return that data value to the data processing apparatus. Typically as the data value is returned to the data processing apparatus, it will also be stored within the layer one cache, such a process being referred to as a linefill process.
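By way of illustration only, the hit, miss and linefill behaviour described above can be sketched as a small Python model. The class name LayerOneCache, the direct-mapped organisation, the line size, the number of lines and the cycle counts are all assumptions chosen for the example and are not taken from any particular apparatus.

```python
# Illustrative model only: every constant below is an assumed value.
L1_HIT_CYCLES = 1    # assumed cost of a layer one cache hit
BULK_CYCLES = 20     # assumed additional cost of an access to bulk memory
LINE_WORDS = 4       # assumed number of data words per cache line
NUM_LINES = 8        # assumed number of lines in the layer one cache


class LayerOneCache:
    """Hypothetical direct-mapped layer one cache in front of bulk memory."""

    def __init__(self, bulk):
        self.bulk = bulk                   # bulk memory modelled as a list of words
        self.tags = [None] * NUM_LINES     # None marks an invalid line
        self.lines = [None] * NUM_LINES

    def load(self, word_addr):
        """Return (value, cycles) for a load of the word at word_addr."""
        line_addr = word_addr // LINE_WORDS
        index = line_addr % NUM_LINES
        offset = word_addr % LINE_WORDS
        if self.tags[index] == line_addr:
            # Hit: the value is returned relatively quickly.
            return self.lines[index][offset], L1_HIT_CYCLES
        # Miss: the request propagates to bulk memory, and the returned
        # line is also stored in the cache (the linefill process).
        base = line_addr * LINE_WORDS
        self.lines[index] = self.bulk[base:base + LINE_WORDS]
        self.tags[index] = line_addr
        return self.lines[index][offset], L1_HIT_CYCLES + BULK_CYCLES


bulk_memory = list(range(100, 164))   # 64 words of example data
cache = LayerOneCache(bulk_memory)
first = cache.load(5)    # miss, triggers a linefill of words 4 to 7
second = cache.load(6)   # hit in the line just filled
```

In this sketch the second load is served from the line filled by the first, which is the effect the hierarchical arrangement relies upon.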
Whilst such a hierarchical memory system can hide somewhat the latency of bulk memory, it is clear that the cache is only effective in reducing the load latency if the data value required is present in the cache. In an attempt to seek to increase the likelihood that the data value will be present in the cache, it is known to employ preload instructions which are typically placed at an earlier location within the program code than the real load instruction, and which can be used as a hint to the memory system that a real load is likely to take place in the near future. Whilst the preload instruction typically has no effect within the data processing apparatus, in that it is treated as a NOP (“no operation”) instruction and hence does not cause any update of the data processing apparatus architectural state, the memory system itself can make use of this preload instruction by causing a linefill process to take place if required to ensure that the data value is then present in the layer one cache prior to the real load being issued. Hence, when the subsequent load instruction issues for real, the data value will be located within the layer one cache, and can be loaded into the register file relatively quickly.
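The effect of such a preload instruction can be sketched, purely as an illustration, with the following Python model. The class name HintedCache, the line size and the cycle counts are assumptions, and the model further assumes that the linefill triggered by the hint has completed before the real load is issued.

```python
# Illustrative model only: the constants and names are assumed.
LINE_WORDS = 4     # assumed number of data words per cache line
HIT_CYCLES = 1     # assumed cost of a layer one cache hit
MISS_CYCLES = 21   # assumed cost of a miss serviced from bulk memory


class HintedCache:
    """Hypothetical layer one cache that tracks only which lines are present."""

    def __init__(self):
        self.present = set()

    def preload(self, word_addr):
        # The hint returns no data and updates no register (a NOP as far as
        # architectural state is concerned); it only causes a linefill.
        self.present.add(word_addr // LINE_WORDS)

    def load(self, word_addr):
        """Return the number of cycles taken by a real load."""
        line = word_addr // LINE_WORDS
        if line in self.present:
            return HIT_CYCLES
        self.present.add(line)   # linefill on a miss
        return MISS_CYCLES


cold = HintedCache()
cycles_without_hint = cold.load(40)   # no earlier hint: the load misses

warm = HintedCache()
warm.preload(40)                      # hint placed earlier in the program
cycles_with_hint = warm.load(40)      # the real load now hits
```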
Hence, the use of such preload, or hint, instructions can further help hide the latency of bulk memory. However, there are other performance limiting features that are becoming more significant as processors become more advanced, and in particular operate more quickly. Most modern processors are arranged in a pipelined manner, which allows multiple instructions to be in the process of execution at any point in time, and there is a desire to increase processor performance through higher operating frequencies. As clock frequencies increase, and pipeline depths tend to increase further, it has been found that even the layer one cache of the memory system has difficulty keeping up with the requirements of the processor, and accordingly can negate the performance improvement obtained by operating the processor at a higher frequency. This performance impact can be expressed in terms of a “load-use penalty”, the term load-use penalty referring to the time it takes between issuing a load instruction and the point being reached where the data loaded by that instruction is available for a subsequent instruction. The load-use penalty is becoming particularly important in modern day processors, for example (but not limited to) processor cores using commodity compiled RAM (Random Access Memory).
As an example of load-use penalty, consider the following code sequence:
LDR r0, [r1, r2]
ADD r3, r4, r0
As can be seen, the add instruction has a dependency on the result of the load instruction, since one of its operands is r0. In a 5-stage pipeline (consisting of fetch, decode, execute, memory and write-back stages), the flow of execution might be
Cycle  1   2   3   4   5   6   7
LDR    F   D   E   M   W
ADD        F   D   -   E   M   W
For the LDR instruction, the effective address (ea) is calculated in the ‘E’ stage, and is the sum of r1 and r2. This ea is issued to the L1 memory system in the ‘E’ stage (cycle 3) and the memory system must respond (L1 cache hit case) with the data by the end of the ‘M’ stage (cycle 4).
As can be seen, the ADD instruction ideally needs the value of r0 at the start of cycle 4 but it is not available then. Hence a single cycle stall occurs which pushes the ADD ‘E’ stage to the right by a cycle.
Therefore this example has a load-use penalty of 1.
The load-use penalty is therefore an effect of pipelining of the memory access. The latency of the memory system has an impact on how deep this pipelining must be.
In a higher frequency system, more time may be needed to access L1. As an example, consider the following 8-stage pipeline:
Cycle  1   2   3   4   5   6   7   8   9   10  11
LDR    F1  F2  D   I   E1  E2  E3  W
ADD        F1  F2  D   I   -   -   E1  E2  E3  W
In this example, the LDR calculates the ea in cycle 5, issues it to the memory system in cycle 6, receives the data at the end of cycle 7 and writes it to the register file/forwards to other instructions at the end of cycle 8.
The ADD instruction needs the data at the start of its E2 stage. This example illustrates a load-use penalty of 2.
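The two examples above can be reproduced, as a sketch only, by the following Python function. It assumes an in-order pipeline in which the dependent instruction is issued in the cycle immediately after the load and each stage takes one cycle; the function name load_use_penalty and its parameters are hypothetical.

```python
def load_use_penalty(stages, data_ready_after, needed_at_start_of):
    """Stall cycles for an instruction issued directly after a load.

    stages             -- ordered list of pipeline stage names
    data_ready_after   -- stage of the load at whose end the data is available
    needed_at_start_of -- stage of the dependent instruction that consumes it

    Assumed model: the dependent instruction trails the load by one cycle,
    so the stall is the difference between the two stage positions.
    """
    ready = stages.index(data_ready_after)
    needed = stages.index(needed_at_start_of)
    return max(0, ready - needed)


# 5-stage example: data available at the end of 'M', needed at the start of 'E'.
five_stage = ["F", "D", "E", "M", "W"]
penalty_5 = load_use_penalty(five_stage, "M", "E")

# 8-stage example: data forwarded at the end of 'W', needed at the start of 'E2'.
eight_stage = ["F1", "F2", "D", "I", "E1", "E2", "E3", "W"]
penalty_8 = load_use_penalty(eight_stage, "W", "E2")
```

Under these assumptions penalty_5 evaluates to 1 and penalty_8 evaluates to 2, matching the stalls shown in the two pipeline diagrams.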
As frequencies increase there is the possibility of further pipelining being needed.
Common techniques for reducing the impact of the load-use penalty are applied at the software compilation stage. If the compiler can separate the LDR from the dependent instruction with other (useful) instructions then the load-use penalty can be hidden.
There are limitations to this approach. For example, the program may not contain enough independent instructions to separate the load instruction from the instruction that uses the loaded value.
As the load-use penalty increases, the number of suitable “separating” instructions that must be identified increases and the problem rapidly becomes intractable.
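The relationship between the load-use penalty and the number of separating instructions can be illustrated with the following Python sketch. It assumes a single-issue, in-order pipeline in which only loads delay their consumers (results of other instructions are assumed to be forwarded without stalling); the function total_stalls and the SUB instruction used as a separating instruction are hypothetical.

```python
def total_stalls(program, penalty):
    """Count stall cycles for a list of (op, dest, sources) tuples.

    Assumed model: one instruction issues per cycle; a value loaded by an
    LDR issued in cycle c can first be consumed without stalling by an
    instruction issued in cycle c + 1 + penalty.
    """
    ready_cycle = {}   # register -> first stall-free issue cycle for a consumer
    cycle = 0
    stalls = 0
    for op, dest, sources in program:
        earliest = max([ready_cycle.get(reg, 0) for reg in sources], default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # wait for the loaded value
            cycle = earliest
        if op == "LDR":
            ready_cycle[dest] = cycle + 1 + penalty
        cycle += 1
    return stalls


# The sequence from the example above: ADD immediately follows the LDR.
adjacent = [
    ("LDR", "r0", ["r1", "r2"]),
    ("ADD", "r3", ["r4", "r0"]),
]

# The same sequence with one independent (hypothetical) instruction between.
scheduled = [
    ("LDR", "r0", ["r1", "r2"]),
    ("SUB", "r5", ["r6", "r7"]),
    ("ADD", "r3", ["r4", "r0"]),
]

stalls_adjacent_p1 = total_stalls(adjacent, 1)
stalls_scheduled_p1 = total_stalls(scheduled, 1)
stalls_scheduled_p2 = total_stalls(scheduled, 2)
```

In this model one separating instruction hides a penalty of 1 entirely but leaves one stall cycle when the penalty is 2, so each additional cycle of penalty requires a further independent instruction to be found.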
Accordingly, it would be desirable to provide a technique which further improves the speed of a load operation within the data processing apparatus.