1. Field of the Invention
The present invention relates to techniques for predicting memory access in a data processing apparatus.
2. Description of the Prior Art
A data processing apparatus will typically include a processor, for example a processor core, for executing a sequence of instructions that are applied to data items. Typically, a memory may be provided for storing the instructions and data items required by the processor core. Further, it is often the case that one or more caches are provided for storing instructions and data items required by the processor core, so as to reduce the number of accesses required to the memory.
It is known for such processor cores to incorporate one or more pipelines for executing instructions, each pipeline having a plurality of stages. The provision of pipelines enables multiple instructions to be in the process of execution at any one time, which can increase the throughput and efficiency of the processor core. Typically, as instructions step through the pipeline, any resources required to process those instructions, such as data items from memory or registers, are made available at the appropriate pipeline stage. Typically, a clock signal is provided which provides timing information to control the rate at which instructions step through the pipeline.
In an ideal scenario, each instruction spends one clock cycle in each pipeline stage, and then moves to the next pipeline stage. However, as will be appreciated by those skilled in the art, there are various scenarios in which it will be necessary to keep an instruction within a particular pipeline stage for more than one clock cycle, for example because the processing of that instruction required by that pipeline stage will require more than one clock cycle, or because processing of that instruction at that pipeline stage cannot take place in the current clock cycle because not all of the information or data required to enable that processing to take place is available. In such scenarios, the particular pipeline stage in question will issue one or more stall signals which are then processed by control logic of the pipelined processor to determine whether it is necessary to stall any preceding pipeline stages. The receipt of a stall signal by a pipeline stage prevents that stage from providing its instruction and any associated information or data to the next stage. Typically, assuming the immediately preceding pipeline stage contains an instruction, it will be stalled, and this stall will be replicated back through the earlier stages of the pipeline.
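By way of illustration only, the stall behaviour described above can be sketched in software. The following is a minimal model under stated assumptions, not a description of any particular processor: the function name `step`, the list representation of the pipeline (index 0 being the earliest stage, `None` marking an empty stage) and the instruction labels are all hypothetical.

```python
from typing import List, Optional


def step(stages: List[Optional[str]],
         stalling: Optional[int],
         fetch: Optional[str]) -> List[Optional[str]]:
    """Advance a simplified pipeline by one clock cycle.

    stages[0] is the earliest stage; None marks an empty stage.
    stalling is the index of a stage that must keep its instruction
    for another cycle, or None if no stage stalls in this cycle.
    fetch is the instruction waiting to enter stage 0.
    """
    n = len(stages)
    held = [False] * n
    if stalling is not None:
        held[stalling] = True
        # Replicate the stall back towards the earliest stage: a
        # preceding stage is stalled only if the stage after it is
        # stalled and it actually contains an instruction.
        for i in range(stalling - 1, -1, -1):
            held[i] = held[i + 1] and stages[i] is not None

    nxt: List[Optional[str]] = []
    for i in range(n):
        if held[i]:
            nxt.append(stages[i])      # stalled: keep the instruction
        elif i == 0:
            nxt.append(fetch)          # accept a new instruction
        elif held[i - 1]:
            nxt.append(None)           # previous stage passes nothing on
        else:
            nxt.append(stages[i - 1])  # normal one-stage advance
    return nxt
```

In this sketch, calling `step(["D", "C", "B", "A"], 2, "E")` leaves `D`, `C` and `B` in place while `A` leaves the pipeline, whereas an empty stage immediately before the stalling stage absorbs the stall so that earlier stages continue to advance.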
The processing speed of the pipeline is limited by its critical path which, in many implementations, is dependent on the minimum time needed by any one stage to generate and propagate a signal required to control stages during the next cycle. It will be appreciated that the critical path may be any path, but in many designs the critical path is determined by the time needed to generate and propagate, for example, the stall signal. Hence, even if the rate at which instructions step through the pipeline can be increased, the clock period cannot be made shorter than the minimum time needed by that stage to generate and propagate such signals.
To reduce complexity when transferring data items to or from memory, the memory is typically arranged such that only predetermined blocks of data may be accessed during a single access cycle. It will be appreciated that the size of the block is a matter of design choice, and may for example be selected to be that of the most commonly-accessed size of data item, for example a word. These blocks of data in the memory are delimited by so-called address boundaries. By accessing memory in blocks, the memory interface can be simplified. Also, when accessing data items lying within adjacent address boundaries, it will be appreciated that the data items can be accessed in a single access cycle.
However, there may be instances where data items to be accessed are not within adjacent address boundaries. This occurs typically if a data item to be stored has a size which is different to the size of the blocks of data. This variation in size can occur for example when variable length coding is used. It is possible, in situations where the data item to be stored is smaller than the size of the predetermined blocks, for the data item to be ‘padded’ with null data to make that data item align with the address boundaries. However, the use of padding is undesirable since it reduces the storage efficiency of the memory. Hence, if padding is not used, then subsequent data items may be split across both sides of an address boundary. Also, if a data item is larger than the block, then clearly that data item will be split across both sides of an address boundary. It will be appreciated that when the data item to be accessed is on both sides of an address boundary, multiple access cycles will be required to access that data item.
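The relationship between a data item's address, its size and the number of access cycles required can be illustrated with a short sketch. This is an example under stated assumptions only: the block size of four bytes, the assumption of one access cycle per block touched, and the names `BLOCK_SIZE`, `access_cycles` and `crosses_boundary` are hypothetical and chosen purely for illustration.

```python
BLOCK_SIZE = 4  # assumed block size in bytes (e.g. one 32-bit word)


def access_cycles(address: int, size: int,
                  block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks (and, by assumption, access cycles) needed to
    access `size` bytes starting at byte address `address`."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_block = address // block_size
    last_block = (address + size - 1) // block_size
    return last_block - first_block + 1


def crosses_boundary(address: int, size: int,
                     block_size: int = BLOCK_SIZE) -> bool:
    """True if the data item lies on both sides of an address boundary."""
    return access_cycles(address, size, block_size) > 1
```

For example, with four-byte blocks a four-byte item at address 0 lies within one block, whereas the same item at address 2 straddles the boundary at address 4 and requires two accesses. An eight-byte item requires at least two accesses wherever it is placed.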
When processing a memory instruction (such as a load or store instruction) in the pipeline, an address in memory to be accessed will need to be generated, usually by an arithmetic operation performed on operands associated with that memory instruction. The operands may be specified directly in the instruction itself, or by reference to the contents of one or more registers. It will be appreciated that a finite time will be required for such an address generation and that, typically, address generation takes most of a clock cycle.
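A simple form of such address generation may be sketched as follows. The sketch assumes, purely for illustration, that the arithmetic operation is an addition of two operands modulo a 32-bit address width, and that an operand is either an immediate value or the name of a register; the names `read_operand` and `generate_address` and the register names used are hypothetical.

```python
from typing import Dict, Union

# An operand is either an immediate value carried in the instruction
# (int) or a reference to a register by name (str). Assumed encoding.
Operand = Union[int, str]


def read_operand(op: Operand, regs: Dict[str, int]) -> int:
    """Resolve an operand to a value, reading a register if needed."""
    return regs[op] if isinstance(op, str) else op


def generate_address(base: Operand, offset: Operand,
                     regs: Dict[str, int], width: int = 32) -> int:
    """Add the two operands and wrap to the assumed address width."""
    mask = (1 << width) - 1
    return (read_operand(base, regs) + read_operand(offset, regs)) & mask
```

With a register file in which hypothetical register `r1` holds 0x1000, the call `generate_address("r1", 2, regs)` yields 0x1002. In hardware the corresponding adder is what occupies most of the clock cycle referred to above.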
In the particular case where the generated address indicates that the data item to be accessed crosses an address boundary, previous stages in the pipeline may need to be stalled whilst the multiple access takes place, since typically such multiple accesses will take more than one clock cycle. Given that a signal to stall previous stages cannot be issued and propagated to earlier stages in the pipeline until the address generation completes, it is clear that this can be a critical path in the processor. This is because the rate at which instructions are clocked through the pipeline is constrained by the speed at which the stall signal can be routed to the necessary pipeline stages following address generation. This constraint undesirably limits the processing speed of the processor.
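The effect of this serial dependency on the clock rate can be illustrated with a simple timing sketch. The function name `min_clock_period_ps`, the division of the path into three components and all delay values used with it are hypothetical and chosen only to show the arithmetic; they are not measurements of any real design.

```python
from typing import Iterable


def min_clock_period_ps(addr_gen_ps: int,
                        crossing_check_ps: int,
                        stall_route_ps: int,
                        other_paths_ps: Iterable[int] = ()) -> int:
    """Shortest usable clock period, in picoseconds, for a simplified
    pipeline.

    The stall path is serial: the address must be generated before a
    boundary crossing can be detected, and the crossing must be
    detected before the stall signal can be routed to earlier stages.
    The clock period can be no shorter than the slowest path.
    """
    stall_path_ps = addr_gen_ps + crossing_check_ps + stall_route_ps
    return max([stall_path_ps, *other_paths_ps])
```

With illustrative delays of 700 ps for address generation, 100 ps for detecting the crossing and 250 ps for routing the stall signal, the stall path totals 1050 ps and sets the clock period even if every other path in the pipeline completes within 900 ps.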
Accordingly, it is desired to provide an improved technique for determining whether a data item to be accessed crosses an address boundary and will hence require multiple memory accesses.