1. Technical Field
The technical field of the present specification relates in general to a method and system for data processing and in particular to a processor and method for storing data to a memory within a data processing system. Still more particularly, the technical field relates to a processor and method for store gathering through merging store instructions.
2. Description of the Related Art
A typical state-of-the-art processor comprises multiple execution units, which are each optimized to execute a corresponding type of instruction. Thus, for example, a processor may contain a fixed-point unit (FXU), a floating-point unit (FPU), a branch processing unit (BPU), and a load-store unit (LSU) for executing fixed-point, floating-point, branch, and load and store instructions, respectively.
When a store instruction is retrieved from memory for execution by a processor, the instruction is first decoded to determine the execution unit to which the instruction should be dispatched. After the store instruction is decoded, the store instruction is dispatched to the LSU for execution. Execution of a store instruction entails calculating the effective address (EA) of the memory location to which the data associated with the store instruction is to be written. After a store instruction has finished, that is, the EA of the store instruction has been calculated, the store instruction is completed by committing the data associated with the store instruction to a store queue from which the data will be written to the specified memory location.
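The flow described above (execute by calculating the EA, then complete by committing the data to a store queue that is later written to memory) can be illustrated with a minimal sketch. All names here (`StoreInstr`, `LoadStoreUnit`, `effective_address`) are hypothetical, and the base-plus-displacement addressing and 32-bit address width are assumptions made only for illustration; the sketch is not taken from any particular processor.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class StoreInstr:
    base: int          # assumed: value read from a base register
    displacement: int  # assumed: immediate offset encoded in the instruction
    data: bytes        # data to be written to memory


def effective_address(instr: StoreInstr) -> int:
    # Assumption: EA = base + displacement, truncated to 32 bits.
    return (instr.base + instr.displacement) & 0xFFFFFFFF


class LoadStoreUnit:
    def __init__(self) -> None:
        self.store_queue = deque()  # holds (EA, data) pairs awaiting write-back
        self.memory = {}            # byte-addressed memory model

    def execute(self, instr: StoreInstr) -> int:
        # The store "finishes" once its EA has been calculated.
        return effective_address(instr)

    def complete(self, ea: int, instr: StoreInstr) -> None:
        # The store "completes" by committing its data to the store queue.
        self.store_queue.append((ea, instr.data))

    def drain(self) -> None:
        # Write queued stores to memory in program order.
        while self.store_queue:
            ea, data = self.store_queue.popleft()
            for offset, byte in enumerate(data):
                self.memory[ea + offset] = byte
```

For example, a store with base `0x1000` and displacement `8` finishes with EA `0x1008`, is queued on completion, and reaches the memory model only when the queue is drained.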
In order to reduce the number of cycles required to store and retrieve data, processors are often equipped with an on-board upper-level data cache. Such upper-level data caches permit data accesses to be performed in as little as a single cycle. Because accesses to cached data incur minimal latency, only a small performance inefficiency results from multiple consecutive stores to the same doubleword in memory. However, in data processing system configurations without caches, or in which store instructions are cache-inhibited or write-through, performance inefficiency arises from multiple consecutive stores to the same doubleword due to the additional latency of bus accesses.
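The cost of consecutive stores to the same doubleword can be made concrete with a rough counting model. The sketch below assumes an 8-byte doubleword and one bus write per store when stores go directly to the bus; the function names are hypothetical, and the second function only counts how many writes would remain if adjacent stores to the same doubleword were combined. It is an illustration of the inefficiency, not a description of any specific mechanism.

```python
DOUBLEWORD_BYTES = 8  # assumption: a doubleword is 8 bytes


def doubleword_index(ea: int) -> int:
    # Stores whose EAs share this index fall within the same doubleword.
    return ea // DOUBLEWORD_BYTES


def bus_writes_ungathered(eas: list) -> int:
    # Without a cache, every store is a separate bus access.
    return len(eas)


def bus_writes_if_merged(eas: list) -> int:
    # Count one bus write per run of consecutive stores to the same doubleword.
    writes = 0
    previous = None
    for ea in eas:
        current = doubleword_index(ea)
        if current != previous:
            writes += 1
        previous = current
    return writes


stores = [0x1000, 0x1004, 0x1008, 0x100C, 0x2000]
print(bus_writes_ungathered(stores))  # 5
print(bus_writes_if_merged(stores))   # 3
```

In this example the first two stores share one doubleword and the next two share another, so five separate bus accesses correspond to only three distinct doubleword writes.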
Consequently, it would be desirable to provide an efficient method and system for storing data to memory within a data processing system which minimize the number of cycles required to perform multiple store accesses to the same doubleword.