1. Technical Field
The present invention relates to a method and system for data processing in general and, in particular, to a method and system for storing data to a memory within a data-processing system. Still more particularly, the present invention relates to a method and system for front-end gathering of store instructions within a data-processing system.
2. Description of the Prior Art
A typical superscalar processor comprises multiple execution units, each optimized to execute a corresponding type of instruction. For example, the processor may contain a fixed-point unit (FXU) for executing fixed-point instructions, a floating-point unit (FPU) for executing floating-point instructions, a branch-processing unit (BPU) for executing branch instructions, and a load-store unit (LSU) for executing load and store instructions.
When an instruction is retrieved from a system memory for execution by the processor, the instruction is first decoded in order to determine an execution unit to which the instruction should be dispatched. In the case of a store instruction, it will be dispatched to the LSU for execution. Execution of a store instruction begins with calculating the effective address (EA) of the memory location to which the data associated with the store instruction is to be written. After the EA of the store instruction has been calculated, the execution of the store instruction is completed by committing the data associated with the store instruction to a store queue from which the data will be written to a specified memory location.
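By way of illustration only, the two phases of store execution described above may be sketched as follows. The sketch assumes a base-plus-displacement addressing mode and 32-bit effective addresses, neither of which is specified above; the names StoreInstruction, effective_address, execute_store, and drain are hypothetical and are not taken from any particular processor.

```python
from collections import deque
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # assumption: effective addresses are 32 bits wide


@dataclass
class StoreInstruction:
    base: int          # contents of the base register (hypothetical field)
    displacement: int  # signed immediate offset (hypothetical field)
    data: bytes        # data associated with the store instruction


def effective_address(inst):
    """Phase 1: calculate the EA (assumed here to be base + displacement)."""
    return (inst.base + inst.displacement) & MASK32


def execute_store(inst, store_queue):
    """Phase 2: complete the store by committing its data to the store queue."""
    ea = effective_address(inst)
    store_queue.append((ea, inst.data))
    return ea


def drain(store_queue, memory):
    """Later, write queued data to the specified memory locations, oldest first."""
    while store_queue:
        ea, data = store_queue.popleft()
        for offset, byte in enumerate(data):
            memory[(ea + offset) & MASK32] = byte
```

In this sketch the store instruction is considered complete once its data has been placed in the queue; the write to memory takes place afterward, when the queue is drained.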
Generally speaking, with an on-chip data cache, multiple consecutive store instructions to the system memory result in only a small performance inefficiency. In most cases, such on-chip data caches permit data access to be performed in as little as a single cycle. When store instructions are write-through or cache-inhibited, however, multiple consecutive store instructions cause a greater performance inefficiency because of the additional latency of bus access.
When a page is designated as cache-allowed, the processor utilizes the cache to perform load and store operations to either the cache or the system memory, depending on the other memory/cache access attributes for the page. When a page is designated as cache-inhibited, the processor must bypass the cache and perform load and store operations directly to the system memory in a sequential manner. In data-processing systems that utilize a store queue for temporarily holding store instructions, the store queue is typically implemented with a collection of registers organized in a First-In-First-Out (FIFO) manner. Further, the store queue may be divided into a front-end queue and a back-end queue. Store instructions are added to entries of the front-end queue and removed from entries of the back-end queue. Each entry of the store queue holds an address, a byte count, and data for a store instruction. The total number of entries in the store queue is usually small because of the size constraints of the chip, even though overall performance may suffer because the execution of store instructions halts when the store queue becomes full.
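By way of illustration only, the store queue behavior described above may be modeled as a bounded FIFO. The sketch is a software model under stated assumptions, not a description of any particular hardware: the front-end queue and the back-end queue are collapsed into the two ends of a single structure, and the names StoreEntry, StoreQueue, enqueue, and dequeue are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class StoreEntry:
    address: int     # address of the store
    byte_count: int  # number of bytes to be written
    data: bytes      # data for the store instruction


class StoreQueue:
    def __init__(self, capacity):
        self.capacity = capacity  # small in practice because of chip-size constraints
        self.entries = deque()

    def is_full(self):
        return len(self.entries) >= self.capacity

    def enqueue(self, address, data):
        """Add a store at the front end; return False when the store must stall."""
        if self.is_full():
            return False
        self.entries.append(StoreEntry(address, len(data), data))
        return True

    def dequeue(self):
        """Remove the oldest store at the back end, or return None if empty."""
        return self.entries.popleft() if self.entries else None
```

With a capacity of two entries, for example, a third consecutive call to enqueue returns False until dequeue has freed an entry, which models the halt in store execution that occurs when the store queue becomes full.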
Consequently, it would be desirable to provide an efficient method and system for gathering these store instructions in the front-end of the store queue such that the number of instructions transferred to the data cache or the system memory via a system bus can be effectively reduced.