The present invention relates generally to a superscalar processor with very long instruction word (VLIW)-like dispatch groups and more particularly to the decode and treatment of instructions which access volatile address space.
Superscalar processors employ aggressive techniques to exploit instruction-level parallelism. Wide dispatch and issue paths place an upper bound on peak instruction throughput. Large issue buffers are used to maintain a window of instructions necessary for detecting parallelism, and a large pool of physical registers provides destinations for all of the in-flight instructions issued from the window beyond the dispatch boundary. To enable concurrent execution of instructions, the execution engine is composed of many parallel functional units. The fetch engine speculates past multiple branches to supply a continuous instruction stream to the decode, dispatch, and execution pipelines, thereby maintaining a large window of potentially executable instructions.
The trend in superscalar design is to scale these techniques: wider dispatch/issue, larger windows, more physical registers, more functional units, and deeper speculation. To maintain this trend, it is important to balance all parts of the processor, since any bottleneck diminishes the benefit of aggressive techniques. Data-dependent decode, in particular, very adversely affects the performance gains of these techniques.
FIG. 1 illustrates a block diagram of a typical processing system. The processing system includes a processor 301 and a cache 302, which communicate with a host bus 300. The host bus 300 also communicates with a memory controller 303, which in turn provides information to and receives information from the system memory 304. The memory controller 303 also communicates with another bus, in this example a PCI bus 100. The PCI bus 100 communicates with an IDE controller, which in turn is connected to a hard disk drive 111. The PCI bus 100 also communicates with a video adapter 102, which is in turn coupled to a CRT 112, and is coupled to an ISA bus 200 through a PCI/ISA interface 103. The ISA bus 200 in turn is coupled to an Ethernet or Token Ring controller, which is coupled to a network such as a local area network (LAN). The ISA bus 200 also communicates with another video adapter 202, which has its associated CRT 212, and with an IDE controller 201, which is coupled to a hard disk drive 211.
One of the critical bottlenecks in such a processing system is load and store bandwidth. This is particularly true for machines which operate at higher frequencies, because of the growing disparity among processor, I/O bus, and main memory operating frequencies. Since most of the currently prevalent processor architectures (x86 (IA-32), PowerPC/AS, ARM, etc.) were implemented before this memory/logic frequency disparity became so pronounced, many include some type of volatile I/O space or strongly ordered memory in one or more of their respective system architectures.
Volatile address space can be defined simply as address space which, if accessed multiple times, will respond with different data. An example would be a memory-mapped FIFO in a video or communications adapter, or a multiplicity of addresses which, if accessed in a different order, will respond with different data.
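The behavior of such an address can be illustrated with a minimal sketch. The class name `MemoryMappedFifo` and the sample values are hypothetical and serve only to show why a repeated or reordered read is not harmless: each read consumes a datum.

```python
from collections import deque


class MemoryMappedFifo:
    """Hypothetical model of a FIFO data register in an adapter."""

    def __init__(self, data):
        self._queue = deque(data)

    def read(self):
        # The read has a side effect: the datum is consumed and
        # cannot be obtained again by reading the same address.
        return self._queue.popleft()


fifo = MemoryMappedFifo([0x11, 0x22, 0x33])
first = fifo.read()   # same "address", first access
second = fifo.read()  # same "address", second access returns different data
```

If a speculative read of such an address were discarded and re-executed, the datum returned by the discarded read would be lost.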
The requirement that this be supported has a devastating effect on processor implementations and performance because it requires the physical or effective address (depending on the architecture) to be compared against some table, range register, or other checking mechanism to determine if the address can be accessed out-of-order. This is further compounded by attempts to add wider dispatch, which is optimally done with a VLIW-like dispatch group that has no ability to maintain ordering within the group. Since the actual address is not known at instruction decode time, a processor which implements such VLIW-like dispatch groups must block execution, flush the VLIW-like dispatch group, and reformat the individual instructions of the VLIW-like word into a safe, lower-performance sequence of individual instructions.
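The address check described above can be sketched as follows. This is only an illustration under stated assumptions: the `RangeRegister` type, the function names, the 32-bit address width, and the sample range are all hypothetical and are not taken from any particular architecture. The point is that the check needs the computed address, which exists only after the base register value is read, well after decode.

```python
from typing import NamedTuple


class RangeRegister(NamedTuple):
    """Hypothetical range register marking an in-order-only region."""
    base: int
    limit: int  # exclusive upper bound


def effective_address(base_value, displacement):
    # Assumption: 32-bit effective addresses, base plus displacement.
    return (base_value + displacement) & 0xFFFFFFFF


def must_access_in_order(address, range_registers):
    # True if the address falls inside any in-order-only region.
    return any(r.base <= address < r.limit for r in range_registers)


regions = [RangeRegister(0xF0000000, 0xF0001000)]
ea = effective_address(0xF0000000, 8)   # base register value known only at execute
needs_ordering = must_access_in_order(ea, regions)
```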
In a very high-frequency processor with a deep pipeline, this flush and reformat carries an unacceptably high performance penalty for any code stream which might even occasionally access this type of storage.
This problem manifests itself in a processor supporting the PowerPC/AS architecture. Guarded is defined in this application as an address which must only be accessed once for each datum. Additionally, all addresses within a particular guarded range must be accessed in program order. There is no way to distinguish between guarded storage for different adapters or devices, so all accesses to guarded space must be performed in strict program order.
Direct storage is different from guarded because a single memory address can be accessed multiple times without changing its value, but the order of accesses must be maintained. The present invention optimizes the performance of this strict architectural requirement in a VLIW-like processor.
A method and system for optimizing execution of an instruction stream which includes a very long instruction word (VLIW) dispatch group in which ordering is not maintained is disclosed. The method and system comprise examining an access which initiated a flush operation; capturing an indice related to the flush operation; and causing all storage access instructions related to this indice to be dispatched as single IOP groups until the indice is updated.
Storage access to address space which is volatile, such as Guarded (G=1) or Direct Store (E=DS), must be handled in a non-speculative manner so that operations which could potentially go to volatile I/O devices or control locations do not get processed out of order. Since the address is not known in the front end of the processor, this can only be determined by the load store unit or functional block which performs translation. Therefore, if a flush occurs for these conditions, in accordance with the present invention the value of the base register (RA) is latched and subsequent loads and stores which use this base register are decoded in a "safe" manner until an instruction is decoded which would change the base register value (safe means an internal instruction sequence which can be executed in order without repeating any accesses). The value of multiple base registers can be tracked in this manner, though the preferred embodiment would not use more than two. For example, one of the base registers could be for an input stream and one could be for an output stream.
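The tracking scheme described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the disclosed hardware: the names `Instr` and `SafeDecodeTracker`, the decode labels "single" and "group", and the oldest-first replacement of a latched register when more than two are needed are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    op: str                     # "load", "store", or any other mnemonic
    base: Optional[int] = None  # base register (RA) used by a load/store
    dest: Optional[int] = None  # register written by the instruction


class SafeDecodeTracker:
    def __init__(self, max_tracked=2):
        self.max_tracked = max_tracked
        self.tracked = []  # latched base register numbers, oldest first

    def on_flush(self, instr):
        """Latch the base register of the access that caused the flush."""
        if instr.base is None or instr.base in self.tracked:
            return
        if len(self.tracked) == self.max_tracked:
            self.tracked.pop(0)  # assumption: replace the oldest entry
        self.tracked.append(instr.base)

    def decode(self, instr):
        """Return "single" for safe in-order decode, else "group"."""
        is_storage = instr.op in ("load", "store")
        safe = is_storage and instr.base in self.tracked
        # An instruction that changes a latched base register ends tracking,
        # but only after its own decode mode has been decided.
        if instr.dest in self.tracked:
            self.tracked.remove(instr.dest)
        return "single" if safe else "group"


tracker = SafeDecodeTracker()
ld = Instr("load", base=3, dest=5)
before = tracker.decode(ld)          # no flush seen yet
tracker.on_flush(ld)                 # load store unit reports guarded access
after = tracker.decode(ld)           # same base register, now decoded safely
tracker.decode(Instr("add", dest=3)) # base register is overwritten
released = tracker.decode(ld)        # tracking has ended
```

In this sketch the flush penalty is paid once per stream rather than on every access through the same base register, which is the effect the latching is intended to achieve.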