Microprocessor technology continues to advance at a rapid pace, with consideration given to all aspects of design. Designers constantly strive to increase performance while maximizing efficiency. With respect to performance, greater overall microprocessor speed is achieved by improving the speed of various related and unrelated microprocessor circuits and operations. For example, one way in which operational efficiency is improved is by providing parallel and out-of-order instruction execution. As another example, operational efficiency is also improved by providing a faster and greater capability to move information, with such information including instructions and/or data. The present embodiments are primarily directed at this latter capability.
Movement of information to a destination is desirable in many instances. As a first example of moving information to a destination, there is the instance where information is moved (or copied) from a memory source location(s) to a memory destination location(s). As a specific example, page management in a paged memory system moves information, such as in a copy-on-write scenario. In this scenario, often various programs share the same copy of information; however, when one of the sharing programs desires to write to the shared version of the information, a copy of that information is made and dedicated to the writing program. In making the copy, therefore, information is copied from a source address to a destination address. As a second example of moving information to a destination, there is the instance of a block clear. Again, in the context of a paged system, such a clear may occur where it is desirable to allocate a page in memory for a program. As another example, for security reasons often an area in memory will need to be cleared before it can be accessed by another program. Therefore, the operating system (or other controlling resource) will write over (i.e., clear) the relevant page frames before granting a different program access to that area in memory.
Many information movement techniques cost a considerable amount of processing time. This is not so much due to the frequency of the operations as to the size of the information moved. For example, in a paged system such as described above, blocks on the order of 4 Kbytes or larger are often being moved. Indeed, blocks of the same size also may be cleared, and such an operation is often far more frequent than moving data from a source to a destination. In any event, these actions are quite common and burden the processor resources.
Due to the prevalence of information moves, some architectures have included instructions which are directed to such actions. For example, in the INTEL 80x86 system, there are included the REP MOVS and REP STOS instructions for moving information from a source to a destination or storing a fixed value to a destination, respectively. As another example, IBM mainframe techniques have included the MVCL instruction which can either move information from a source to a destination, or store fixed values to a destination. Indeed, IBM further includes the MOVPG instruction which moves pages as well as providing other functionality. In all events, processing of these instructions presents a burden on the system, and may be handled according to the particular architecture, a few of which are discussed below.
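By way of illustration only, the architectural effect of the byte forms of the two 80x86 instructions mentioned above can be modeled in C as follows. This is a minimal sketch of the visible result, assuming the direction flag is clear so that addresses ascend; the function names rep_movsb_model and rep_stosb_model are hypothetical and chosen for this example, and the sketch says nothing about how any particular processor implements these instructions internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of REP MOVSB with the direction flag clear:
 * copy `count` bytes from src to dst, one byte per iteration,
 * advancing both pointers and decrementing the count until it is zero. */
static void rep_movsb_model(uint8_t *dst, const uint8_t *src, size_t count)
{
    while (count != 0) {
        *dst++ = *src++;
        count--;
    }
}

/* Hypothetical model of REP STOSB with the direction flag clear:
 * store the same byte `value` into `count` consecutive destination bytes. */
static void rep_stosb_model(uint8_t *dst, uint8_t value, size_t count)
{
    while (count != 0) {
        *dst++ = value;
        count--;
    }
}
```

In terms of the examples given earlier, the first function corresponds to a source-to-destination move such as a copy-on-write page copy, and the second, called with a value of zero, corresponds to a block clear.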
Movement of information (i.e., either data or instructions) within a computer system is contemplated in all sorts of architectures, from mainframe computer systems to single integrated circuit microprocessors. When information is moved within a computer system, it is typically desirable to move as large a block of such information as possible. For example, bus widths continue to increase toward this end. As another example, various approaches have been attempted in computer architecture to create circuits which are either exclusively dedicated or partially dedicated to moving blocks of information. These approaches, however, often have certain drawbacks.
One approach to moving blocks of information is to include a dedicated and autonomous circuit to operate independent of the central processing unit ("CPU"). Because of its autonomy, the dedicated circuit permits a block move while the CPU is performing other operations. However, such hardware is often very complex, such as requiring address calculation and block length considerations. Thus, while performance may be improved, a cost necessarily comes with the improvement. Moreover, such approaches are known to be included only in large systems and are normally associated with a centralized storage controller which does not exist in smaller systems. Indeed, such approaches may be implemented using a separate processor to perform the block move operations. Therefore, this approach is not immediately applicable to single integrated circuit microprocessor systems.
Another approach to moving blocks of information is to include a dedicated block move circuit at the level of the execution units of a microprocessor. While this technique has been used within a single integrated circuit microprocessor system, it also suffers drawbacks. For example, the execution units will operate having access to some baseline bus width, such as an eight byte bus width under current technology. Because the dedicated block move circuit is at the same level as the execution units, it necessarily is constrained to the baseline bus width. Thus, while it may move blocks of information independent of other execution units, it can only move a block up to the size of the baseline bus width in a single write cycle. In the example immediately above, therefore, such a circuit could only move eight bytes at a time. As a result, if a cache having a line width of 32 bytes is being filled by such a technique, then each 32 byte line takes at least four write cycles to write (i.e., 4 cycles * 8 bytes/cycle = 32 bytes). Consequently, where it is known that a block to be moved is much larger than eight bytes, the advantage provided by the dedicated block move circuit is less than if it were operable to move a larger quantity in a single write cycle.
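The write-cycle arithmetic above generalizes to a simple ceiling division, sketched below in C. The function name write_cycles is hypothetical, and the sketch assumes an idealized case in which every write cycle transfers one full bus width and no other overhead is incurred; it gives a lower bound on the cycle count rather than a measured figure.

```c
#include <stddef.h>

/* Hypothetical helper: minimum number of write cycles needed to move
 * `block_bytes` bytes when each cycle moves at most `bus_bytes` bytes.
 * Assumes bus_bytes > 0. Rounds up so that a partial final transfer
 * still counts as one cycle. */
static size_t write_cycles(size_t block_bytes, size_t bus_bytes)
{
    return (block_bytes + bus_bytes - 1) / bus_bytes;
}
```

Under these assumptions, write_cycles(32, 8) gives the four cycles of the example above. For a block of 4096 bytes, an eight byte path requires at least 512 write cycles, whereas a path as wide as the 32 byte cache line would require at least 128.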
In view of the above, the present inventor addresses the drawbacks of certain prior block move circuits by providing various embodiments, as demonstrated below.