In many processor-based systems, the processor provides instructions tuned for efficient implementation of copy or store operations. Optimized software for memory copy operations is typically tuned for a specific processor implementation. In many cases, the optimal way of performing the data copy changes from one implementation to the next, so the code is a moving target for compiler, operating system (OS) kernel and application writers, who are forced to maintain multiple variants tuned for different scenarios, different micro-architectures and so forth.
An iterative copy instruction can be used to copy a number of data elements specified by one of the instruction's parameters. Iterative copy operations may have different native data element lengths, such as byte, word, double word, quad word, etc. The longer the native length, the more efficient the instruction may be in moving a quantum of data, since it may use larger ‘load’ and ‘store’ operations. For example, in the Intel® Architecture (IA32), a repeat move byte (REP MOVSB) instruction uses the value in a given register as an indicator of the length of the copy. In addition, the instruction receives a source pointer and a destination pointer as input parameters. Such an instruction is defined to move data one byte at a time. In some cases, the instruction's implementation may switch to a ‘fast mode’ in which the copy is performed using longer operations (e.g., 16 bytes at a time). The IA32 programmer's reference manual defines the conditions under which such a fast mode may be executed in current processors.
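As a rough illustration of the two behaviors described above, the following C sketch models a byte-at-a-time iterative copy and a hypothetical fast mode that moves 16 bytes per step. The function names, the `FAST_CHUNK` constant and the assumption of non-overlapping buffers are illustrative only; the sketch does not reproduce the conditions under which a real processor enters fast mode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed fast-mode width, taken from the "16 bytes at a time" example. */
enum { FAST_CHUNK = 16 };

/* Model of the defined behavior: move `count` bytes, one byte per step.
 * `count` plays the role of the length register; `dst` and `src` play the
 * role of the destination and source pointers. */
void rep_movsb_model(unsigned char *dst, const unsigned char *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    while (count != 0) {
        *dst++ = *src++;
        count--;
    }
}

/* Hypothetical fast mode: same result for non-overlapping buffers, but the
 * bulk of the data moves in FAST_CHUNK-byte steps. */
void rep_movsb_fast_model(unsigned char *dst, const unsigned char *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    while (count >= FAST_CHUNK) {
        memcpy(dst, src, FAST_CHUNK);   /* one wide load/store pair */
        dst += FAST_CHUNK;
        src += FAST_CHUNK;
        count -= FAST_CHUNK;
    }
    rep_movsb_model(dst, src, count);   /* leftover bytes, one at a time */
}
```

Both functions produce the same destination contents when the buffers do not overlap; only the number of steps differs.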
As the length of copy and set operations is in many cases unknown at compile time, one solution for improving the efficiency of copy operations with prior implementations of the iterative copy instructions is to use a first iterative copy instruction that moves the majority of the string, followed by a second iterative copy instruction that moves the remainder of the data (e.g., the first copy operation moves a double word at a time and the second copies the last 0-3 bytes). Such a sequence has two drawbacks: (a) the second instruction costs additional cycles that are always paid, even when the remainder is zero; and (b) the optimization is tuned for a specific length of the first iterative copy instruction followed by only a limited sequence of instructions for the second; any other combination will cause a significant performance loss.
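The two-instruction sequence described above can be sketched in C as follows. This is a minimal model under stated assumptions, not an actual instruction sequence: the name `two_stage_copy` is hypothetical, the buffers are assumed not to overlap, and each loop stands in for one iterative copy instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `len` bytes in two stages: double words first, then the 0-3 byte
 * remainder. */
void two_stage_copy(unsigned char *dst, const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    size_t dwords = len / 4;   /* element count for the first stage */
    size_t tail   = len % 4;   /* 0-3 bytes left for the second stage */

    /* First stage: one double word (4 bytes) per iteration. memcpy through
     * a temporary avoids alignment assumptions about src and dst. */
    for (size_t i = 0; i < dwords; i++) {
        uint32_t w;
        memcpy(&w, src, sizeof w);
        memcpy(dst, &w, sizeof w);
        src += sizeof w;
        dst += sizeof w;
    }

    /* Second stage: byte at a time. It is reached on every call, so its
     * setup cost is paid even when tail == 0, which is drawback (a). */
    for (size_t i = 0; i < tail; i++) {
        *dst++ = *src++;
    }
}
```

The split is fixed at four bytes here; choosing a different element width for the first stage changes the range of the remainder and the shape of the second stage, which is the tuning problem noted in drawback (b).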
Further, in a pipelined machine, it often happens that an instruction's best behavior needs to be decided at instruction decode time, even though some of the data required for making the decision is unknown or not yet committed. One example is a branch, which must be predicted as taken or not taken depending on flags, even if the flags have not been calculated yet. To resolve such a problem, the most common scheme is the use of branch predictors. Such predictors require time for training (building the history), have high costs (as much state needs to be saved), and their performance under irregular patterns is uncertain.
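To make the training cost and the sensitivity to irregular patterns concrete, the sketch below models a single two-bit saturating counter, one common textbook form of branch predictor. It is an illustrative assumption rather than the predictor of any particular processor, and the names `predictor2`, `bp_predict`, `bp_train` and `count_mispredictions` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two-bit saturating counter: 0 and 1 predict not taken, 2 and 3 predict
 * taken. */
typedef struct {
    unsigned counter;   /* always kept in the range 0..3 */
} predictor2;

bool bp_predict(const predictor2 *p)
{
    return p->counter >= 2;
}

/* Training step: move one notch toward the observed outcome. */
void bp_train(predictor2 *p, bool taken)
{
    if (taken) {
        if (p->counter < 3) p->counter++;
    } else {
        if (p->counter > 0) p->counter--;
    }
}

/* Run the predictor over a recorded outcome history and count misses. */
size_t count_mispredictions(predictor2 *p, const bool *outcomes, size_t n)
{
    assert(p != NULL && outcomes != NULL);
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (bp_predict(p) != outcomes[i]) misses++;
        bp_train(p, outcomes[i]);
    }
    return misses;
}
```

Starting from counter value 0, a branch that is always taken is mispredicted twice before the counter has been trained. For a strictly alternating taken/not-taken history of ten outcomes, the same predictor misses five times when it starts at 0 and all ten times when it starts at 1, so the result depends heavily on both the pattern and the stored state.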