The present invention relates to the field of microprocessor instructions, and more particularly to bit scan operation instructions and binary translations.
Bit scan operations search a data value for the least-significant or most-significant bit that is set to ‘1.’ The Intel® IA-32 architecture provides hardware assist for such operations via the BSF (bit scan forward) and BSR (bit scan reverse) instructions. On the IA-32 architecture these instructions operate on 16-bit or 32-bit operands. Software may combine these instructions to operate on larger data sets.
FIG. 1 illustrates an example of the BSF and BSR instructions operating on an 8-bit binary value X. Box 102 shows the BSF instruction operating on an 8-bit value X. Since the bit in position 0 corresponds to the least significant bit of X set to a value of ‘1,’ BSF(X)=0. Similarly, box 104 illustrates the BSR instruction operating on an 8-bit value X. Since the bit in position 6 corresponds to the most significant bit of X set to a value of ‘1,’ BSR(X)=6.
Microarchitecture implementations may support bit scan operations either by a sequence of microcode operations, or by providing special ALU (arithmetic logic unit) operations. The use of microcode may be expensive and impact performance, while the use of special ALU operations may be costly in terms of chip area. Binary translators and emulators for architectures that have bit scan operations, such as the IA-32 Execution Layer (IA32EL), support those operations via code sequences built from the instructions of the underlying (host) instruction set.
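One possible microcode implementation of a bit scan operation may use a microcode loop that examines one bit position per iteration. Because the number of iterations depends on the position of the set bit, the microcode loop implementation may lead to variable execution times, which are difficult for instruction schedulers to accommodate.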
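The data-dependent execution time of a loop-based scan can be illustrated in software. The following is a minimal C sketch, not actual microcode; the function names `bsf_loop` and `bsr_loop`, the `probes` counter, and the convention of returning -1 for a zero input are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scans upward from bit 0, one position per iteration.
 * Returns the position of the least-significant set bit, or -1 if x == 0.
 * *probes receives the number of bit positions examined. */
int bsf_loop(uint32_t x, int *probes)
{
    assert(probes != NULL);
    *probes = 0;
    for (int pos = 0; pos < 32; pos++) {
        (*probes)++;
        if ((x >> pos) & 1u)
            return pos;
    }
    return -1;  /* no bit set */
}

/* Hypothetical helper: scans downward from bit 31, one position per iteration.
 * Returns the position of the most-significant set bit, or -1 if x == 0. */
int bsr_loop(uint32_t x, int *probes)
{
    assert(probes != NULL);
    *probes = 0;
    for (int pos = 31; pos >= 0; pos--) {
        (*probes)++;
        if ((x >> pos) & 1u)
            return pos;
    }
    return -1;  /* no bit set */
}
```

For a hypothetical input of 0x45 (binary 01000101, a value chosen only because it has its lowest set bit at position 0 and its highest at position 6), `bsf_loop` returns 0 after a single probe, whereas `bsr_loop` returns 6 after 26 probes. The work performed therefore depends on the data being scanned.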
Another possible implementation is a lookup table, in which the data value being scanned is used to index a table. This implementation is only economical for small operand widths, because a table indexed directly by an n-bit value requires 2^n entries. Some approaches combine the lookup table with the microcode loop described above, “breaking” the input data into smaller chunks that are looked up one at a time. This implementation is used in the IA32EL binary translator from Intel® x86 to Intel® Itanium™ microprocessor instructions, where it results in a relatively inefficient code sequence consisting of 22 Itanium™ microprocessor instructions. Furthermore, if bit scan operations are infrequent, the table lookup approach may incur cache misses on the table accesses, which may further hinder performance.
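The combined table-and-loop approach can be sketched in C as follows. This is an illustrative sketch only and is not the IA32EL code sequence; the 4-bit chunk size, the table `bsf_nibble`, the function name `bsf_table`, and the -1 return for a zero input are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Position of the least-significant set bit for every 4-bit value.
 * Entry 0 is a placeholder (-1) because a zero chunk has no set bit. */
static const int8_t bsf_nibble[16] = {
    -1, 0, 1, 0, 2, 0, 1, 0,
     3, 0, 1, 0, 2, 0, 1, 0
};

static_assert(sizeof bsf_nibble / sizeof bsf_nibble[0] == 16,
              "one table entry per 4-bit chunk value");

/* Hypothetical helper: breaks x into 4-bit chunks, lowest chunk first,
 * and looks up the first non-zero chunk in the table.
 * Returns -1 if x == 0. */
int bsf_table(uint32_t x)
{
    for (int base = 0; base < 32; base += 4) {
        unsigned chunk = (x >> base) & 0xFu;
        if (chunk != 0)
            return base + bsf_nibble[chunk];
    }
    return -1;  /* no bit set */
}
```

Under these assumptions the table holds 16 entries rather than the 2^32 entries a direct 32-bit index would need, at the cost of up to eight loop iterations. The execution time still varies with the input, and each table access is a memory reference that may miss in the cache.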
Another approach recursively breaks the data value into halves, so that the operation completes in a number of iterations proportional to the logarithm of the width of the data value in bits.
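The halving approach can be sketched in C as shown below. This is a minimal illustration for 32-bit values, written iteratively rather than recursively; the function names `bsf_halving` and `bsr_halving` and the non-zero input precondition are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: finds the most-significant set bit by testing
 * whether the upper half of the remaining window contains any set bit. */
int bsr_halving(uint32_t x)
{
    assert(x != 0);  /* assumed precondition: at least one bit is set */
    int pos = 0;
    for (int width = 16; width >= 1; width >>= 1) {
        uint32_t upper = x >> width;
        if (upper != 0) {   /* highest set bit lies in the upper half */
            x = upper;
            pos += width;
        }
    }
    return pos;
}

/* Hypothetical helper: finds the least-significant set bit by testing
 * whether the lower half of the remaining window is empty. */
int bsf_halving(uint32_t x)
{
    assert(x != 0);  /* assumed precondition: at least one bit is set */
    int pos = 0;
    for (int width = 16; width >= 1; width >>= 1) {
        uint32_t mask = (1u << width) - 1u;
        if ((x & mask) == 0) {  /* lower half empty: discard it */
            x >>= width;
            pos += width;
        }
    }
    return pos;
}
```

Each function performs exactly five iterations for a 32-bit value (window widths 16, 8, 4, 2 and 1), regardless of which bit is set, which corresponds to the base-2 logarithm of the operand width.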