The methods described in this specification aim to improve a processor's average inter-instruction Hamming distance. The next few paragraphs describe this metric and explain its relation to power efficiency.
The Hamming distance between two binary numbers is the number of bit positions in which they differ. For example:
Numbers in decimal    Numbers in binary (inc. leading zeros)    Hamming distance
4 and 5               0100 and 0101                             1
7 and 10              0111 and 1010                             3
0 and 15              0000 and 1111                             4
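The definition can be sketched in C as follows. This is a minimal illustration rather than part of the described method, and the function name `hamming_distance` is an assumption introduced here. It XORs the two values, so that every differing bit position becomes a 1, and then counts the 1 bits.

```c
#include <stdint.h>

/* Hypothetical helper: number of bit positions in which a and b differ. */
unsigned hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;   /* 1 wherever the two values differ */
    unsigned count = 0;

    while (diff != 0) {
        diff &= diff - 1;    /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

With the example values, `hamming_distance(4, 5)` gives 1, `hamming_distance(7, 10)` gives 3 and `hamming_distance(0, 15)` gives 4.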
Hamming distance is related to power efficiency because of the way that binary numbers are represented by electrical signals. Typically a steady low voltage on a wire represents a binary 0 bit and a steady high voltage represents a binary 1 bit. A number will be represented using these voltage levels on a group of wires, with one wire per bit. Such a group of wires is called a bus. Energy is used when the voltage on a wire is changed. The amount of energy depends on the magnitude of the voltage change and the capacitance of the wire. The capacitance depends to a large extent on the physical dimensions of the wire. So when the value represented by a bus changes, the energy consumed depends on the number of bits that have changed—the Hamming distance—between the old and new values, and on the capacitance of the wires.
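As a rough worked form of this relationship, assume (this is a common first-order model, not something stated in the text) that every wire of the bus is a lumped capacitance C switched through the full supply voltage V, so that each bit transition dissipates about half of C times V squared. The symbols E, C, V, H and x are introduced here for illustration only; H denotes the Hamming distance.

```latex
E \approx \tfrac{1}{2}\, C\, V^{2}\, H(x_{\text{old}}, x_{\text{new}}),
\qquad
E_{\text{total}} \approx \tfrac{1}{2}\, C\, V^{2} \sum_{k=1}^{N-1} H(x_{k}, x_{k+1})
```

The first expression is the energy for one change of bus value; the second sums it over a sequence of N successive values. Under this model, with C and V fixed, the energy is proportional to the total Hamming distance.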
If one can reduce the average Hamming distance between successive values on a high-capacitance bus, keeping all other aspects of the system the same, the system's power efficiency will have been increased.
The capacitance of wires internal to an integrated circuit is small compared to the capacitance of wires fabricated on a printed circuit board due to the larger physical dimensions of the latter. The kind of systems that we are considering will normally have memory and microprocessor in distinct integrated circuits, interconnected by a printed circuit board. Therefore we aim to reduce the average Hamming distance between successive values on the microprocessor-memory interface bus, as this will have a particularly significant influence on power efficiency.
Even in systems where the microprocessor and memory are incorporated into the same integrated circuit, the capacitance of the wires connecting them will be larger than average, so reducing the average Hamming distance on the microprocessor-memory interface is still worthwhile.
Processor-memory communications perform two tasks. Firstly, the processor fetches its program from the memory, one instruction at a time. Secondly, the data that the program is operating on is transferred back and forth. The instruction fetch makes up the majority of the processor-memory communications.
The instruction fetch bus is the bus on which instructions are communicated from the memory to the processor. We aim to reduce the average Hamming distance on this bus, i.e. to reduce the average Hamming distance from one instruction to the next.
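The metric being targeted can be sketched as a function over a trace of fetched instruction words. This is an illustrative sketch only: the names `average_fetch_distance` and `trace` are assumptions, and the trace is taken to be the sequence of 32-bit words seen on the instruction fetch bus.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions in which a and b differ. */
static unsigned hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    unsigned count = 0;

    while (diff != 0) {
        diff &= diff - 1;    /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Hypothetical helper: mean Hamming distance between consecutive words
 * of a fetch trace. Returns 0.0 when there are fewer than two words. */
double average_fetch_distance(const uint32_t *trace, size_t n)
{
    unsigned long total = 0;
    size_t i;

    if (n < 2)
        return 0.0;
    for (i = 1; i < n; i++)
        total += hamming_distance(trace[i - 1], trace[i]);
    return (double)total / (double)(n - 1);
}
```

For a trace of the four words 0x0, 0xF, 0xF, 0xC the successive distances are 4, 0 and 2, so the average is 2.0.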
Instruction formats will now be discussed.
A category of processors which is suitable for implementation of the invention is the category of RISC (Reduced Instruction Set Computer) processors. One defining characteristic of this category of processors is that they have regular, fixed-size instructions. In our processor all instructions are made up of 32 bits. This is the same as the size of the instruction fetch bus.
Each instruction needs to convey various items of information to the processor. These items include:
    Operation codes (opcodes), indicating which basic action, such as addition or subtraction, the processor should carry out.
    Register specifiers, indicating which of the processor's internal storage locations (registers) should supply operands to or receive results from the operation.
    Immediate values, which are values used directly as operands to the operation.
For example, an instruction that tells the processor to “add 10 to the value currently in register 4 and store the result in register 5” would have the opcode for ‘add’, register specifiers 4 and 5, and immediate value 10.
We consider a processor with an instruction set which has only three instruction formats. The first has a five-bit opcode and a 26-bit immediate value. The second has a five-bit opcode, two five-bit register specifiers, and a 16-bit immediate value. The third has a five-bit opcode, a six-bit secondary opcode and three five-bit register specifiers. The fields that are common to all of the different instruction formats, such as the primary opcode field, are arranged to always be in the same bit positions.
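A minimal C sketch of the second format follows. The exact bit positions are an assumption inferred from the bit patterns in the assembly listing, not stated in the text: the top bit is taken as unused, followed by the primary opcode, the two register specifiers and the immediate. The names `encode_format2` and `primary_opcode` are hypothetical.

```c
#include <stdint.h>

/* Assumed field positions (inferred, not given explicitly in the text):
 *   bit 31      unused
 *   bits 30..26 primary opcode
 *   bits 25..21 first register specifier
 *   bits 20..16 second register specifier
 *   bits 15..0  16-bit immediate (second format)
 */
#define OPCODE_SHIFT 26
#define REG_A_SHIFT  21
#define REG_B_SHIFT  16

/* Hypothetical encoder for the second format:
 * five-bit opcode, two five-bit register specifiers, 16-bit immediate. */
uint32_t encode_format2(unsigned opcode, unsigned reg_a, unsigned reg_b,
                        int imm16)
{
    return ((uint32_t)(opcode & 0x1Fu) << OPCODE_SHIFT)
         | ((uint32_t)(reg_a  & 0x1Fu) << REG_A_SHIFT)
         | ((uint32_t)(reg_b  & 0x1Fu) << REG_B_SHIFT)
         | ((uint32_t)imm16 & 0xFFFFu);   /* two's-complement truncation */
}

/* The primary opcode sits in the same place in every format, so one
 * extractor works before the format is known. */
unsigned primary_opcode(uint32_t word)
{
    return (unsigned)((word >> OPCODE_SHIFT) & 0x1Fu);
}
```

Under these assumptions, `encode_format2(0x04, 30, 30, -20)` yields 0x13DEFFEC, which in binary is 0.00100.11110.11110.1111111111101100. Keeping the opcode in a fixed position also means that two instructions with the same opcode contribute no Hamming distance in those five bits, whatever their formats.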

Instruction scheduling will now be discussed.
Instruction scheduling is a known technique in which the order of the instructions is permuted in order to improve performance. Such modifications change the order in which instructions are executed, but do not modify the overall behaviour of the instruction sequence. To ensure this, any modification to the order of instructions must take into account the dependencies between instructions. In a traditional instruction scheduler, the goal is to avoid contention for CPU resources.
For example, consider the following C code fragment:
    int foo(int *array1, int *array2, int size) {
        int loop, result;
        result = 0;
        for (loop = 0; loop < size; loop++) {
            result = result + (*(array1++) * *(array2++));
        }
        return result;
    }
This compiles into the assembly language shown below. While we have chosen to illustrate the techniques using our own processor instruction set, the techniques apply equally well to any other similar 3-operand processor.
A short line has been used to separate the basic blocks that make up the program. Basic blocks are sections of the code that end with a change in program flow (a branch or call instruction). A new basic block is also started by a labelled instruction.
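The block-splitting rule just described can be sketched in C. This is an illustration under assumptions: the structure `insn_info`, its field names and the function `split_basic_blocks` are hypothetical, and each instruction is assumed to have already been classified as labelled or as changing program flow.

```c
#include <stddef.h>

/* Hypothetical per-instruction flags; field names are illustrative. */
struct insn_info {
    int has_label;       /* instruction carries a label            */
    int changes_flow;    /* instruction is a branch or a call      */
};

/* Writes the index of the first instruction of each basic block into
 * block_start (which must have room for n entries) and returns the
 * number of blocks found. */
size_t split_basic_blocks(const struct insn_info *insns, size_t n,
                          size_t *block_start)
{
    size_t blocks = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int starts_block = (i == 0)
                        || insns[i].has_label
                        || insns[i - 1].changes_flow;
        if (starts_block)
            block_start[blocks++] = i;
    }
    return blocks;
}
```

Applied to a 28-instruction sequence with flow changes at instructions 11, 15, 20 and 28 and labels at instructions 13 and 21, this gives five blocks starting at instructions 1, 12, 13, 16 and 21 (indices 0, 11, 12, 15 and 20 in the zero-based array).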
Next to the assembly language, we have shown the bit patterns representing the instructions. Any unused (or unallocated) bits have been shown as an ‘X’. The processor ignores these bits when it is decoding the operation described by an instruction. It is possible to optimise these bits to further reduce the Hamming distance, as we have described in a separate patent application. Such an optimisation should ideally be carried out in parallel with the optimisations described in this document.
In the bit patterns, any unresolved immediate fields are marked with an ‘i’. These are immediate values which are specified using a symbolic reference whose actual value is defined externally to the current compilation unit and therefore cannot be resolved into an actual value until the final link stage has taken place.
The final column shows the Hamming distance between the previous instruction and the current one. For the sake of simplicity, the unused bits are ignored when performing this calculation.
Block  No.  Label  Instruction             Bit pattern                             Hamming distance
#1     1           addi  %sp, %sp, #-20    x.00100.11110.11110.1111111111101100
       2           st.w  0(%sp), %8        x.10101.11110.01000.0000000000000000    18
       3           st.w  4(%sp), %9        x.10101.11110.01001.0000000000000100    2
       4           st.w  8(%sp), %10       x.10101.11110.01010.0000000000001000    4
       5           st.w  12(%sp), %11      x.10101.11110.01011.0000000000001100    2
       6           st.w  16(%sp), %lr      x.10101.11110.11111.0000000000010000    5
       7           ori   %10, %0, #0       x.01100.00000.01010.0000000000000000    11
       8           ori   %9, %1, #0        x.01100.00001.01001.0000000000000000    3
       9           movi  %11, #0           x.00010.xxxxx.01011.0000000000000000    4
       10          cmplt %7, %11, %2       x.00000.01011.00010.xxxxx.000110.00111  8
       11          bz    %7, L4            x.11010.00111.xxxxx.iiiiiiiiiiiiiiii    5
----
#2     12          ori   %8, %2, #0        x.01100.00010.01000.0000000000000000    5
----
#3     13   L6:    ld.w  %0, (%10)         x.10100.01010.00000.0000000000000000    3
       14          ld.w  %1, (%9)          x.10100.01001.00001.0000000000000000    3
       15          call  _mulsi3           x.11111.iiiiiiiiiiiiiiiiiiiiiiiiii      3
----
#4     16          add   %11, %11, %0      x.00000.01011.00000.xxxxx.000000.01011  5
       17          addi  %9, %9, #4        x.00100.01001.01001.0000000000000100    8
       18          addi  %10, %10, #4      x.00100.01010.01010.0000000000000100    4
       19          addi  %8, %8, #-1       x.00100.01000.01000.1111111111111111    17
       20          bnz   %8, L6            x.11011.01000.xxxxx.iiiiiiiiiiiiiiii    5
----
#5     21   L4:    ori   %0, %11, #0       x.01100.01011.00000.0000000000000000    6
       22          ld.w  %8, 0(%sp)        x.10100.11110.01000.0000000000000000    5
       23          ld.w  %9, 4(%sp)        x.10100.11110.01001.0000000000000100    2
       24          ld.w  %10, 8(%sp)       x.10100.11110.01010.0000000000001000    4
       25          ld.w  %11, 12(%sp)      x.10100.11110.01011.0000000000001100    2
       26          ld.w  %lr, 16(%sp)      x.10100.11110.11111.0000000000010000    5
       27          addi  %sp, %sp, #20     x.00100.11110.11110.0000000000010100    3
       28          jmpr  (%lr)             x.00000.11111.xxxxx.xxxxx.011110.xxxxx  6

Total: 148 transitions
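The distance column ignores unused bits. One way to compute such a distance is sketched below, representing each instruction as a word plus a mask of the bits that carry meaning. The structure `masked_word` and the function name are assumptions introduced here. Treating unresolved 'i' bits in the same way as unused 'x' bits is also an assumption, although it agrees with the distances shown for the branch and call instructions.

```c
#include <stdint.h>

/* An instruction word plus a mask of the bits that carry meaning.
 * Bits that are unused ('x') or unresolved ('i') have a 0 in the mask.
 * The struct and its field names are illustrative. */
struct masked_word {
    uint32_t bits;
    uint32_t defined;
};

/* Hamming distance counted only over bit positions that are defined
 * in both words. */
unsigned masked_hamming_distance(struct masked_word a, struct masked_word b)
{
    uint32_t diff = (a.bits ^ b.bits) & a.defined & b.defined;
    unsigned count = 0;

    while (diff != 0) {
        diff &= diff - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

For example, taking the top bit as unused, instruction 9 (movi) can be written as bits 0x080B0000 with mask 0x7C1FFFFF and instruction 10 (cmplt) as bits 0x016200C7 with mask 0x7FFF07FF. The masked distance between them is 8, the value shown in the listing.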
When scheduling this code, no instructions can move outside of their original basic block. In addition, the following dependencies must be preserved:
Basic Block #1
#1      before    #2, #3, #4, #5, #6    Dependency on %sp
#4      before    #7                    Dependency on %10
#3      before    #8                    Dependency on %9
#5      before    #9                    Dependency on %11
#10     before    #11                   Dependency on %7
#11     after     #1 to #10
Basic Block #4
#19     before    #20                   Dependency on %8
#20     after     #16 to #19
Basic Block #5
#21                        before    #25    Dependency on %11
#22, #23, #24, #25, #26    before    #27    Dependency on %sp
#28                        after     #21 to #27
To explain how to interpret these tables, consider instructions #1 to #7. From the table, we can see that instruction #1 modifies the stack pointer (%sp), which is subsequently used in instructions #2 to #6. Therefore, instruction #1 must always precede those instructions. Instruction #7 updates register %10, so it must be scheduled after the old value has been saved, which is done by instruction #4.
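The reasoning used for instructions #1 to #7 can be sketched as a check on the registers each instruction reads and writes. This is a simplified illustration: the structure `reg_usage` and the function `must_stay_after` are hypothetical, only register dependencies are modelled (not memory or control dependencies), and %sp is assumed to be register 30, as its bit pattern 11110 suggests.

```c
#include <stdint.h>

/* Registers read and written by one instruction, one bit per register
 * (bit n set means register n). Illustrative only: memory and control
 * dependencies are not modelled. */
struct reg_usage {
    uint32_t reads;
    uint32_t writes;
};

/* Returns 1 if 'later' must stay after 'earlier' in the schedule. */
int must_stay_after(struct reg_usage earlier, struct reg_usage later)
{
    if (earlier.writes & later.reads)    /* later uses a value earlier produces    */
        return 1;
    if (earlier.reads & later.writes)    /* later overwrites a value earlier uses  */
        return 1;
    if (earlier.writes & later.writes)   /* both write: the later value must win   */
        return 1;
    return 0;
}
```

With this model, instruction #1 (which writes %sp) must precede instruction #4 (which reads %sp), and instruction #4 (which reads %10) must precede instruction #7 (which writes %10), whereas two stores of different registers through %sp are not ordered with respect to each other.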
Any order of the instructions that respects these dependencies will have the same behaviour as the original code. We aim to find an order with the minimum overall Hamming distance. For example, one possible choice is:
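Whether a proposed order respects a set of dependencies can be checked mechanically. The sketch below is illustrative only; the structure `dependency` and the function `order_respects_dependencies` are names assumed here, and instructions are identified by their zero-based position in the original block.

```c
#include <stddef.h>

/* One ordering constraint: instruction 'first' must be placed before
 * instruction 'second' (both are indices into the original block). */
struct dependency {
    size_t first;
    size_t second;
};

/* order[k] is the original index of the instruction placed at slot k.
 * Returns 1 if every dependency is respected, otherwise 0. */
int order_respects_dependencies(const size_t *order, size_t n,
                                const struct dependency *deps, size_t ndeps)
{
    size_t d, k;

    for (d = 0; d < ndeps; d++) {
        size_t pos_first = n, pos_second = n;

        for (k = 0; k < n; k++) {
            if (order[k] == deps[d].first)
                pos_first = k;
            if (order[k] == deps[d].second)
                pos_second = k;
        }
        /* Reject if either instruction is missing or they are misordered. */
        if (pos_first == n || pos_second == n || pos_first >= pos_second)
            return 0;
    }
    return 1;
}
```

For instructions #1 to #7 the relevant constraints are that #1 precedes #2 to #6 and that #4 precedes #7. An order that places #7 ahead of #4 is rejected, while any order that keeps #1 first and #7 after #4 is accepted.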
Instruction             Bit pattern                             Hamming distance
addi  %sp, %sp, #-20    X.00100.11110.11110.1111111111101100
st.w  16(%sp), %lr      X.10101.11110.11111.0000000000010000    17
st.w  0(%sp), %8        X.10101.11110.01000.0000000000000000    5
st.w  4(%sp), %9        X.10101.11110.01001.0000000000000100    2
st.w  12(%sp), %11      X.10101.11110.01011.0000000000001100    2
st.w  8(%sp), %10       X.10101.11110.01010.0000000000001000    2
mov   %10, %0           X.01100.00000.01010.0000000000000000    8
The function of this code has not changed, but the total Hamming distance has been reduced from 42 to 36 (a saving of about 14%). There is no reduction in code performance as a result of this alternative scheduling.
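The two totals can be reproduced with a short C sketch. The hexadecimal encodings below are derived from the bit patterns of the seven instructions with the unused top bit taken as 0, which is an assumption; the function and array names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions in which a and b differ. */
static unsigned hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    unsigned count = 0;

    while (diff != 0) {
        diff &= diff - 1;    /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Sum of Hamming distances between consecutive words. */
unsigned total_transitions(const uint32_t *words, size_t n)
{
    unsigned total = 0;
    size_t i;

    for (i = 1; i < n; i++)
        total += hamming_distance(words[i - 1], words[i]);
    return total;
}

/* First seven instructions in their original order. */
const uint32_t original_order[7] = {
    0x13DEFFECu,  /* addi %sp, %sp, #-20 */
    0x57C80000u,  /* st.w 0(%sp), %8     */
    0x57C90004u,  /* st.w 4(%sp), %9     */
    0x57CA0008u,  /* st.w 8(%sp), %10    */
    0x57CB000Cu,  /* st.w 12(%sp), %11   */
    0x57DF0010u,  /* st.w 16(%sp), %lr   */
    0x300A0000u   /* ori  %10, %0, #0    */
};

/* The same instructions in the alternative order. */
const uint32_t rescheduled_order[7] = {
    0x13DEFFECu, 0x57DF0010u, 0x57C80000u, 0x57C90004u,
    0x57CB000Cu, 0x57CA0008u, 0x300A0000u
};
```

Here `total_transitions(original_order, 7)` gives 42 and `total_transitions(rescheduled_order, 7)` gives 36.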
Any of the existing code scheduling algorithms can be adapted to perform this task. Instead of maximising functional unit usage, the ‘cost’ function is modified to be the Hamming distance to the neighbouring instructions.
One such algorithm to perform this task is as follows:
    1. Split the code into ‘basic blocks’ (sections of code that end at a change in execution path, with a new block also starting at each label).
    2. For each basic block in turn, create a set of all the instructions that are contained in that basic block. This is the block set, B.
    3. Determine the dependencies between the instructions in B.
    4. Find the set of candidate instructions, C: those instructions in B that do not depend on any other instruction in B. These are the instructions that could be scheduled next without changing the behaviour of the application.
    5. Select the instruction from C that has the smallest Hamming distance from the previously output instruction (which may actually be in the previous basic block). Output this instruction and remove it from B.
    6. Repeat from stage 3 until B is empty.
    7. Repeat from stage 2 until all of the basic blocks have been processed.
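Stages 3 to 6 can be sketched in C for a single basic block. This is a minimal sketch under stated assumptions, not a definitive implementation: the dependencies are supplied by the caller as a flag matrix, ties are broken in favour of the earlier instruction, unused bits are not masked out, and the names `schedule_block`, `must_precede` and `MAX_BLOCK` are introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCK 64   /* illustrative upper bound on block size */

/* Number of bit positions in which a and b differ. */
static unsigned hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    unsigned count = 0;

    while (diff != 0) {
        diff &= diff - 1;
        count++;
    }
    return count;
}

/* Greedy scheduling of one basic block.
 *   words        encoded instructions of the block, in original order
 *   n            number of instructions (at most MAX_BLOCK)
 *   must_precede n*n flags; must_precede[i*n + j] != 0 means that
 *                instruction i has to be output before instruction j
 *   prev         the previously output instruction word
 *   order        receives the chosen order as original indices
 * Returns the sum of Hamming distances along the chosen order,
 * including the step from prev to the first instruction. */
unsigned schedule_block(const uint32_t *words, size_t n,
                        const unsigned char *must_precede,
                        uint32_t prev, size_t *order)
{
    unsigned char done[MAX_BLOCK] = {0};
    unsigned total = 0;
    size_t out, i, j;

    assert(n <= MAX_BLOCK);
    for (out = 0; out < n; out++) {
        size_t best = n;            /* n means "no candidate yet" */
        unsigned best_cost = 0;

        for (j = 0; j < n; j++) {
            int ready = !done[j];

            for (i = 0; ready && i < n; i++)
                if (!done[i] && must_precede[i * n + j])
                    ready = 0;      /* j still waits on i */
            if (ready) {
                unsigned cost = hamming_distance(prev, words[j]);
                if (best == n || cost < best_cost) {
                    best = j;
                    best_cost = cost;
                }
            }
        }
        assert(best != n);          /* fails only on cyclic dependencies */
        done[best] = 1;
        order[out] = best;
        total += best_cost;
        prev = words[best];
    }
    return total;
}
```

Run on instructions #1 to #7 with the constraints that #1 precedes #2 to #6 and #4 precedes #7 (and with `prev` set equal to the first instruction, so that the initial step costs nothing), this sketch produces the order #1, #4, #2, #3, #5, #6, #7 with a total distance of 38. A greedy choice of this kind is not guaranteed to find the minimum.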
To illustrate a problem with this algorithm, consider the following pseudo-code:
        instruction #1
    L1: instruction #2
        instruction #3
        branch L1
        instruction #4
The choice of instruction #4 is made on the basis of the branch instruction. Similarly, the choice of instruction #2 is made on the basis of instruction #1. This fails to take into account the dynamic behaviour of the code.
The difficulty arises when considering instruction #2. The simple algorithm described will choose instruction #2 based on instruction #1. But as we are dealing with a loop, instruction #2 is likely to be fetched more often after the backwards branch than after instruction #1. Therefore, it should be chosen so that it minimises the Hamming distance from the branch instruction.
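One way to take this dynamic behaviour into account is to weight each possible predecessor of an instruction by how often that path is followed. The sketch below assumes that such execution counts are available, for example from profiling; that is an assumption made here and not something the text specifies. The structure `predecessor` and the function `dynamic_cost` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions in which a and b differ. */
static unsigned hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    unsigned count = 0;

    while (diff != 0) {
        diff &= diff - 1;
        count++;
    }
    return count;
}

/* One way an instruction can be reached at run time. */
struct predecessor {
    uint32_t word;         /* instruction fetched immediately before        */
    unsigned long count;   /* how often this path is taken (assumed known)  */
};

/* Total number of bit transitions caused by fetching 'candidate',
 * summed over all the ways it can be reached. */
unsigned long dynamic_cost(uint32_t candidate,
                           const struct predecessor *preds, size_t npreds)
{
    unsigned long cost = 0;
    size_t i;

    for (i = 0; i < npreds; i++)
        cost += preds[i].count * hamming_distance(preds[i].word, candidate);
    return cost;
}
```

For a loop head reached once from the preceding instruction and 99 times from the backwards branch, a candidate that matches the branch encoding costs far less than one that matches the preceding instruction, so a scheduler using this cost would favour the former.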
For processors that have a branch shadow (one or more instructions that are fetched from the location(s) sequentially after a branch, even if the branch is taken), instruction #2 should be chosen on the basis of the last instruction fetched in the shadow. For a single-instruction shadow, this would be instruction #4.
We have described the known process of instruction scheduling. In the past, such rescheduling has been done at the compilation stage. However, from the point of view of reducing Hamming distance, this suffers from the disadvantage that at the compilation stage not all symbols have been fully resolved.