Generally, modern instruction set architectures build addresses for reading or writing memory by using a general-purpose register as the base, and then possibly adding or subtracting scaled values of other registers and/or immediate values specified in the instruction to obtain a final address. This address is then used to access the memory. Thus, on the x86 in 32-bit mode, the instruction mov eax, [1+ebx*4+edi] would add the contents of register ebx multiplied by 4 to the contents of register edi, add 1, and then load the contents of the memory at that address into register eax.
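The address arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name effective_address, its parameters, and the register values in the usage lines are assumptions made for the example, and the list of allowed scale factors follows x86 rather than any general rule.

```python
def effective_address(base, index=0, scale=1, displacement=0, width_bits=32):
    """Return base + index*scale + displacement, wrapped to the address width."""
    # Scale factors permitted by x86; other architectures may differ (assumption).
    if scale not in (1, 2, 4, 8):
        raise ValueError("unsupported scale factor")
    mask = (1 << width_bits) - 1
    return (base + index * scale + displacement) & mask


# Hypothetical register contents: base register = 0x2000, index register = 0x10.
addr = effective_address(base=0x2000, index=0x10, scale=4, displacement=1)
print(hex(addr))  # 0x2000 + 0x10*4 + 1 = 0x2041
```

The final mask models the wrap-around that occurs when the sum exceeds the width of the address.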
Some older architectures specialized the usage of registers, so that not all registers may participate in all kinds of address computations. However, the trend has been to make instruction sets orthogonal, so that all registers of a particular kind may be used interchangeably in address computation.
Some architectures, the most prominent being the Motorola 68000, had a separate register file (i.e., group of registers) whose primary purpose was to be the base for address computations. In the Motorola 68000, there were two main kinds of registers, data and address. The 8 data registers were used for most computations. The 8 address registers were used as base addresses for address computation. Only a few other operations could be performed on address registers directly (mostly add, subtract and compare); more complicated operations would require the values to be copied to the data registers, and the result copied back.
In modern processors, the address that is generated is a virtual address; the address does not correspond to a real memory location. Instead, the address first goes through a remapping process where the virtual address is translated to a real address. There are many techniques to do this. The techniques that are most commonly used involve the use of pages and translation look-aside buffers (TLBs).
In paging, the real address space is divided into pages; their size is typically a power of 2, such as 4 KB, and they are aligned on the page size. Assuming 4 KB pages, addresses 0x0000 to 0x0fff are page 0, 0x1000 to 0x1fff are page 1, and so on. The virtual address space of each process is similarly partitioned. Each virtual page is mapped to a real page. If virtual page 4 is mapped to real page 1, virtual addresses 0x4000 to 0x4fff will map to real memory addresses 0x1000 to 0x1fff.
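The page arithmetic can be made concrete with a minimal sketch. The dictionary page_table and the function translate are hypothetical names chosen for this example; a real page table is a hardware- or OS-defined structure, not a Python dictionary.

```python
PAGE_SIZE = 4096   # 4 KB pages, as in the text
OFFSET_BITS = 12   # log2(PAGE_SIZE)

# Hypothetical mapping: virtual page number -> real page number.
page_table = {4: 1}


def translate(vaddr, table=page_table):
    """Split a virtual address into page number and offset, then remap the page."""
    vpn = vaddr >> OFFSET_BITS            # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)      # position within the page
    if vpn not in table:
        raise KeyError(f"no mapping for virtual page {vpn}")
    return (table[vpn] << OFFSET_BITS) | offset


print(hex(translate(0x4000)))  # 0x1000
print(hex(translate(0x4fff)))  # 0x1fff
```

Only the page number is remapped; the offset within the page is carried over unchanged.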
There are multiple techniques for maintaining the full mapping between the virtual pages of the processes that are executing and the real pages of the processor. A cache of a subset of these mappings is generally kept in the processor. This cache is called the TLB (or translation look-aside buffer). The TLB is generally implemented as an N-way associative cache (typically N=1, 2 or 4), indexed by the page number of the virtual address.
After the load/store address is determined, that virtual address is translated using the TLB. If the page of the address is not in the TLB, special actions need to be taken. This may involve raising an exception in the processor, causing a special piece of software called the TLB miss handler to be invoked, that typically brings the mapping for the virtual page being accessed into the TLB. Alternatively, this TLB miss may be handled entirely or partially in hardware. In either case, after the mapping is added to the TLB, the memory access is re-tried.
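The lookup, miss, and retry sequence might be modelled as follows. This sketch assumes a direct-mapped TLB (N=1) and a software miss handler; the class DirectMappedTLB, its fields, and the dictionary used as the full page mapping are all illustrative assumptions rather than a description of any particular processor.

```python
OFFSET_BITS = 12  # 4 KB pages


class DirectMappedTLB:
    def __init__(self, num_entries, page_table):
        self.entries = [None] * num_entries  # each slot holds (vpn, rpn) or None
        self.page_table = page_table         # stand-in for the full mapping
        self.misses = 0

    def _miss_handler(self, vpn):
        """Bring the mapping for vpn into the TLB, evicting whatever was in its slot."""
        rpn = self.page_table[vpn]           # a KeyError here stands in for a page fault
        self.entries[vpn % len(self.entries)] = (vpn, rpn)
        self.misses += 1

    def translate(self, vaddr):
        vpn = vaddr >> OFFSET_BITS
        offset = vaddr & ((1 << OFFSET_BITS) - 1)
        while True:
            entry = self.entries[vpn % len(self.entries)]
            if entry is not None and entry[0] == vpn:
                return (entry[1] << OFFSET_BITS) | offset  # TLB hit
            self._miss_handler(vpn)                        # miss: fill, then retry


tlb = DirectMappedTLB(num_entries=4, page_table={4: 1, 8: 7})
first = tlb.translate(0x4010)    # miss, then hit on retry
second = tlb.translate(0x4010)   # hit
```

In this model virtual pages 4 and 8 compete for the same slot, so alternating between them would miss every time; higher associativity reduces such conflicts.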
In modern processors, under normal operation, a load or store will attempt to look for the data corresponding to that address in a data cache. There can be more than one level of cache in the processor; if so, the first level cache will be probed for the address. If the address is there (a cache hit), then the value is returned (in the case of a load) or written (in the case of a store). If not (a cache miss), then the second level of the cache is examined, and so on until real memory is potentially reached. Processing a cache miss may or may not cause the address to be added to the earlier cache levels; this varies between implementations.
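A load through a cache hierarchy can be sketched as a loop over the levels. The function cached_load is a hypothetical name, each cache level is modelled as an unbounded dictionary (so there is no eviction), and the fill policy shown, which copies the value into every level that missed, is just one of the possible choices.

```python
def cached_load(addr, cache_levels, memory):
    """Probe each level in order; return (value, level_that_hit or None for memory)."""
    hit_level = None
    for level, cache in enumerate(cache_levels, start=1):
        if addr in cache:
            hit_level = level
            value = cache[addr]
            break
    if hit_level is None:
        value = memory[addr]                     # missed in every cache level
        missed = cache_levels
    else:
        missed = cache_levels[:hit_level - 1]    # levels probed before the hit
    for cache in missed:
        cache[addr] = value                      # assumed fill policy
    return value, hit_level


l1, l2 = {}, {0x100: 7}
memory = {0x100: 7, 0x200: 9}
r1 = cached_load(0x100, [l1, l2], memory)   # misses in L1, hits in L2
r2 = cached_load(0x100, [l1, l2], memory)   # now hits in L1
r3 = cached_load(0x200, [l1, l2], memory)   # misses everywhere, served by memory
```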
If the cache is probed using the real address, the cache is called a real addressed cache. Alternatively, the processor may choose to use virtual addresses for some of the caches, generally the first level cache. In that case, the cache is called a virtually addressed cache. A virtually addressed cache has the benefit of not requiring the translation to be performed. However, there is a drawback with virtually addressed caches. It is possible for multiple virtual addresses, even within the same process, to refer to the same real address. This is known as virtual aliasing. Consider the case where two different virtual addresses map to the same real address. If the process performs a store using one address, and then reads the same real address using the other virtual address, and both virtual addresses are in the cache, the read will (erroneously) not see the write. There are techniques to correct for virtual aliasing, but they add complexity and are expensive, so it is preferable to use real addresses.
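The aliasing problem can be reproduced with a toy model. Everything here is an assumption made for illustration: the cache is a dictionary keyed by virtual address, it is write-back (a store updates only the cached copy), and the names mapping, vcache, load, and store are hypothetical.

```python
real_memory = {0x1000: 0}
# Two virtual addresses that alias the same real address.
mapping = {0x4000: 0x1000, 0x8000: 0x1000}
vcache = {}  # virtually addressed cache: virtual address -> cached value


def load(vaddr):
    if vaddr not in vcache:                       # miss: fetch through the mapping
        vcache[vaddr] = real_memory[mapping[vaddr]]
    return vcache[vaddr]


def store(vaddr, value):
    vcache[vaddr] = value                         # write-back: memory not updated yet


load(0x4000)
load(0x8000)             # both aliases are now cached
store(0x4000, 42)        # write through the first alias
stale = load(0x8000)     # read through the second alias
print(stale)             # 0: the read does not see the write
```

A real addressed cache would hold a single entry for 0x1000, so both accesses would reach the same cached copy.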
At first glance, it would appear that using a real addressed cache is slower than a virtually addressed cache, since the address needs to be translated before the cache is accessed. However, there are techniques available that allow the translation to proceed in parallel with the lookup. This may hide most of the delay associated with the translation, but at the cost of additional power and area.
Instruction execution on a standard von Neumann-style architecture is built around the idea of a program counter (also known as the instruction pointer or instruction counter). The model for program execution is that the processor loads the instruction stored in memory at the address in the program counter (abbreviated to PC) and executes it. As part of the instruction execution, the PC is modified. The process is then repeated.
Based on how the PC is modified, instructions may be classified in many ways. The criteria may include how the next address is specified; whether the instruction can specify one or multiple possible next addresses; and the intended use, possibly with side effects.
The most common instructions are fall-through instructions: the new PC will point to the next instruction in memory. For architectures with fixed-length instructions, such as 32-bit (4-byte) RISC architectures, this may be written as PC←PC+4. For architectures with variable-length instructions, the program counter generally addresses bytes, but the distance to the next instruction is variable. One may write PC←PC+N, where N is the number of bytes in the current instruction.
Other instructions which may set the PC to values other than the next instruction address are called branch instructions. They may be categorized in different ways. One is how the next address is calculated.
The most straightforward way of setting the next PC value is to have the new address as part of the instruction. These kinds of branches are called absolute branches. If A is the address specified in the instruction, this would be written as: PC←A
Many earlier architectures had absolute addressing. However, as memory sizes grew larger, this form of branching would have required larger instructions. For instance, with 4-byte addresses, the branch instructions would have required 4 bytes to specify the new PC value. In practice, most branch addresses are fairly close to the current address. So, more modern architectures use relative branches; the instruction specifies the offset or displacement from the PC of the instruction to the next instruction to be executed. If D is the displacement specified in the instruction, the new computation is expressed as: PC←PC+D
An alternative source for the address of the next PC value is the contents of some other register. In register indirect branches, the instruction specifies a register in the architecture, and the PC is set to the value of that register. If R is the register, and (R) is the contents of that register, then this may be written as: PC←(R)
There are also memory indirect branches; these branches compute an address in memory, and set the PC to the value stored at that address. There are multiple ways of computing the memory address; for instance, the instruction could specify a register R and a displacement D, and use those to compute the memory address. In that case, the new PC would be computed as: PC←memory[(R)+D]
There are other means of specifying the next PC address, such as register relative indirect (where the PC is set to the contents of a register plus a displacement) and chained memory (a form of memory indirect where a bit in the loaded memory indicates that the processor should use the contents of the memory as an address, and load from that address to get the next PC).
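The fall-through case and the four basic branch forms described above (absolute, relative, register indirect, and memory indirect) can be collected into one illustrative function. The name next_pc, the string tags for each kind, and the dictionaries standing in for the register file and memory are assumptions of this sketch; the parameter names A, D, and R follow the symbols used in the text.

```python
def next_pc(kind, pc, *, length=4, A=None, D=None, R=None, regs=None, memory=None):
    """Compute the next PC for one instruction, following the formulas in the text."""
    if kind == "fall_through":
        return pc + length            # PC <- PC + N
    if kind == "absolute":
        return A                      # PC <- A
    if kind == "relative":
        return pc + D                 # PC <- PC + D
    if kind == "register_indirect":
        return regs[R]                # PC <- (R)
    if kind == "memory_indirect":
        return memory[regs[R] + D]    # PC <- memory[(R) + D]
    raise ValueError(f"unknown instruction kind: {kind}")


# Hypothetical machine state for the usage lines.
regs = {"r1": 0x2000}
memory = {0x2008: 0x3000}

print(hex(next_pc("relative", 0x100, D=-16)))                                   # 0xf0
print(hex(next_pc("memory_indirect", 0x100, R="r1", D=8, regs=regs, memory=memory)))  # 0x3000
```

Real instruction sets differ on details, for example whether a relative displacement is measured from the branch itself or from the following instruction; the sketch follows the formula given in the text.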
Branches may be unconditional, where there is only one possible target.
In conditional branches, generally, a condition is evaluated, and based on that condition, one of several possible addresses is picked for storing into the PC. Generally, on modern architectures, there are only two possibilities, and one of them is the fall-through address (i.e. the next sequential instruction). Assuming a fixed 4-byte instruction width, a conditional relative branch would be written as:
if (cond) PC←PC+D else PC←PC+4
One variant of conditional branch is called a skip; in this case, the two choices are the next instruction and the next-to-next instruction. So, based on the condition, the next instruction is either executed or skipped, hence the name of this class of conditional branch instructions. For a fixed 4-byte instruction architecture, the skip would be written as:
if (cond) PC←PC+8 else PC←PC+4
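Both rules translate directly into code. In this sketch the function names conditional_branch and skip are hypothetical, and the default instruction length of 4 bytes matches the fixed-width assumption used in the text.

```python
def conditional_branch(pc, cond, displacement, length=4):
    """Taken: PC <- PC + D. Not taken: fall through to the next instruction."""
    return pc + displacement if cond else pc + length


def skip(pc, cond, length=4):
    """Condition true: jump over the next instruction. Otherwise execute it."""
    return pc + 2 * length if cond else pc + length


print(hex(conditional_branch(0x100, True, 0x40)))   # 0x140
print(hex(conditional_branch(0x100, False, 0x40)))  # 0x104
print(hex(skip(0x100, True)))                       # 0x108
```

A skip is simply a conditional relative branch whose displacement is fixed at two instruction lengths, which is why it needs no displacement field.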
There are more complex conditional instructions that can specify multiple possible next addresses, such as the CASE instruction in the VAX-11, or the CAS instruction on the IBM 704, which skipped zero, one or two instructions.
Conditional branches may be divided into two categories, based on how the condition is specified in the instruction. In the test-and-branch type of instructions, the branch instruction examines a few bits (generally one or two) of a register and branches based on that result. Usually, the bits come from a condition code or flag register that stores the status of some previous operation, typically a comparison. Thus, on the x86, to compare two values and branch if they are equal, the following instruction sequence would be used:
cmp ecx, edx    ; the two values are stored in the ecx and edx registers
je L1
L0:             ; fall-through, not equal case
. . .
L1:             ; equal case
. . .
Alternatively, in the compare-and-branch instructions, the comparison is specified as part of the branch instruction. The equivalent code sequence on the MIPS architecture would be written as:
beq $t0, $t1, L1    # the two values are stored in $t0 and $t1
L0:                 # fall-through, not equal case
. . .
L1:                 # equal case
. . .
The trade-off between these two forms of branch instructions is the number of bits required to specify a branch. In the first case, the instruction set uses a small number of bits to specify the bit(s) to be examined, and the rest of the bits in the instruction may be used to specify displacements or other sources of the next address. In the second case, the instruction has to specify the details of the comparison operation, which generally requires a larger number of bits.
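The difference between the two styles can be modelled in a few lines. This is a simplified sketch: the class FlagsCPU with its single zero_flag, and the functions cmp, je, and beq, are hypothetical models named after the instructions above, and a fixed 4-byte instruction length is assumed for both styles even though x86 instructions vary in length.

```python
class FlagsCPU:
    """Test-and-branch style: a compare sets a flag, and the branch tests it."""

    def __init__(self):
        self.zero_flag = False

    def cmp(self, a, b):
        self.zero_flag = (a - b) == 0        # record the result of the comparison

    def je(self, pc, displacement, length=4):
        return pc + displacement if self.zero_flag else pc + length


def beq(pc, a, b, displacement, length=4):
    """Compare-and-branch style: the comparison is part of the branch itself."""
    return pc + displacement if a == b else pc + length


cpu = FlagsCPU()
cpu.cmp(5, 5)
taken_flags = cpu.je(0x100, 0x20)            # two instructions: cmp, then je
taken_fused = beq(0x100, 5, 5, 0x20)         # one instruction carrying both operands
```

The branch in the first style needs only to name a flag, whereas beq must carry two register operands along with the displacement.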
A major source of branches in programs is subroutine calls and returns. Generally, instruction sets have included specialized support for these kinds of branches. The support may be quite elaborate. On the VAX-11, the CALLG/CALLS/RET instructions do all actions needed to set up and tear down a frame, including setting up the stack and frame registers, as well as saving the return address and returning to the instruction after the CALLG/CALLS.
Minimally, on a modern architecture, a call instruction will save the address of the instruction after the call and branch to the subroutine. The return address may be saved in memory (on the stack), in a dedicated register (generally called a link register), or in a more general purpose register specified by the call instruction. A return instruction branches to that saved address. If the address is stored in a general purpose register, and the architecture has branch indirect instructions that can branch through those registers, then there may be no specialized return instruction in the architecture, with a return being performed using a regular branch indirect instruction.
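Two of these conventions, a link register and a return address stored on the stack, can be sketched as follows. The function names, the register name "lr", and the use of a Python list as the stack are assumptions made for illustration only.

```python
def call_link(pc, target, regs, link="lr", length=4):
    """Save the return address in a link register, then branch to the subroutine."""
    regs[link] = pc + length
    return target                     # new PC


def ret_link(regs, link="lr"):
    """Return is just a register indirect branch through the link register."""
    return regs[link]


def call_stack(pc, target, stack, length=4):
    """Save the return address on the stack instead."""
    stack.append(pc + length)
    return target


def ret_stack(stack):
    return stack.pop()


regs = {}
pc = call_link(0x100, 0x800, regs)    # pc is now 0x800, regs["lr"] is 0x104
pc = ret_link(regs)                   # pc is back at 0x104
```

With a single link register, a subroutine that itself makes a call must first save the register's contents elsewhere; the stack variant nests without that extra step.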
A processor fetches a sequence of instructions. When a branch instruction is fetched, the processor must determine the next address to fetch. If the processor waits until the branch is evaluated, and all details about the branch target are known, it could be several cycles later. Consequently, high-performance processors try to guess what the next target of the branch would be. This is known as branch prediction.
For conditional branches, one part of branch prediction determines whether the branch is taken or falls through. There are many known techniques; the state of the art, 2-bit predictors with history, can achieve very high rates of accuracy.
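The core of a 2-bit scheme is a small saturating counter per table entry. The sketch below shows only that core and omits the history component mentioned above; the class name TwoBitPredictor, the table size, the initial counter value, and the indexing by PC bits (which assumes 4-byte instructions) are all illustrative assumptions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.counters = [1] * (1 << table_bits)   # start at "weakly not taken" (arbitrary)

    def _index(self, pc):
        return (pc >> 2) & self.mask              # drop byte offset, assuming 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


p = TwoBitPredictor()
branch_pc = 0x400
p.update(branch_pc, True)
p.update(branch_pc, True)              # counter is now 3: strongly taken
p.update(branch_pc, False)             # one not-taken outcome, e.g. a loop exit
still_taken = p.predict(branch_pc)     # True: one outcome does not flip the prediction
```

Because it takes two consecutive mispredictions to change the prediction, a loop branch that is not taken once at loop exit is still predicted taken when the loop is next entered.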
For taken conditional branches, and for unconditional branches, the processor must also predict or compute the next address. This is more complicated. For a branch-with-displacement, computing the next address involves adding the displacement, typically a 10- to 16-bit number, to the current program counter, typically a 32- or 64-bit value. Computing this may add significant delay to the fetch of the next address. There exist techniques that do not require the full add to complete before fetching the instruction; however, they still add to the cycle time.
There exist structures, such as the next-fetch-address cache, that are basically caches indexed in parallel with the instruction fetch and that return a prediction of the next address to be fetched. Unfortunately, for sizes that are practical to implement, they are not very accurate.
A specialized branch address predictor is the call stack, used to predict the address of returns. This is based on the simple observation that calls and returns are generally matched. Every time a call is encountered, the address after the call instruction (i.e., the return address for that call) is pushed onto the call stack. When a return is encountered, the address at the top of the call stack is predicted to be the target of the return, and the call stack is popped.
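The push-on-call, pop-on-return behaviour is simple enough to sketch directly. The class name ReturnAddressStack, the default depth, the 4-byte call instruction length, and the policy of discarding the oldest entry on overflow are assumptions of this example, not properties of any specific processor.

```python
class ReturnAddressStack:
    """Fixed-depth predictor stack for return targets."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, length=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                  # overflow: drop the oldest entry (assumed policy)
        self.stack.append(call_pc + length)    # address of the instruction after the call

    def predict_return(self):
        """Pop and return the predicted target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)              # outer call
ras.on_call(0x800)              # nested call
inner = ras.predict_return()    # 0x804: return from the nested call
outer = ras.predict_return()    # 0x104: return from the outer call
```

When the call depth exceeds the stack depth, as in deep recursion, the oldest return addresses are lost and the corresponding returns cannot be predicted from this structure.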
Fetching an instruction also involves cache lookup and translation. The TLB for data and instruction access may be the same; however, it is common for there to be a separate instruction TLB (ITLB) and a data TLB (DTLB).
The other difference between instruction fetch and data fetch is that instructions are generally immutable. Consequently, virtual aliasing matters much less, which makes it much more practical for the instruction cache to be virtually addressed.