Modern high-performance CPUs utilize speculative execution paths to improve instruction throughput. One form of speculative execution is branch prediction. Branch prediction enables the processor to begin executing instructions before the true branch path is known. When encountering branching code such as “if x, do foo; else do bar,” the CPU tries to predict what x will be and begins executing foo or bar before x is known. If the CPU predicts correctly, execution continues with the added performance benefit. If the CPU predicts incorrectly, the result of the speculative execution is discarded.
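By way of illustration only, the predict-then-verify behavior described above can be sketched with a two-bit saturating counter, a common textbook prediction scheme. The text above does not say which predictor any particular CPU uses, so the scheme, the function name `simulate_two_bit_predictor`, and the sample branch history are all assumptions made for this sketch.

```python
def simulate_two_bit_predictor(outcomes, initial_state=2):
    """Count correct predictions for a sequence of branch outcomes.

    Counter states 0-1 predict "not taken"; states 2-3 predict "taken".
    This is a textbook scheme chosen for illustration, not a model of
    any specific processor.
    """
    state = initial_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1  # speculation along the predicted path is kept
        # Otherwise the speculative work would be discarded.
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# Hypothetical loop branch: taken nine times, then not taken at loop exit.
loop_branch = [True] * 9 + [False]
hits = simulate_two_bit_predictor(loop_branch)
print(f"{hits} of {len(loop_branch)} predictions correct")
```

In this sketch only the final loop-exit branch is mispredicted, which is the case in which the speculative work would be thrown away.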
Similarly, some CPUs engage in value speculation to generate values that are either predicted or computed using a predicted value.
CPU loads can also be performed speculatively. For example, if foo is being executed speculatively and foo requires a load, that load will be performed along with the speculative execution of the foo branch. Such speculative loads, however, can lead to security risks.
Malicious code may attempt to exploit a CPU's speculative load to gain access to locations in memory that would otherwise be architecturally impermissible. For example, a code block may conditionally request a load from an impermissible memory location. Speculative loading will cause the CPU to load data from this impermissible memory location before determining whether the memory access is permissible. In the normal case, this speculative load (and impermissible memory access) will not be accessible to the underlying code because the speculative load will be rolled back and made inaccessible when the impermissible memory access is detected. But unfortunately, this may not be the end of the story.
Malicious coders can be quite ingenious. Even though a speculative load will be rolled back and made inaccessible when impermissible memory access is detected, it is possible for an attacker to determine the value of a speculative load from an impermissible memory location by adding a second speculative load that is dependent on the value of the first speculative load. For example, a code block may request a load from memory location A if the value of the first speculative load is 0 and from memory location B if the value of the first speculative load is 1. Even after the impermissible memory access is detected and rolled back, it is still possible to determine which of the memory locations A or B was loaded because the loaded location will remain present in the processor's L1 cache. Thus, any subsequent requests to memory locations A or B will reveal, based on the timing of the response, whether those memory locations are present in the L1 cache. If it is revealed that memory location A is present in the L1 cache, the value at the impermissible memory location must have been 0. If memory location B is present in the L1 cache, the value at the impermissible memory location must have been 1. In this way, it is possible to determine (deduce) the value stored at an arbitrary memory location even when access is architecturally impermissible.
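The deduction described above can be illustrated with a small software simulation. This is only a sketch under stated assumptions: the `ToyCache` class, the latencies, and the addresses `LOCATION_A` and `LOCATION_B` are hypothetical, and no real speculation or hardware timing is involved. The simulation models just one fact: the cache fill caused by the second, dependent load survives the rollback.

```python
class ToyCache:
    """A toy cache: a set of resident addresses with two fixed latencies."""

    HIT_CYCLES = 4      # assumed latency of a cache hit
    MISS_CYCLES = 200   # assumed latency of a miss served from memory

    def __init__(self):
        self.resident = set()

    def load(self, address):
        """Return the simulated latency and leave the line resident."""
        latency = self.HIT_CYCLES if address in self.resident else self.MISS_CYCLES
        self.resident.add(address)
        return latency


LOCATION_A = 0x1000  # hypothetical address loaded when the secret is 0
LOCATION_B = 0x2000  # hypothetical address loaded when the secret is 1


def speculative_gadget(cache, secret_bit):
    """Model the second, dependent speculative load.

    Nothing is returned, mirroring the rollback of the architectural
    result, but the cache fill is not undone.
    """
    cache.load(LOCATION_B if secret_bit else LOCATION_A)


def recover_bit(cache):
    """Time both locations and infer the secret from the faster one."""
    time_a = cache.load(LOCATION_A)
    time_b = cache.load(LOCATION_B)
    return 0 if time_a < time_b else 1


def leak(secret_bit):
    cache = ToyCache()  # starts empty, so neither location is cached
    speculative_gadget(cache, secret_bit)
    return recover_bit(cache)


print(leak(0), leak(1))
```

Note that `recover_bit` never receives the secret; it sees only the two load latencies, which is the essence of the timing-based inference.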
For more detailed information concerning such attacks and how they exploit modern computer processors that use cache memory and speculative execution, see for example the following technical articles that are incorporated herein by reference as if expressly set forth:
Lipp et al., “Meltdown,” arXiv:1801.01207 [cs.CR] (2018), published at https://meltdownattack.com/
Kocher et al., “Spectre Attacks: Exploiting Speculative Execution,” arXiv:1801.01203 [cs.CR] (2018), published at https://meltdownattack.com/
Yarom et al., “Flush+Reload: a High Resolution, Low Noise L3 Cache Side-Channel Attack,” USENIX Security Symposium (2014).
In such contexts, the term “side channel” is a general term used to describe methods that are able to derive information about a processor that are outside of the processor's architectural specification. There are many kinds of side channels, including performance counters. Other examples include the processor making different sounds upon executing different instructions. The side channel space thus includes a wide range of differences between the logical architectural specification of the processor as defined by the processor's architects, and the processor's actual implementation as specified by its designers. Like burglars who break into a building through a crawlspace the architects never designed into the structure, it has become common for attackers to exploit—for nefarious purposes—various aspects of processor side channels in ways the processor architects and designers never contemplated or foresaw.
For example, modern processors often have performance metric counters that track how long it takes for a particular memory load to execute. As discussed above, if an attacker can learn how long it took for the data to load, he can sometimes use this information to learn the contents of the data. It is also possible for an attacker to intuit the content of the data itself by determining whether there is a miss in the cache memory. Such attacks can, for example, exploit the shared, inclusive last-level cache. The attacker may frequently flush a targeted memory location. By measuring the time it takes to reload the data, the attacker can determine whether the data was loaded into the cache by another process in the meantime. This is known as one type of “cache attack.”
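The flush, wait, and reload sequence described above can be sketched as a simulation. The `SharedCache` class, the latencies, the threshold, and the target address are assumptions chosen for illustration. A real attack would rely on a hardware flush instruction and a cycle counter, neither of which is modeled here.

```python
HIT_CYCLES = 40      # assumed reload latency when the line is cached
MISS_CYCLES = 300    # assumed reload latency when the line comes from memory
THRESHOLD = 150      # assumed cut-off separating hits from misses


class SharedCache:
    """Toy shared last-level cache tracking which addresses are resident."""

    def __init__(self):
        self.lines = set()

    def flush(self, address):
        """Evict the line if present (stand-in for a hardware flush)."""
        self.lines.discard(address)

    def access(self, address):
        """Return the simulated latency and leave the line resident."""
        latency = HIT_CYCLES if address in self.lines else MISS_CYCLES
        self.lines.add(address)
        return latency


def flush_reload(victim_activity, target=0x7F00):
    """Guess, per round, whether the victim touched the target line."""
    cache = SharedCache()
    observed = []
    for victim_touches in victim_activity:
        cache.flush(target)                 # 1. flush the targeted location
        if victim_touches:                  # 2. another process may load it
            cache.access(target)
        reload_time = cache.access(target)  # 3. reload and time the access
        observed.append(reload_time < THRESHOLD)
    return observed


# Hypothetical victim behavior over four rounds.
activity = [True, False, False, True]
print(flush_reload(activity))
```

In this sketch a fast reload means another process brought the line back into the cache during the waiting interval, so the attacker recovers the victim's access pattern without reading any victim data.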
In more detail, FIG. 1 shows an example simplified scenario in which a memory 10 stores a secret 12. FIG. 1 shows speculative execution the attacker is controlling (i.e., access one memory location if the secret value is zero, and access a different memory location if the secret value is one). FIG. 1 thus further shows that a first cache line 14 is written into cache memory 18 if the result of the speculative execution is zero, and a second cache line 16 is written into the cache memory 18 if the result of the speculative execution is one. The attacker can then use a side channel to detect which of the cache lines (the first cache line 14 or the second cache line 16) is present. From the result of this detection, the attacker can derive whether the secret 12 is zero or one. The side channel attack thus permits the attacker to detect which cache line is present without actually reading either cache line (reading either cache line would generate an exception because the memory access would be privileged), and to learn the value of the secret without actually reading the secret value from memory (which the processor architecture typically successfully prohibits).
The general problem of unexpected data observation as a result of hardware speculation by the processor is very difficult to resolve. Correctly speculating which data the processor is going to access is a very large source of performance, with a wide variety of methods developed without regard to timing attacks. Attempting to enumerate all cases where the hardware speculates due to secret data and performs some observable timing effect is an intractable problem. Trying to eliminate all possible side channels is also intractable.
Prior attempted solutions have used explicit software barriers, for example ARM's ISB/DSB and x86's LFENCE. Unfortunately, this approach runs into three problems: the barriers are expensive in performance, software generally doesn't know where to put them, and their use tends to be architecture-specific, causing headaches when implementations implement barriers differently with regard to speculation.
It is also common to use some combination of the physical address, virtual address, ASID (address space identifier), VMID (virtual machine ID), Exception Level (EL) or Privilege Level hashed in some way to form the final branch predictor index and tag. Such a hash won't avoid cross talk between mismatched ASID, VMID, ELs due to aliasing in the hash function, which a sufficiently informed attacker could exploit. Note that some types of virtual address aliasing may comprise a degenerate case of this type of cross talk, and as such the stated attack is unlikely to be contained to those implementations which have a virtually-indexed/tagged branch predictor. Demanding that all bits be used as part of branch predictor tags is die-area onerous, as it adds extra bits of tag storage to each entry in every branch predictor structure.
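The aliasing concern can be illustrated with a toy index function. The XOR-folding hash, the 10-bit index width, and the addresses and ASIDs below are assumptions made for this sketch; actual predictor hash functions are implementation-specific and generally undocumented. Only the index is modeled, not the tag.

```python
INDEX_BITS = 10  # assumed predictor with 2**10 entries


def predictor_index(virtual_address, asid):
    """Fold the address and ASID into INDEX_BITS bits with XOR."""
    mask = (1 << INDEX_BITS) - 1
    value = (virtual_address >> 2) ^ asid  # ignore the two lowest address bits
    index = 0
    while value:
        index ^= value & mask
        value >>= INDEX_BITS
    return index


def find_alias(victim_va, victim_asid, attacker_asid):
    """Search for an attacker address that maps to the victim's index."""
    target = predictor_index(victim_va, victim_asid)
    for candidate in range(0, 4 << INDEX_BITS, 4):
        if predictor_index(candidate, attacker_asid) == target:
            return candidate
    return None


# Hypothetical victim branch address and ASIDs.
alias = find_alias(0x400ABC, victim_asid=5, attacker_asid=9)
print(hex(alias))
```

Because many more (address, ASID) pairs exist than predictor entries, some attacker-controlled address in a different address space will share an entry with the victim's branch. With this simple hash, a short linear search is enough to find one.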
For protection of current implementations, it is possible to use the ARMv8 architecture Device-nGnRnE or Intel UC memory to store secrets which the programmer wishes to be hidden from these types of attacks. In particular, it is highly likely that secret data stored in Device-nGnRnE memory (nGnRnE = non-gathering, non-reordering, no early write acknowledgement) is completely immune to all variants of this basic attack on existing processors. It is illegal to speculate into, or speculatively remove, accesses to such Device-nGnRnE memory, and thus it would be extremely difficult to build a compliant implementation which leaks Device-nGnRnE data in the manner described.