Computing systems are able to concurrently process multiple tasks. Concurrent processing of multiple tasks can be effected in various ways. To name a few, different processors (such as different processing cores on a same semiconductor chip, or different processors implemented on different semiconductor chips) may execute their own respective threads over a same expanse of time. In a more fine-grained fashion, a multi-threaded processing core and/or instruction execution pipeline may concurrently execute different threads.
The ability to concurrently execute different threads leads to additional complexities when the processes of the different threads may use and/or rely on the same item of data. For example, if one thread changes an item of data, the system needs to ensure that another thread that seeks to access the same item of data will be provided the latest, updated version of the data rather than a stale, previous version of the data.
Locking is a technique that computing systems have traditionally used to handle operations by one thread that could affect the processes of other threads. Locking is a primitive of guaranteed system behavior that can be applied to the execution of a particular instruction. Specifically, for any instruction of a particular thread that is declared as a “locked” instruction, the system guarantees, through its locking protocol, that the effects of the instruction (such as a change made to a data item) are visible at once to other threads within the system. As such, threads that did not execute the instruction (but could nevertheless be impacted by the instruction) can, ideally, equally observe the effects of the instruction. Such behavior is described by those of ordinary skill as “the atomicity of a locked instruction”.
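As a software-level analogy only, the following minimal sketch shows the kind of guarantee that atomicity provides: each read-modify-write of a shared data item completes as a single indivisible step, so no thread observes or overwrites a stale value. The sketch uses a mutex from Python's standard `threading` module rather than a hardware locked instruction, and the names `SharedCounter` and `run_workers` are hypothetical.

```python
import threading


class SharedCounter:
    """A shared data item whose updates are made indivisible by a mutex.

    This is an analogy for a locked instruction, not the hardware mechanism.
    """

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read, add, and write happen while holding the lock, so no
        # other thread can interleave between them.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_workers(num_threads=4, increments_per_thread=10_000):
    """Run several threads that all update the same counter."""
    counter = SharedCounter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, `run_workers(4, 10_000)` returns exactly the number of increments performed, because no update is lost to another thread working from a stale copy.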
Two ways to effect the atomicity of a locked instruction in the behavior of modern-day processors and processing cores are: i) cache locks; and, ii) bus locks. For simplicity, hereafter, the term “processing core” will be used to refer to a processor or a processing core. Processing cores are understood to include a local cache. When an operation is to be performed on an item of data by one of the threads supported by a processing core, the processing core looks to its local cache before looking to system memory for the item of data. Items of data are organized into a cache through the use of “cache lines”. A cache line typically includes more than one separately addressable item of data. In general, cache lock execution is much faster than bus lock execution, so cache locks deliver higher performance than bus locks.
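The organization of data items into cache lines can be illustrated with a minimal sketch. The 64-byte line size and the function names are assumptions made for illustration; actual line sizes and address mappings depend on the processor.

```python
CACHE_LINE_SIZE = 64  # bytes; assumed for illustration, real sizes vary


def cache_line_index(address, line_size=CACHE_LINE_SIZE):
    """Return the index of the cache-line-sized block containing address."""
    return address // line_size


def offset_in_line(address, line_size=CACHE_LINE_SIZE):
    """Return the byte offset of address within its cache line."""
    return address % line_size


def same_cache_line(addr_a, addr_b, line_size=CACHE_LINE_SIZE):
    """Return True if two addresses fall within the same cache line."""
    return cache_line_index(addr_a, line_size) == cache_line_index(addr_b, line_size)
```

Under these assumptions, addresses `0x1000` and `0x103F` share a cache line, whereas `0x103F` and `0x1040` do not, which shows how several separately addressable items can reside in one line.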
In the case of a cache lock, a thread that executes a locked instruction is given full ownership of the sought-for data item's cache line as part of the guaranteed system behavior. If the data item is not found in the cache, a bus lock will commence. Alternatively, a bus lock will commence if the address of the sought-for item of data crosses a cache line boundary. In this case, a cache snoop is not even attempted and the thread is not given full ownership of any cache line. In some processor implementations, the memory type is also factored into whether a locked instruction is executed as a cache lock or a bus lock. Some processors might be designed such that, for some memory types, all data items have their atomicity handled as bus locks even if they are found in the cache.
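The conditions described above can be summarized in a minimal decision sketch. It is a simplified model rather than any processor's actual logic: the 64-byte line size, the set `BUS_LOCK_MEMORY_TYPES`, the memory type strings, and the function names are all hypothetical.

```python
from enum import Enum


class LockKind(Enum):
    CACHE_LOCK = "cache lock"
    BUS_LOCK = "bus lock"


CACHE_LINE_SIZE = 64  # bytes; assumed for illustration
# Hypothetical set of memory types that a given design always handles
# with bus locks, even when the data item is found in the cache.
BUS_LOCK_MEMORY_TYPES = frozenset({"uncacheable"})


def crosses_line_boundary(address, size, line_size=CACHE_LINE_SIZE):
    """Return True if the bytes [address, address + size) span two lines."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_line = address // line_size
    last_line = (address + size - 1) // line_size
    return first_line != last_line


def classify_locked_access(address, size, line_in_cache, memory_type="write-back"):
    """Classify a locked access as a cache lock or a bus lock (simplified)."""
    # An access that crosses a cache line boundary becomes a bus lock;
    # per the description, no cache snoop is attempted in this case.
    if crosses_line_boundary(address, size):
        return LockKind.BUS_LOCK
    # Some designs handle certain memory types as bus locks regardless
    # of whether the data item is cached.
    if memory_type in BUS_LOCK_MEMORY_TYPES:
        return LockKind.BUS_LOCK
    # A data item that is not found in the cache results in a bus lock.
    if not line_in_cache:
        return LockKind.BUS_LOCK
    return LockKind.CACHE_LOCK
```

For example, under these assumptions an 8-byte locked access at `0x103C` straddles the boundary at `0x1040` and is classified as a bus lock, whereas the same access at `0x1038` fits in one line and, if cached, is classified as a cache lock.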
In the case of a bus lock, all other threads are stopped until the operations upon the item of data by the thread that executed the locked instruction are complete. Here, the term “bus lock” is utilized even if a true “bus” does not exist between the core and system memory (e.g., the core is coupled to the system memory's memory controller through a point-to-point link).
The stopping of all other threads dramatically reduces the performance of the computing system. As such, programmers try to write code that avoids the occurrence of bus locks. Nevertheless, because instruction-level behavior is too complex to define and comprehend fully before run-time, bus locks remain a run-time possibility. Also, processors are typically designed to support bus locks for reasons of software backwards compatibility.
In terms of designing software, or even multi-core shared-data hardware, one possible approach is to trigger the execution of special micro-code any time a locked instruction is executed. Should the flow resulting from the locked instruction result in a bus lock, the micro-code raises a flag that is detected by the software. Upon analysis of the state of the system leading up to the bus lock, software designers can try to re-design the software so as to avoid the bus lock, and/or CPU designers can change their existing shared-data intra-core protocol logic design to prevent the bus lock in the same or similar circumstances.
A problem with this approach is that the special micro-code weighs on system performance and is executed even if a bus lock does not arise. That is, the micro-code will execute even if the normal flow results in a successful cache lock rather than a bus lock.
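The cost of this approach can be made concrete with a simple arithmetic sketch. The function names and the per-invocation cycle cost are hypothetical, and the figures in the usage note below are made-up inputs rather than measurements.

```python
def total_microcode_cycles(locked_instructions, microcode_cost_cycles):
    """Cycles spent when the micro-code runs on every locked instruction."""
    return locked_instructions * microcode_cost_cycles


def wasted_microcode_cycles(locked_instructions, bus_locks, microcode_cost_cycles):
    """Cycles spent running the micro-code for accesses that ended as cache locks.

    These invocations detect nothing, since no bus lock occurred.
    """
    if not 0 <= bus_locks <= locked_instructions:
        raise ValueError("bus_locks must be between 0 and locked_instructions")
    cache_locks = locked_instructions - bus_locks
    return cache_locks * microcode_cost_cycles
```

With illustrative inputs of 1,000,000 locked instructions, 10 bus locks, and a hypothetical cost of 20 cycles per micro-code invocation, `total_microcode_cycles` gives 20,000,000 cycles, of which `wasted_microcode_cycles` attributes 19,999,800 to cache locks that needed no flagging.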