Modern processors incorporate ever greater amounts of circuitry. For example, multi-core and many-core processors are being introduced that include a number of processing cores or engines in addition to other logic, internal memory such as one or more levels of cache memory, and so forth. Such processors are also often programmed to execute multiple threads concurrently, and individual processors can be connected together in multichip clusters.
As a result, maintaining full compliance with the memory ordering rules of an instruction set architecture (ISA) while still providing efficient memory accesses is becoming increasingly difficult. The term “memory ordering” refers to the order in which a processor issues reads (loads) and writes (stores) to system memory. Different processor architectures support different memory-ordering models. In so-called program or strong ordering, reads and writes are issued in program order. To allow performance optimization of instruction execution, some architectures provide a memory-ordering model that relaxes this requirement, for example by allowing reads to proceed ahead of buffered writes.
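The difference between strong ordering and a model in which reads may pass buffered writes can be illustrated with a small simulation. The following is a minimal sketch, not a description of any particular processor: it assumes two threads, each of which stores 1 to one variable and then loads the other, and it models each thread's store buffer with three hypothetical events ("S" places the store in the buffer, "D" drains it to memory, "L" performs the load). The names `run`, `legal`, and `outcomes` are illustrative only.

```python
from itertools import permutations

# Thread 0 stores x then loads y; thread 1 stores y then loads x.
PROGRAM = {0: ("x", "y"), 1: ("y", "x")}  # thread id -> (stored var, loaded var)

def run(order):
    """Execute one interleaving of events and return the loaded values (r0, r1)."""
    memory = {"x": 0, "y": 0}
    buffers = {0: {}, 1: {}}   # per-thread store buffer: var -> pending value
    regs = {}
    for tid, kind in order:
        stored, loaded = PROGRAM[tid]
        if kind == "S":        # store enters the thread's buffer
            buffers[tid][stored] = 1
        elif kind == "D":      # buffered store drains to system memory
            memory[stored] = buffers[tid].pop(stored)
        else:                  # "L": read own buffer first, else memory
            regs[tid] = buffers[tid].get(loaded, memory[loaded])
    return (regs[0], regs[1])

def legal(order, allow_bypass):
    """Check the per-thread ordering constraints for the chosen model."""
    for tid in (0, 1):
        pos = {kind: order.index((tid, kind)) for kind in "SDL"}
        if not (pos["S"] < pos["D"] and pos["S"] < pos["L"]):
            return False
        # Strong ordering: the store must reach memory before the load issues.
        if not allow_bypass and not pos["D"] < pos["L"]:
            return False
    return True

def outcomes(allow_bypass):
    """Collect every (r0, r1) result reachable under the chosen model."""
    events = [(tid, kind) for tid in (0, 1) for kind in "SDL"]
    return {run(o) for o in permutations(events) if legal(o, allow_bypass)}
```

Under these assumptions, `outcomes(False)` contains only results in which at least one thread observes the other's store, whereas `outcomes(True)` additionally contains `(0, 0)`, the case in which both loads completed while both stores were still buffered.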
Since data may be used by different agents in the same or different processors, copies of data can be stored in various locations of a system, e.g., one or more cache memories associated with different processors. However, a coherent view of the data is to be maintained. While efficient cache line sharing between all agents in a system is a design goal, some of the data may be used only locally or shared by only a few threads. Some regions of memory can be defined as cacheable, meaning that when a processor seeks to write data to memory, the data can initially be stored in a cache memory associated with the processor, without immediately being written to the system memory. Thus, data of a cacheable region can both be held in a cache and be the target of stores to that cache. In contrast, a write to an uncacheable memory region is written immediately to the memory, and data of an uncacheable request is not stored in a cache. In current systems, any access to cacheable memory may have to look up a given memory location in all caches in the system, which increases latency and traffic. To reduce this overhead, mechanisms such as snoop filters are implemented in hardware, but these resources can consume a large chip area, particularly as the number of collaborative agents increases.
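The cacheable and uncacheable behaviors, and the lookup cost that a snoop filter is meant to reduce, can be sketched in software. The following toy model is an assumption-laden illustration rather than a hardware design: the class `ToySystem`, its `sharers` table (standing in for a snoop filter as an exact record of which agents hold a line), and the `snoops` counter are all hypothetical, and write-back of modified lines to memory is not modeled.

```python
class ToySystem:
    """Toy model of agents with private caches over a shared memory."""

    def __init__(self, n_agents, uncacheable=()):
        self.memory = {}
        self.caches = [dict() for _ in range(n_agents)]  # addr -> data per agent
        self.uncacheable = set(uncacheable)
        self.sharers = {}   # hypothetical snoop filter: addr -> agents holding it
        self.snoops = 0     # number of remote cache lookups performed

    def store(self, agent, addr, value):
        if addr in self.uncacheable:
            self.memory[addr] = value        # written to memory at once, never cached
            return
        for other in self.sharers.get(addr, set()) - {agent}:
            del self.caches[other][addr]     # invalidate stale copies elsewhere
        self.caches[agent][addr] = value     # data stays in the cache for now
        self.sharers[addr] = {agent}

    def load(self, agent, addr, use_filter):
        if addr in self.caches[agent]:
            return self.caches[agent][addr]
        if addr in self.uncacheable:
            return self.memory.get(addr, 0)
        others = [a for a in range(len(self.caches)) if a != agent]
        if use_filter:
            # Snoop only the agents recorded as holding the line.
            holders = self.sharers.get(addr, set())
            targets = [a for a in others if a in holders]
        else:
            targets = others                 # broadcast: look in every other cache
        found, value = False, 0
        for a in targets:
            self.snoops += 1
            if addr in self.caches[a]:
                found, value = True, self.caches[a][addr]
        if not found:
            value = self.memory.get(addr, 0)
        self.caches[agent][addr] = value
        self.sharers.setdefault(addr, set()).add(agent)
        return value
```

In this sketch, a load that misses locally in an eight-agent system performs seven remote lookups when broadcasting but only one when the sharer table is consulted. The table itself grows with the number of tracked lines and agents, which loosely mirrors the area cost noted above.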