In a multi-core processor, each processor core maintains its own cache while all processor cores share the same memory, so a problem of cache incoherence frequently occurs. Cache incoherence arises when the caches of different processor cores store data that corresponds to the same physical memory address but has different content. For example, in a shared memory system of a multi-core processor including processor cores A and B, each processor core maintains an independent cache resource. It is assumed that the processor core A and the processor core B read data from the same physical memory address, that is, the data read by the two processor cores corresponds to the same physical memory unit. If the processor core A later writes data to this address, the cache of the processor core A is updated while the cache of the processor core B still stores the previous data, thereby causing inconsistency of cached content.
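The scenario with processor cores A and B can be illustrated with a minimal sketch. The `Core` class, the address `0x1000`, and the stored values are hypothetical names chosen for illustration, and the sketch assumes that a write updates only the writing core's private cache and that no coherence protocol is present.

```python
# Toy model (illustrative assumption): two cores with private caches
# over one shared memory, with no coherence protocol between them.
class Core:
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory  # shared physical memory: address -> value
        self.cache = {}       # private cache: address -> value

    def read(self, addr):
        # On a miss, fill the private cache from shared memory.
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        # Only the private cache is updated; no other core is notified.
        self.cache[addr] = value


memory = {0x1000: 1}
core_a = Core("A", memory)
core_b = Core("B", memory)

# Both cores read the same physical address and cache the same value.
core_a.read(0x1000)
core_b.read(0x1000)

# Core A writes later; core B keeps serving its stale cached copy.
core_a.write(0x1000, 2)
value_seen_by_a = core_a.read(0x1000)
value_seen_by_b = core_b.read(0x1000)
```

After the write, the two cores return different values for the same address, which is the inconsistency of cached content described above.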
In a conventional multi-core processor, the cache coherence issue is generally resolved by using a hardware cache coherence protocol. Common hardware cache coherence protocols include bus sniffing (bus snooping) protocols, directory-based protocols, token-based protocols, and the like. However, as the quantity of cores of a many-core chip increases, the costs of hardware cache coherence grow linearly with the core quantity and can eventually offset the benefits brought by the additional cores. The costs of hardware cache coherence mainly include the following aspects.
(1) Communication costs: To implement cache coherence, cache state updates need to be exchanged using a cache communication protocol. Research shows that on-chip communication traffic of a system implementing a hardware cache coherence protocol is 20% higher than that of a system without cache coherence, and the situation deteriorates as the core quantity increases.
(2) Difficulties in design and verification: It is extremely difficult to implement state synchronization among hundreds of cores, and the resulting design complexity sharply increases design and verification costs.
Although the foregoing problems can be mitigated by using smarter designs, they cannot be thoroughly resolved. Therefore, software cache coherence may be selected instead of hardware cache coherence. For example, many-core research chips such as the Single-chip Cloud Computer (SCC) and Teraflops of Intel® have eventually given up hardware cache coherence implementation.
A distributed shared memory (DSM) model is a mainstream memory model for implementing software cache coherence. As shown in FIG. 1, in this memory model, the processes of an application program have the same shared virtual memory, and each process separately maps some or all virtual memory pages in the shared virtual memory to a private physical memory space maintained by the process. At the user level, each process sees a complete shared virtual memory space, and does not perceive that shared data included in a virtual memory page in the shared virtual memory space is actually in a private physical memory space maintained by another process. Each process may perform any data operation on the shared virtual memory, and the underlying layer of the DSM performs data synchronization between the processes using an on-chip network or shared physical memory of the system that can be accessed by all the processes. Multiple processes of an application program may run on one processor core, or each process may run on a separate processor core.
A scope coherence protocol is a mainstream DSM-based software cache coherence protocol, and has the advantages of being simple and highly efficient. In an application program, code ranges that are protected by Acquire(lock)/Release(lock) using the same lock belong to the same scope. The scope coherence protocol ensures only that shared variables in the same scope are synchronized; shared variables in different scopes may be unsynchronized. Moreover, in the scope coherence protocol, consistency of shared data in the same scope is generally maintained using a Twin/Diff (backup/comparison) mechanism. An existing Twin/Diff mechanism is implemented based on off-chip memory of a multi-core platform. A Twin page is a page for backing up a current working page, and when the space of a cache is insufficient, the Twin page is stored in local off-chip memory. After a process completes a write operation on the working page, a Diff comparison operation is performed on the modified working page and the Twin page, and the comparison result is sent to a home process of the working page, such that the home process updates the working page.
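The Twin/Diff mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular DSM system: the function names `make_twin`, `compute_diff`, and `apply_diff`, the byte-granularity comparison, and the 16-byte page size are all hypothetical.

```python
PAGE_SIZE = 16  # hypothetical page size in bytes, kept tiny for illustration


def make_twin(working_page):
    # Back up the working page before the process modifies it.
    return bytes(working_page)


def compute_diff(working_page, twin):
    # Compare the modified working page with its Twin page and record
    # (offset, new_value) for every byte that changed.
    return [(offset, new)
            for offset, (new, old) in enumerate(zip(working_page, twin))
            if new != old]


def apply_diff(home_page, diff):
    # The home process applies the received changes to its own copy.
    for offset, value in diff:
        home_page[offset] = value


# The home process owns the page; another process works on a local copy.
home_page = bytearray(PAGE_SIZE)
working_page = bytearray(home_page)

twin = make_twin(working_page)   # taken before the first write
working_page[3] = 0xAB           # writes performed inside the scope
working_page[10] = 0xCD

diff = compute_diff(working_page, twin)  # performed when the writes are done
apply_diff(home_page, diff)              # diff is "sent" to the home process
```

Only the changed bytes travel to the home process, which is why both the working page and its Twin page must be available when the Diff operation runs.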
In the existing Twin/Diff mechanism, if a program accesses a large quantity of pages in a scope, the size of the cache becomes a limitation: according to the cache replacement algorithm, a page accessed later evicts a page accessed earlier (both its working page and its Twin page) from the cache. As a result, when the program exits the scope and a Diff operation is performed for a page accessed earlier, the working page and the Twin page must be reloaded from off-chip memory into the cache. This causes a large off-chip access load, increases the data access latency, and reduces the execution efficiency of the program.
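The effect can be illustrated with a small simulation. The sketch assumes a least-recently-used (LRU) replacement policy, a cache capacity counted in whole pages, and Diff operations performed in the original access order at scope exit; the source does not specify these details, and the names `LruCache` and `run_scope` are hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Page-granularity cache with LRU replacement (an assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.off_chip_loads = 0

    def touch(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # hit: mark as most recent
            return
        self.off_chip_loads += 1              # miss: fetch from off-chip memory
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used page


def run_scope(num_pages, capacity):
    """Return the number of off-chip reloads needed for the Diff phase."""
    cache = LruCache(capacity)
    # Inside the scope: each written page occupies a working page and a Twin page.
    for page in range(num_pages):
        cache.touch(("work", page))
        cache.touch(("twin", page))
    loads_inside_scope = cache.off_chip_loads
    # At scope exit: the Diff operation reads both pages again.
    for page in range(num_pages):
        cache.touch(("work", page))
        cache.touch(("twin", page))
    return cache.off_chip_loads - loads_inside_scope


reloads_small = run_scope(num_pages=2, capacity=8)  # everything fits in the cache
reloads_large = run_scope(num_pages=8, capacity=8)  # 16 pages compete for 8 slots
```

In this model, a scope that touches 2 pages needs no reloads at exit, whereas a scope that touches 8 pages with an 8-page cache reloads all 16 working and Twin pages, because each reload evicts a page that the Diff phase needs later.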