In a multi-processor system, multiple processors share the memory space of the system. The interconnection between processors has changed from a bus connection to point-to-point connections, and memory is now attached directly to the processor instead of to an external bridge chip of the processor. This change in how memory is attached also changes how memory is distributed in the system, which makes memory access in the multi-processor system non-uniform. Current multi-processor systems are therefore mostly Non-Uniform Memory Access (NUMA) architecture systems.
The NUMA architecture multi-processor system has the following 3 important characteristics:
1. all memories are addressed uniformly, so as to form a unified memory space;
2. all processors can access all addresses in the memory space;
3. accessing a remote memory is slower than accessing a local memory.
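The three characteristics above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `NumaNode`, `build_address_map`, `home_node` and `access_cost`, and the relative cost values, are hypothetical and do not describe any particular hardware.

```python
from dataclasses import dataclass

# Hypothetical relative access costs; real latencies depend on the hardware.
LOCAL_COST = 1
REMOTE_COST = 3


@dataclass(frozen=True)
class NumaNode:
    node_id: int
    base: int  # first address of this node's memory in the unified space
    size: int  # number of addressable bytes attached to this node


def build_address_map(sizes):
    """Characteristic 1: lay out every node's memory in one unified space."""
    nodes, base = [], 0
    for node_id, size in enumerate(sizes):
        nodes.append(NumaNode(node_id, base, size))
        base += size
    return nodes


def home_node(nodes, addr):
    """Return the ID of the node whose memory holds the given address."""
    for node in nodes:
        if node.base <= addr < node.base + node.size:
            return node.node_id
    raise ValueError(f"address {addr} is outside the unified memory space")


def access_cost(nodes, cpu_node, addr):
    """Characteristics 2 and 3: any processor may access any address,
    but a remote access costs more than a local one."""
    return LOCAL_COST if home_node(nodes, addr) == cpu_node else REMOTE_COST
```

For example, with `nodes = build_address_map([1024, 1024])`, a processor on node 0 pays the local cost for address 100 and the remote cost for address 1500, which is held by node 1.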
A NUMA system has multiple cache units distributed throughout the system, and therefore must be designed to solve the problem of coherence between the multiple caches. A NUMA system that satisfies cache coherence is also referred to as a Cache Coherent Non-Uniform Memory Access (CC-NUMA) system. How to solve the problem of cache coherence is a core problem of the CC-NUMA system.
Currently, memory is attached directly to the processor, and the processor supports a cache coherence protocol. Therefore, in one solution, the processors are directly interconnected to form a multi-processor system, and cache coherence between the processors is guaranteed by the cache coherence protocol maintenance engines of the processors, so as to form a single cache coherency domain. In the single cache coherency domain, the processors are identified by processor ID numbers. However, a multi-processor system organized in this manner has a limited scale, because every processor occupies at least one processor ID number in the cache coherency domain, and the number of processor ID numbers that each processor can distinguish is limited. For example, a processor that can distinguish 4 processor ID numbers supports direct interconnection of at most 4 processors in the domain, and a processor that can distinguish only 2 processor IDs supports only two processors in a cache coherency domain. Moreover, due to physical limits and price limits, the number of interconnection ports of a processor is also limited. In some circumstances, even though the number of processor IDs supported by the processor in the single cache coherency domain meets the requirement, direct connection causes large hop counts and long delays for cross-processor memory access, and therefore an efficient multi-processor system cannot be formed.
The parameter configuration of a processor, in particular the number of interconnection ports and the number of supportable processor IDs, is closely related to the pricing of the processor. Generally, the fewer interconnection ports and processor IDs a processor supports, the cheaper it is. For example, a processor supporting 2 processor IDs in a domain is cheaper than a processor supporting 4 processor IDs in a domain.
As described above, a multi-processor system formed by directly connecting processors has a limited scale. In order to implement a CC-NUMA multi-processor system having a larger scale, node controllers are required. The node controller functions to expand the system scale and maintain global cache coherence. First, each node controller is connected to 1 to 4 processors, so as to form a node and a first-level cache coherency domain, and the intra-domain cache coherence is collectively maintained by the processors and the node controller. The node controller also occupies at least one processor ID in the domain, and therefore the sum of the numbers of processors and node controllers in the domain cannot be greater than the number of processor IDs that the processor can support in the domain. Then, the node controllers are directly interconnected, or are connected by using a node router, to form a large-scale CC-NUMA system. Second-level cache coherence between nodes is maintained by the node controllers: when a processor in one node accesses the memory of a processor in another node, crossing nodes and cache coherency domains, global cache coherence is maintained by the node controllers.
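The processor ID constraint on a first-level cache coherency domain can be sketched as a simple budget check. This is a minimal sketch, assuming the node controller occupies exactly one processor ID (the text only states "at least one"); the function names `max_processors_per_node` and `node_fits` are hypothetical.

```python
def max_processors_per_node(supported_ids, controller_ids=1):
    """Largest number of processors that can share a domain with one
    node controller, given the processor IDs each processor can distinguish.

    controller_ids is an assumption: the node controller is taken to
    occupy exactly one ID unless stated otherwise.
    """
    if supported_ids <= controller_ids:
        raise ValueError("no processor ID is left for any processor")
    return supported_ids - controller_ids


def node_fits(processors, supported_ids, controller_ids=1):
    """True if the processors plus the node controller fit within the
    number of processor IDs supported in the domain."""
    return processors >= 1 and processors + controller_ids <= supported_ids
```

Under this assumption, a processor that distinguishes only 2 processor IDs leaves room for a single processor beside the node controller, so `max_processors_per_node(2)` evaluates to 1 and `node_fits(2, 2)` is false.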
The CC-NUMA system uses the node controllers to expand the system scale and maintain the global cache coherence, which increases the overheads of cross-domain processing and inter-domain communication. This results in a significant reduction in the efficiency of remote memory access, and the larger the system scale is, the more obvious the reduction is. If a CC-NUMA system formed by 64 processors is to be built, two solutions may be used. In solution 1, there are 4 processors in the coherency domain of each node, and therefore at least 16 node controllers are required for the whole system. In solution 2, a processor supporting only 2 processor IDs in the domain may be used; in that case, the cache coherency domain in one node can only be formed by one processor and one node controller, so that at least 64 node controllers are required. So many node controllers result in a very large inter-node interconnection scale and a much more complicated inter-node topology. Consequently, the speed of cross-node access to a remote memory deteriorates markedly, causing a rapid reduction of system efficiency and a tremendous loss of performance.
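The node controller counts in the two solutions follow from a simple division, sketched below. The function names are hypothetical, and `full_mesh_links` is an added illustration that assumes every node controller is directly connected to every other one, which is only one of the interconnection options mentioned in this description.

```python
def node_controllers_needed(total_processors, processors_per_node):
    """Minimum number of node controllers, one per node, rounded up."""
    if total_processors < 1 or processors_per_node < 1:
        raise ValueError("both arguments must be at least 1")
    # Ceiling division using integers only.
    return -(-total_processors // processors_per_node)


def full_mesh_links(node_count):
    """Links needed if every node controller connects directly to every
    other one (an illustrative assumption, not a required topology)."""
    return node_count * (node_count - 1) // 2


# Solution 1: 4 processors in the coherency domain of each node.
solution_1 = node_controllers_needed(64, 4)
# Solution 2: 2 processor IDs per domain, so 1 processor per node.
solution_2 = node_controllers_needed(64, 1)
```

Here `solution_1` is 16 and `solution_2` is 64. Under the full-mesh assumption, these correspond to 120 and 2016 inter-node links respectively, which gives a sense of how quickly the interconnection scale grows with the number of nodes.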
It can be seen that, for a multi-node multi-processor system, reducing the number of nodes plays a direct and significant role in reducing the inter-node interconnection scale and simplifying the inter-node topology, especially when the processor supports only very limited numbers of interconnection ports and of processor IDs in a domain. Therefore, how to effectively reduce the number of node controllers is a significant and urgent technical problem to be solved.