A modern system architecture enables a processor (central processing unit (CPU)) to report and correct errors, and supports a CPU hot swap technology. Some original equipment manufacturers also support hot swap of non-uniform memory access (NUMA) hardware, that is, insertion and removal of a physical node. This feature requires that a kernel be able to remove, if necessary, a CPU that is in use. For example, to meet a reliability, availability, and serviceability (RAS) requirement, a CPU that executes malicious code must be kept out of a system execution path. Therefore, a LINUX kernel needs to support the CPU hot swap technology. When an operating system (OS) takes a CPU logically offline, the OS no longer uses the CPU thread that is taken offline, and any process or interrupt originally bound to that CPU thread is migrated to another thread.
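As an illustration of logical CPU removal, the following minimal sketch assumes a LINUX kernel built with CPU hotplug support, which exposes a per-CPU `online` file under sysfs. The helper names `cpu_online_path` and `set_cpu_online` are hypothetical, and writing to the file requires root privileges.

```python
from pathlib import Path


def cpu_online_path(cpu, sysfs_root="/sys"):
    """Path of the sysfs control file for one logical CPU (hypothetical helper)."""
    return Path(sysfs_root) / "devices" / "system" / "cpu" / f"cpu{cpu}" / "online"


def set_cpu_online(cpu, online, sysfs_root="/sys"):
    """Write "0" to take the CPU logically offline, "1" to bring it back.

    The kernel migrates processes and interrupts away from the CPU
    before the write completes. Requires root; not every CPU can be
    taken offline on every platform.
    """
    cpu_online_path(cpu, sysfs_root).write_text("1" if online else "0")
```

For example, `set_cpu_online(3, False)` would request that logical CPU 3 be taken offline.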
In a system based on multi-node interconnection, hot removal may be performed on a node controller (NC) of a node or on a CPU. If logical and physical removal needs to be performed on the NC, in addition to the CPU removal operation described above, the OS further takes the memory of the node offline: the OS migrates data that is in use in the address space of the node to memory of another node, and no longer allocates new memory space in this address segment. Assume that a system contains an NC0, an NC1, an NC2, and an NC3, and that removal is performed on the NC3. After all services on all CPUs of the NC3 node are migrated, nothing runs on the CPUs of the NC3 node, the other nodes do not use memory of the NC3 node, and the NC3 node does not access memory of the other nodes. However, because each NC holds directory information, a record that the NC3 previously occupied memory data on the other nodes may be retained.
Assuming that data at a memory address Addr0 on the NC0 is occupied by the NC3, the following cases exist when logical removal is performed on the NC3:
TABLE 1
Address   DIR Status   NC3   NC2   NC1   NC0   Remark
Addr0     E            1     0     0     0
Addr0     I            0     0     0     0     Data at the Addr0 on the NC3 is modified; the directory changes to an I state
Table 1 indicates that a CPU on the NC3 exclusively occupies the memory address Addr0 on the NC0, and then, an E state and exclusive occupation by the NC3 are recorded as directory information of the NC0. If the CPU on the NC3 modifies the data at the address, when the logical removal is performed on the NC3, the data is written back to memory of a CPU on the NC0, and the directory information is updated to be in an I state.
TABLE 2
Address   DIR Status   NC3   NC2   NC1   NC0   Remark
Addr0     E            1     0     0     0
Addr0     E            1     0     0     0     Data at the Addr0 on the NC3 is not modified
Table 2 indicates that a CPU on the NC3 exclusively occupies the memory address Addr0 on the NC0, and then, an E state and exclusive occupation by the NC3 are recorded as directory information of the NC0. However, because the CPU on the NC3 does not modify the data at the address, when the logical removal is performed on the NC3, the data is not written back to memory of a CPU on the NC0, and the directory information still indicates that the NC3 exclusively occupies the data at the Addr0.
TABLE 3
Address   DIR Status   NC3   NC2   NC1   NC0
Addr0     S            1     0     1     0
Addr0     S            1     0     1     0
Table 3 indicates that a CPU on the NC3 shares the memory address Addr0 on the NC0, and then, an S state and sharing by the NC3 and the NC1 are recorded as directory information of the NC0. If the data is not written back to memory of a CPU on the NC0 when logical removal is performed on an NC node, the directory information still indicates that the NC3 and the NC1 share the data at the Addr0.
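The three cases above can be sketched with a small directory model. This is an illustrative simplification, not the NC hardware logic: the `DirEntry` class and the `logical_removal` and `is_stale` helpers are hypothetical names, and the model assumes that only a write-back of modified data clears the directory entry on the home node.

```python
from dataclasses import dataclass

NODES = ("NC3", "NC2", "NC1", "NC0")  # column order used in the tables


@dataclass
class DirEntry:
    state: str          # "I" (invalid), "S" (shared) or "E" (exclusive)
    holders: frozenset  # nodes recorded as occupying the address

    def bits(self):
        """Presence bits in table order, e.g. '1010' for NC3 and NC1."""
        return "".join("1" if n in self.holders else "0" for n in NODES)


def logical_removal(entry, node, modified):
    """Directory entry on the home node after `node` is logically removed.

    Assumption: a modified exclusive line is written back and the entry
    becomes invalid; an unmodified or shared line leaves the entry as is.
    """
    if entry.state == "E" and entry.holders == {node} and modified:
        return DirEntry("I", frozenset())
    return entry


def is_stale(entry, removed):
    """True if the entry still names a node that has been removed."""
    return removed in entry.holders


# Table 1: exclusive and modified, so the write-back clears the entry.
case1 = logical_removal(DirEntry("E", frozenset({"NC3"})), "NC3", modified=True)
# Table 2: exclusive but unmodified, so the entry still names the NC3.
case2 = logical_removal(DirEntry("E", frozenset({"NC3"})), "NC3", modified=False)
# Table 3: shared by the NC3 and the NC1, so the entry is left unchanged.
case3 = logical_removal(DirEntry("S", frozenset({"NC3", "NC1"})), "NC3", modified=False)
```

In this model, `case2` and `case3` are the entries that remain stale after logical removal of the NC3.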
In the last two cases, if the directory information on the NC0 is not updated and a CPU0 on the NC0 needs to exclusively occupy the data at the Addr0, a snoop message is sent to the NC3 according to a cache coherence (CC) protocol. If the NC3 has already been physically removed, the snoop message cannot be answered, and the system is consequently suspended.
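The failure described above can be illustrated with a minimal sketch. It assumes a simplified protocol in which every recorded holder other than the requester must answer a snoop before exclusive occupation is granted; the function names `snoop_targets` and `exclusive_request` are hypothetical.

```python
def snoop_targets(holders, requester):
    """Nodes that must be snooped before `requester` can occupy the line."""
    return sorted(h for h in holders if h != requester)


def exclusive_request(holders, requester, present):
    """Return (granted, unanswered) for an exclusive-occupation request.

    `present` is the set of nodes still physically in the system. A snoop
    sent to a node outside `present` is never answered, which in a real
    system would suspend the requester instead of returning.
    """
    unanswered = [n for n in snoop_targets(holders, requester) if n not in present]
    return (not unanswered, unanswered)


after_removal = {"NC0", "NC1", "NC2"}                              # NC3 pulled out
stale_e = exclusive_request({"NC3"}, "NC0", after_removal)         # Table 2 entry
stale_s = exclusive_request({"NC3", "NC1"}, "NC0", after_removal)  # Table 3 entry
clean = exclusive_request(set(), "NC0", after_removal)             # Table 1 entry
```

Only the request against the cleared Table 1 entry is granted; both stale entries leave a snoop to the NC3 unanswered.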
In an existing solution, before the physical removal is performed on the NC3, the CPUs on the other NC nodes all send, to a remote node, exclusive-occupation requests covering the entire memory address space of the nodes. After all the memory addresses are updated in this manner, the directory statuses in Table 2 and Table 3 change to those shown in the following Table 4 and Table 5, respectively.
TABLE 4
Address   DIR Status   NC3   NC2   NC1   NC0
Addr0     E            1     0     0     0
Addr0     I            0     0     0     0

TABLE 5
Address   DIR Status   NC3   NC2   NC1   NC0
Addr0     S            1     0     1     0
Addr0     I            0     0     0     0
The other nodes no longer hold directory status information about occupation by the NC3, and all directory statuses change to the invalid state. Performing the physical removal on the NC3 at this point ensures that the system is neither suspended nor crashes.
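The existing solution can be sketched as a full sweep over the directory of a home node. This is a simplified model under stated assumptions: the directory is represented as a dictionary, and the name `flush_directory` is hypothetical. The sketch shows that one request is issued per address whether or not the entry refers to the node being removed.

```python
def flush_directory(directory):
    """Issue an exclusive-occupation request for every address.

    `directory` maps an address to a (state, holders) pair. After the
    sweep every entry is ("I", frozenset()), as in Table 4 and Table 5.
    Returns the number of requests issued.
    """
    requests = 0
    for addr in list(directory):
        requests += 1  # one request per address, stale or not
        directory[addr] = ("I", frozenset())
    return requests


directory = {
    "Addr0": ("E", frozenset({"NC3"})),         # stale, as in Table 2
    "Addr1": ("S", frozenset({"NC3", "NC1"})),  # stale, as in Table 3
    "Addr2": ("I", frozenset()),                # already invalid
}
issued = flush_directory(directory)
```

Here three requests are issued although only two entries name the NC3, which is why the cost of the sweep grows with the total memory size and not with the number of stale entries.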
However, because every other node needs to update its entire local memory once when this method is applied to removal of a node, the method consumes excessive OS time, makes the system respond extremely slowly, and greatly degrades system performance. In actual tests, if 256 gigabytes (GB) of memory on a single node is updated and a basic input/output system (BIOS) occupies 60% to 70% of the CPU time slices, the update takes about 20 minutes to complete. During this period, OS response becomes extremely slow, which is basically unacceptable to a user. In addition, the larger the memory of a single node and the larger the system scale, the longer the memory update takes.
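The test figures above imply an effective update rate that can be worked out directly. The sketch below assumes that 1 GB is counted as 1024 MiB and that the update time scales linearly with memory size; the linear scaling is an assumption for illustration, not a measured result, and the function names are hypothetical.

```python
def effective_rate_mib_s(mem_gib, minutes):
    """Average update rate implied by sweeping `mem_gib` in `minutes`."""
    return mem_gib * 1024 / (minutes * 60)


def estimated_minutes(mem_gib, rate_mib_s):
    """Sweep time for `mem_gib` at a fixed rate (linear-scaling assumption)."""
    return mem_gib * 1024 / rate_mib_s / 60


rate = effective_rate_mib_s(256, 20)         # figures from the test above
doubled = estimated_minutes(512, rate)       # hypothetical 512 GB node
```

With these figures the rate is roughly 218 MiB/s, and under the linear assumption a node with twice the memory would need about 40 minutes.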