With the continuing advance of science and technology, digital information processing plays an increasingly important role in our daily lives and business activities. Consequently, the amount of data to be processed has become too large to be handled by a simple data processing device, such as a computer system with a single processor and a local memory. Multi-processor systems have been developed to deal efficiently with such large quantities of data.
So far, two types of parallel data-processing systems have been used. One is the tightly coupled parallel data-processing system, and the other is the loosely coupled parallel data-processing system.
The tightly coupled parallel data-processing system includes a plurality of central processing units (CPUs) and a memory accessible by all the CPUs. Because this architecture is an extension of a single-CPU system, its design is relatively simple. Such a system, however, has an inherent limit. Since the plurality of CPUs access the memory via a single common bus, the overall scale of the system cannot be too large. Moreover, a large number of CPUs places a heavy load on the bus.
On the other hand, the loosely coupled parallel data-processing system consists of a plurality of computers interconnected via a high-speed network. With a carefully designed topology, the loosely coupled parallel data-processing system can be far more expandable than the tightly coupled parallel data-processing system. In other words, a large number of processors can be included in the system. Since all communication within the system is conducted via the network, however, the architecture must be considerably more complex than that of the tightly coupled parallel data-processing system in order to achieve high performance.
In order to solve the problems of the above systems, a processing system involving a distributed shared memory (DSM) has been developed for parallel data processing and rapid data sharing, allowing a remote node to access a local memory. The DSM system combines the advantages of the tightly and loosely coupled parallel data-processing systems. That is, the DSM system is both simple and expandable. Since 1980, a number of DSM systems have been put into practice. One example is the cache-coherent non-uniform memory access (ccNUMA) architecture.
Please refer to FIG. 1, which is a block diagram illustrating a conventional ccNUMA-type DSM system. The DSM system 10 includes four nodes 11 to 14 interconnected by a network 15. The nodes 11 to 14, as shown, include respective processors 111, 112, 121, 122, 131, 132, 141, 142, memory control chips 113, 123, 133, 143 for I/O control, local memories 1131, 1231, 1331, 1431, DSM controllers 114, 124, 134, 144, external caches or L3 caches 1141, 1241, 1341, 1441, system buses 115, 125, 135, 145, and internal buses 116, 126, 136, 146. Each of the local memories 1131, 1231, 1331, 1431 is divided into a plurality of local memory lines for separately storing data. Likewise, each of the caches 1141, 1241, 1341, 1441 is divided into a plurality of cache lines for separately storing cache data.
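As a reading aid for the reference numerals of FIG. 1, the composition of each node can be sketched as a small data structure. This is only an illustrative model: the class name Node, the helper build_node, the dictionary DSM_SYSTEM_10, and the constant LINE_COUNT are hypothetical, and the number of lines per memory or cache is an assumption, since no size is given in the description.

```python
from dataclasses import dataclass, field

LINE_COUNT = 4  # assumed number of lines per memory/cache; not specified in FIG. 1


@dataclass
class Node:
    node_id: int               # 11, 12, 13 or 14
    processors: tuple          # e.g. (111, 112)
    memory_control_chip: int   # e.g. 113
    local_memory: int          # e.g. 1131
    dsm_controller: int        # e.g. 114
    l3_cache: int              # e.g. 1141
    system_bus: int            # e.g. 115
    internal_bus: int          # e.g. 116
    # Each local memory and each cache is divided into separately stored lines.
    memory_lines: list = field(default_factory=lambda: [None] * LINE_COUNT)
    cache_lines: list = field(default_factory=lambda: [None] * LINE_COUNT)


def build_node(node_id: int) -> Node:
    """Derive a node's reference numerals from its node number (11 -> 111, 112, ...)."""
    base = node_id * 10
    chip = base + 3
    controller = base + 4
    return Node(
        node_id=node_id,
        processors=(base + 1, base + 2),
        memory_control_chip=chip,
        local_memory=chip * 10 + 1,
        dsm_controller=controller,
        l3_cache=controller * 10 + 1,
        system_bus=base + 5,
        internal_bus=base + 6,
    )


# The four nodes of DSM system 10, which FIG. 1 interconnects by network 15.
DSM_SYSTEM_10 = {n: build_node(n) for n in (11, 12, 13, 14)}
```

For instance, DSM_SYSTEM_10[13].l3_cache evaluates to 1341, matching the L3 cache of node 13 in the description above.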
Each of the DSM controllers 114, 124, 134, 144 maintains a memory coherency directory stored therein (not shown) in order to keep track of the states of all the local memory lines. When any of the nodes is to read data from a specific local memory line, the reading operation is guided by the DSM controller according to the memory coherency directory. The DSM controller also maintains a cache coherency directory stored therein (not shown) in order to keep track of the states of all the cache lines. When any of the nodes is to read data from a specific cache line, the reading operation is guided by the DSM controller according to the cache coherency directory.
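The way a directory can guide a read may be illustrated with a minimal sketch of a generic directory-based scheme. It is not the actual logic of the DSM controllers of FIG. 1: the three line states, the class DsmController, its methods, and the fetch_from_owner callback are all assumptions introduced for illustration, as the description does not enumerate the states recorded in the directory.

```python
from enum import Enum


class LineState(Enum):
    # Assumed states; the description does not list them.
    UNCACHED = "uncached"   # only the local memory holds the line
    SHARED = "shared"       # one or more remote nodes hold read-only copies
    MODIFIED = "modified"   # exactly one remote node holds a newer copy


class DsmController:
    """Hypothetical home-node controller holding a memory coherency directory."""

    def __init__(self, node_id, line_count):
        self.node_id = node_id
        self.memory = [0] * line_count
        # One directory entry per local memory line.
        self.directory = [
            {"state": LineState.UNCACHED, "holders": set()}
            for _ in range(line_count)
        ]

    def read(self, line, requester, fetch_from_owner):
        """Serve a read of a local memory line, consulting the directory first."""
        entry = self.directory[line]
        if entry["state"] is LineState.MODIFIED:
            # The local copy is stale: pull the newer data from the owning node.
            (owner,) = entry["holders"]
            self.memory[line] = fetch_from_owner(owner, line)
        if requester != self.node_id:
            entry["holders"].add(requester)
        entry["state"] = (
            LineState.SHARED if entry["holders"] else LineState.UNCACHED
        )
        return self.memory[line]

    def grant_write(self, line, requester):
        """Give a remote node exclusive ownership; return the nodes to invalidate."""
        entry = self.directory[line]
        invalidated = entry["holders"] - {requester}
        entry["holders"] = {requester}
        entry["state"] = LineState.MODIFIED
        return invalidated
```

In this sketch, a controller created as DsmController(11, 4) that serves reads of line 0 from nodes 12 and 13 records both as holders; a subsequent grant_write(0, 13) reports node 12 for invalidation, and the next read of line 0 first fetches the newer data from node 13 before answering.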
Since the DSM controllers of all the nodes communicate with one another via the network 15, a network communication protocol such as TCP/IP is used as the data transmission format for inter-node communication. As is known to those skilled in the art, such a communication protocol is complex and inefficient. For a DSM system consisting of only two nodes, this communication complexity and performance cost are all the more disadvantageous.