In a multiprocessor that comprises a desired number n of processing cores, where the term “processing core” refers to one processor contained in the multiprocessor, it is known to use one or more shared memory spaces. These shared memory spaces must be implemented in practice by one or more physical apparatuses. As shown in prior art FIG. 1, a known technique for implementing a shared memory space is to divide the space among a desired number k of memory banks, and to connect the n processing cores to the k memory banks through an interconnection network.
The prior art shared memory system of FIG. 1 comprises an interconnection network 20 and a collection of memory banks 22, depicted alongside the attached processing cores 24. The processing cores 24, denoted P1, P2, . . . Pn in FIG. 1, are connected through the interconnection network 20 to the memory banks 22, denoted B1, B2, . . . Bk. The union of the individual memory spaces embodied by these banks constitutes the shared memory space.
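The division of one shared space among k banks can be illustrated with a minimal sketch. The mapping used here (low-order interleaving, bank = address mod k) is an assumption made only for illustration; the arrangement of FIG. 1 does not prescribe how addresses are assigned to banks, and the function names are hypothetical.

```python
from typing import List, Tuple


def bank_of(address: int, k: int) -> Tuple[int, int]:
    """Map a shared-space address to (bank index, offset inside that bank).

    Assumption: low-order interleaving, so consecutive addresses fall in
    consecutive banks. The source text does not specify the mapping.
    """
    if k <= 0 or address < 0:
        raise ValueError("k must be positive and address non-negative")
    return address % k, address // k


def requests_per_bank(addresses: List[int], k: int) -> List[int]:
    """Count how many simultaneous requests land on each of the k banks."""
    counts = [0] * k
    for address in addresses:
        bank, _ = bank_of(address, k)
        counts[bank] += 1
    return counts
```

If each of four cores issues one request in the same cycle, addresses 0, 1, 2 and 3 fall on four different banks under this assumed mapping, whereas four requests to the same address all fall on one bank and must be served one after another.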
Many different implementations of the interconnection network appearing in FIG. 1 are known in the prior art. Nevertheless, according to the prior art, it is not considered possible to build a shared memory system that allows tens of processing cores or more to concurrently access random addresses within the shared memory with a degree of efficiency comparable to the degree achieved by a single processing core accessing a local private memory.
A representative summary of the prior art can be found on page 638 of the second edition of the book “Computer Architecture: A Quantitative Approach”, written by John L. Hennessy and David A. Patterson and published in San Francisco in 1996 by Morgan Kaufmann Publishers Inc. The statement reads as follows: “To support larger processor counts, memory must be distributed among the processors rather than centralized; otherwise the memory system would not be able to support the bandwidth demands of a larger number of processors”.
According to the prior art, the interconnection network is conceived to be a complicated and cumbersome apparatus, which cannot simultaneously provide bandwidth high enough and latency low enough to allow access to random addresses with efficiency comparable to that of a local memory. This shortcoming is related to another fact concerning the prior art of multiprocessor (also called “multicore”) computer construction: the activity of synchronization and scheduling in the multicore computer is usually performed through the shared memory.
Thus, one common design characteristic of prior art multiprocessors that typically hinders performance is placing much of the burden of the synchronization and scheduling activity on the shared memory system. This activity must be conducted in one way or another in every multiprocessor. Performing the synchronization and scheduling through the shared memory system not only imposes a burden on this system but also impairs its efficiency. This is particularly due to the development of what is known by those skilled in the art as “hot spots”; in this case, those that are related to synchronization and scheduling. In addition, the shared memory system may become more complicated because of the demand to support special synchronization primitives, such as Test&Set, Fetch&Add, or others. Such primitives typically require read and update operations that are inseparable from each other, and are performed in what is known by those skilled in the art as an “atomic” manner.
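The semantics of such primitives can be sketched in software as follows. This is only a behavioral model under stated assumptions: the class `AtomicCell` and the function `run_workers` are hypothetical names, and a `threading.Lock` stands in for whatever mechanism the memory hardware would use to make the read and the update inseparable.

```python
import threading


class AtomicCell:
    """Hypothetical model of one shared-memory word with atomic primitives."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def fetch_and_add(self, delta: int) -> int:
        """Return the old value and add delta, as one inseparable step."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def test_and_set(self) -> int:
        """Return the old value and set the word to 1, as one inseparable step."""
        with self._lock:
            old = self._value
            self._value = 1
            return old

    def read(self) -> int:
        with self._lock:
            return self._value


def run_workers(n_cores: int, increments: int) -> int:
    """Let n_cores threads increment one shared cell; return the final value."""
    cell = AtomicCell(0)

    def worker() -> None:
        for _ in range(increments):
            cell.fetch_and_add(1)

    threads = [threading.Thread(target=worker) for _ in range(n_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.read()
```

Because every worker in this model contends for the same cell, all updates are serialized on it; this is a software analogue of a synchronization-related hot spot, not a measurement of any particular machine.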
Another design characteristic of prior art multiprocessors that may also hinder performance is the demand that the shared memory system support read-modify-write transactions rather than only simple reads and writes.
In general, prior art multicore computers that are designed so that the synchronization and scheduling activity is performed through the shared memory are not built with the aspiration that accessing the shared memory will be as efficient as accessing a local memory. This is because such computers cannot support fine computational granularity. This problem can be restated as follows: decomposing a given algorithm into ever finer granularity levels yields an ever increasing demand for synchronization rate, and an ever greater ratio of overhead activity to productive computation.
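The growth of this ratio can be illustrated with a deliberately simple model. It assumes, purely for illustration, that the productive work of the algorithm stays constant while it is split into more tasks, and that each task incurs a fixed synchronization cost; neither assumption nor the function name comes from the prior art discussed here.

```python
def overhead_ratio(total_work: float, num_tasks: int, sync_cost: float) -> float:
    """Ratio of synchronization overhead to productive computation.

    Assumptions (illustrative only): total_work is fixed regardless of how
    finely it is decomposed, and every task pays the same sync_cost.
    """
    if total_work <= 0 or num_tasks <= 0 or sync_cost < 0:
        raise ValueError("total_work and num_tasks must be positive, sync_cost non-negative")
    return (num_tasks * sync_cost) / total_work
```

Under these assumptions, 1000 units of work split into 10 tasks with a synchronization cost of 5 units each gives a ratio of 0.05; splitting the same work into 100 tasks raises the ratio to 0.5, and into 1000 tasks to 5.0.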
When the multicore computer is intended from the outset to perform parallel computations at a limited level of granularity, this limitation typically leads to constructing the computer in such a way that processing cores work mainly against their own local memories.
Thus, according to the prior art (see Hennessy and Patterson, cited above), it is not considered possible to build a shared memory system that allows tens of processing cores or more to concurrently access random addresses within a shared memory with a degree of efficiency comparable to that achieved by a single processing core accessing a local private memory.
A prior art attempt to overcome the problems associated with synchronization and scheduling activities is described in U.S. Pat. No. 5,202,987, one of whose co-inventors is also a co-inventor of the present invention. That patent describes a multicore computer design equipped with a dedicated apparatus for synchronization and scheduling. The need to build a shared memory system that allows access with an efficiency comparable to that of a local memory arises in relation to the apparatus described in U.S. Pat. No. 5,202,987.
Prior to the publication of U.S. Pat. No. 5,202,987, neither the feasibility of nor the need for a shared memory system that allows access with efficiency comparable to that of a local memory in multicore computers had been established. It would therefore be desirable to provide a shared memory system that achieves such efficiency by implementing the present invention in conjunction with the invention disclosed in U.S. Pat. No. 5,202,987 to Bayer et al., which names one of the co-inventors of the present invention and is incorporated herein by reference.