A method for developing the most appropriate affinity relationship among multiple processors, each of which resides on a particular one of multiple busses, in order to optimize operations and throughput. A determination method is used to establish which processors are operating on which particular one of the multiple busses so that appropriate balancing can occur across the busses.
High-end multi-processor systems are generally provided with a multiplicity of busses. In many cases there are at least two busses, each used by a different group or set of processors. A problem often arises in that the operating system does not report or provide any indication of the particular system bus on which each particular processor resides. This is important information when optimizing the utilization of the system through what is called affinity management, where it is necessary to determine or dictate which processor, or range of processors, a particular software application is allowed to use.
If one group of processors resides on one of the system busses, and another group of processors resides on another system bus, then there are considerable problems with cache coherency, since cross-bus operations for invalidation of cache data are then required, which are time-consuming and which degrade the throughput. The most appropriate approach in these types of multiple-bus, multiple-processor systems is to perform performance tuning which balances the various applications across the system busses in a manner which minimizes the effects of caching overhead, especially cross-bus traffic on cache invalidation operations. Balancing applications implies minimizing cross-bus cache invalidation overhead by placing applications that share cached data onto processors that reside on the same system bus.
In situations where multiple processors, each having first and second level cache units, reside on different busses, it is required that when a processor on one bus has initiated a cycle for cache memory invalidation, the operation cross over to the other bus to ensure that invalid data does not continue to reside in the Central Processing Units on the other bus. This crossover operation is effectuated by Cache Coherency Boards.
In some platforms such as Windows NT, it is sometimes possible to set what is called the “affinity” by directing that a certain program or application run only on, for example, CPU 1 and CPU 2. This minimizes the chance that an invalidation cycle for cache units must run across different busses, which then provides for superior performance.
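As an illustration of the affinity setting described above, processor affinity on Windows is commonly expressed as a bitmask in which bit n stands for processor n. The following is a minimal sketch of building and decoding such a mask. The function names are hypothetical, and zero-based processor numbering is an assumption of the sketch; it does not call any operating system interface.

```python
def affinity_mask(cpus):
    """Return a bitmask with bit n set for each zero-based processor index n."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("processor index must be non-negative")
        mask |= 1 << cpu
    return mask


def cpus_from_mask(mask):
    """Return the sorted processor indices whose bits are set in the mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Restrict an application to CPU 1 and CPU 2 (hypothetical example values).
mask = affinity_mask([1, 2])
print(bin(mask))
print(cpus_from_mask(mask))
```

A mask built this way is only useful for bus balancing once it is known which processor indices share a bus.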
However, another situation arises when, for example, there are 8 or 10 processors with multiple busses. In this situation, the “order” in which the CPUs come online is not the same as the order of the processors' number IDs. So, when looking at the operator screen, it is not possible to tell which processors are working on the same bus and which on another bus, and it is then difficult to execute an affinity operation.
There is no orderly numbering arrangement involved. Because of the way that the processors come online in an NT platform, the NT platform will number them in the order in which they initialize and come onto one of the busses.
In situations where there is only one system bus, for example a four-processor system which has only one system bus, there is no requirement on how to set the affinities. However, if there are two busses working with different groups of processors, then it is necessary to cross from one bus, say the left side CPU bus, to the right side CPU bus, and to go through cache coherency boards and the chipset of the other bus, which adds much undesirable latency to the completion of cache invalidation operations.
The optimum situation is what is called a balancing across the busses, so there is a balance of operations between the loads provided on one bus, and the loads provided on the other bus by the operating processors.
Thus, a considerable problem arises as to how to balance the load on the busses among the operating processors, how to minimize the need for cross-bus invalidation operations, and how to maximize the bandwidth of each of the busses.
In the NT platforms, the processors do not identify themselves according to the numbers involved. The Intel processors, for example, do not provide a CPU number, and they do not indicate on which bus a CPU will be operating in a multiple bus system. Thus, at any given time, it is not possible to know whether a particular processor is operating on the left side bus or the right side bus, and a question often arises: how can one tell whether two particular processors are on the same system bus or not?
Another type of problem arises when two Central Processing Units (CPUs) are sharing the same data. In this case, the system has to go back and forth between busses to operate the caches in the CPUs working on the different busses involved. If the memory arrays are coded so that the data is being shared by the processors, the latency for cache invalidation is worse when two separate busses are involved. The cache invalidation operation must then go over to a cache coherency board, which finds that the data to be invalidated is on the other bus, and the invalidation data must be transferred to that other bus.
It may be noted that if there is no balancing of loads across the multiple busses, and thus a lack of affinity in the multi-processor, multi-bus system, there could be a degradation in performance of 20% to 30% because of the need to cross CPU operations from one bus onto the other bus. However, if two CPUs are on the same bus, then this is an “affinity” type of operation, and the throughput and cache invalidation operations will be 20% to 30% better.
The problem also arises each time the system is rebooted or re-initialized. The processors come up and connect in a different numbered order, so that one group of processors is connected to one bus and another group of processors is connected to another bus, but there is no indication to the operator or user of the system as to which processor is operating with which bus after the rebooting.
Thus, it is very desirable for optimization of system operations that a method be set up for creating a “mapping” of the processors and busses on the system, so that certain operations with a certain group of processors can be relegated to one bus, and other groups of operations and processors can be relegated to the other bus. This provides for a tuning and balancing of loads in the system without undue interference of one bus with the other bus.
There is provided herein a method to determine which processors reside on the very same system bus in multi-processor systems having more than one system bus. As a result of this, it is then possible to “affinitize”, or operate, the system applications in a more efficient manner by causing one set of applications to operate on processors residing on one system bus, and other sets of applications to operate on another set of processors residing on a second system bus. This will provide for minimization of cross-bus invalidation traffic, thus to reduce latency of operations, and will also provide for greater throughput through more efficient usage of each processor set.
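By way of illustration only, the placement of application sets onto busses might be sketched as follows, assuming the processor-to-bus mapping is already known. The bus names, the load figures, and the greedy heaviest-first rule are all assumptions of this sketch and are not taken from the method described herein.

```python
def place_groups(bus_cpus, group_loads):
    """Assign each application group to the bus with the least load so far.

    bus_cpus:    dict mapping a bus name to the processor indices on that bus
    group_loads: dict mapping an application group name to an estimated load
    Returns (placement, totals), where placement maps each group to a
    (bus name, processor list) pair and totals maps each bus to its load.
    """
    totals = {bus: 0.0 for bus in bus_cpus}
    placement = {}
    # Place the heaviest groups first; break ties by name for repeatability.
    ordered = sorted(group_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for group, load in ordered:
        bus = min(totals, key=lambda b: (totals[b], b))
        totals[bus] += load
        placement[group] = (bus, sorted(bus_cpus[bus]))
    return placement, totals


# Hypothetical mapping and loads, for illustration only.
bus_cpus = {"bus0": [0, 3, 4, 7], "bus1": [1, 2, 5, 6]}
group_loads = {"db": 5, "web": 3, "batch": 2}
placement, totals = place_groups(bus_cpus, group_loads)
print(placement)
print(totals)
```

Each group is kept whole on a single bus, so that applications sharing cached data do not generate cross-bus invalidation traffic among themselves.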
The method takes advantage of the concept known as “false sharing”, which is used to determine where each of the processors resides (i.e., on the first bus or the second bus). False sharing involves a situation where each of the processors has two caches, L1 and L2, and where the L2 cache is shared by the data in the L1 cache. Two separate threads can then be made to execute on two selected different processors while both are accessing and updating data in the same arrays at the same time. The time spent performing this operation is noted and recorded for every possible combination of pairs of processors on the system busses.
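The collection of timings over all processor pairs can be sketched as a simple harness. In this sketch the actual measurement is passed in as a function, since pinning two threads to chosen processors and having them update the same arrays is platform-specific. The names `time_all_pairs` and `simulated_measure`, and the timing values used, are hypothetical stand-ins rather than real measurements.

```python
from itertools import combinations


def time_all_pairs(cpus, measure_pair):
    """Record the measured time for every unordered pair of processors."""
    return {(a, b): measure_pair(a, b) for a, b in combinations(sorted(cpus), 2)}


def simulated_measure(hidden_bus_of, same_bus_time=1.0, cross_bus_time=1.3):
    """Build a stand-in measurement function for demonstration.

    hidden_bus_of maps a processor index to its bus. A real measurement
    would not know this; it would time two threads updating shared arrays.
    """
    def measure(a, b):
        if hidden_bus_of[a] == hidden_bus_of[b]:
            return same_bus_time
        return cross_bus_time
    return measure


# Four processors whose bus assignment is hidden from the harness.
hidden = {0: "A", 1: "B", 2: "A", 3: "B"}
timings = time_all_pairs(range(4), simulated_measure(hidden))
for pair, seconds in sorted(timings.items()):
    print(pair, seconds)
```

For n processors the harness records n(n-1)/2 timings, one per pair.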
Once the throughput timings are collected for all possible processor combinations and the data is analyzed, the various affinity connections and system bus split operations become obvious: there will be seen to be a very large performance penalty in cases where the pair of processors resides on different system busses, while in cases where the pair of processors resides on the same system bus, there will be a highly efficient throughput. At this stage it is possible to know which particular CPUs are operating on which particular one of the busses.
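One possible way to carry out this analysis is sketched below. It assumes that the pair timings fall into two clearly separated clusters, takes the midpoint between the fastest and slowest timing as the dividing threshold, and merges processors whose pair timing falls below it. The threshold rule, the `min_gap` parameter, and the function name are assumptions of the sketch; the sketch expects at least one pair timing.

```python
def group_by_bus(timings, min_gap=0.1):
    """Group processors so that fast pairs end up in the same group.

    timings: dict mapping a (cpu_a, cpu_b) pair to its measured time
    min_gap: relative spread below which all processors are treated as
             sharing a single bus (an assumed cutoff)
    Returns a sorted list of sorted processor lists, one per inferred bus.
    """
    cpus = sorted({cpu for pair in timings for cpu in pair})
    fastest = min(timings.values())
    slowest = max(timings.values())
    if slowest - fastest < min_gap * fastest:
        return [cpus]  # no visible cross-bus penalty
    threshold = (fastest + slowest) / 2

    parent = {cpu: cpu for cpu in cpus}

    def find(cpu):
        # Follow parent links to the group representative, compressing paths.
        while parent[cpu] != cpu:
            parent[cpu] = parent[parent[cpu]]
            cpu = parent[cpu]
        return cpu

    for (a, b), seconds in timings.items():
        if seconds < threshold:
            parent[find(a)] = find(b)  # fast pair: same bus

    groups = {}
    for cpu in cpus:
        groups.setdefault(find(cpu), []).append(cpu)
    return sorted(groups.values())


# Hypothetical timings in which pairs (0, 2) and (1, 3) are fast.
timings = {(0, 1): 1.3, (0, 2): 1.0, (0, 3): 1.3,
           (1, 2): 1.3, (1, 3): 1.0, (2, 3): 1.3}
print(group_by_bus(timings))
```

With noisy real measurements, repeated runs and averaging would likely be needed before the two clusters separate cleanly.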