1. Field of the Invention
The present invention relates generally to data processing systems employing multiple instruction processors and more particularly relates to multiprocessor data processing systems employing a hardware doorbell type interface to indicate a new entry on a server work queue.
2. Description of the Prior Art
It is known in the art that the use of multiple instruction and input/output processors operating out of common memory can produce problems associated with the processing of obsolete memory data by a first processor after that memory data has been updated by a second processor. The first attempts at solving this problem tended to use logic to lock processors out of memory spaces being updated. Though this is appropriate for rudimentary applications, as systems become more complex, the additional hardware and/or operating time required for the setting and releasing of locks cannot be justified, except for security purposes. Furthermore, reliance on such locks directly prohibits certain types of applications such as parallel processing.
The use of hierarchical memory systems tends to further compound the problem of data obsolescence. U.S. Pat. No. 4,056,844 issued to Izumi shows a rather early approach to a solution. The system of Izumi utilizes a buffer memory dedicated to each of the processors in the system. Each processor accesses a buffer address array to determine if a particular data element is present in its buffer memory. An additional bit is added to the buffer address array to indicate invalidity of the corresponding data stored in the buffer memory. A set invalidity bit indicates that the main storage has been altered at that location since loading of the buffer memory. The validity bits are set in accordance with the memory store cycle of each processor.
U.S. Pat. No. 4,349,871 issued to Lary describes a bussed architecture having multiple processing elements, each having a dedicated cache memory. According to the Lary design, each processing unit manages its own cache by monitoring the memory bus. Any invalidation of locally stored data is tagged to prevent use of obsolete data. The overhead associated with this approach is partially mitigated by the use of special purpose hardware and through interleaving the validity determination with memory accesses within the pipeline. Interleaving of invalidity determination is also employed in U.S. Pat. No. 4,525,777 issued to Webster et al.
Similar bussed approaches are shown in U.S. Pat. No. 4,843,542 issued to Dashiell et al, and in U.S. Pat. No. 4,755,930 issued to Wilson, Jr. et al. In employing each of these techniques, the individual processor has primary responsibility for monitoring the memory bus to maintain currency of its own cache data. U.S. Pat. No. 4,860,192 issued to Sachs et al, also employs a bussed architecture but partitions the local cache memory into instruction and operand modules.
U.S. Pat. No. 5,025,365 issued to Mathur et al, provides a much enhanced architecture for the basic bussed approach. In Mathur et al, as with the other bussed systems, each processing element has a dedicated cache resource. Similarly, the cache resource is responsible for monitoring the system bus for any collateral memory accesses which would invalidate local data. Mathur et al, provide a special snooping protocol which improves system throughput by updating local directories at times not necessarily coincident with cache accesses. Coherency is assured by the timing and protocol of the bus in conjunction with timing of the operation of the processing element.
An approach to the design of an integrated cache chip is shown in U.S. Pat. No. 5,025,366 issued to Baror. This device provides the cache memory and the control circuitry in a single package. The technique lends itself primarily to bussed architectures. U.S. Pat. No. 4,794,521 issued to Ziegler et al, shows a similar approach on a larger scale. The Ziegler et al, design permits an individual cache to interleave requests from multiple processors. This design resolves the data obsolescence issue by not dedicating cache memory to individual processors. Unfortunately, this provides a performance penalty in many applications because it tends to produce queuing of requests at a given cache module.
The use of a hierarchical memory system in a multiprocessor environment is also shown in U.S. Pat. No. 4,442,487 issued to Fletcher et al. In this approach, each processor has dedicated and shared caches at both the L1 or level closest to the processor and at the L2 or intermediate level. Memory is managed by permitting more than one processor to operate upon a single data block only when that data block is placed in shared cache. Data blocks in dedicated or private cache are essentially locked out until placed within a shared memory element. System level memory management is accomplished by a storage control element through which all requests to shared main memory (i.e. L3 level) are routed. An apparent improvement to this approach is shown in U.S. Pat. No. 4,807,110 issued to Pomerene et al. This improvement provides prefetching of data through the use of a shadow directory.
A further improvement to Fletcher et al, is seen in U.S. Pat. No. 5,023,776 issued to Gregor. In this system, performance can be enhanced through the use of store around L1 caches used along with special write buffers at the L2 intermediate level. This approach appears to require substantial additional hardware and entails yet more functions for the system storage controller.
Inherent in architectures which employ cache memory is that the storage capacity of the cache is substantially less than that of the memory located at lower levels in the hierarchy. As a result, memory locations within the cache memory must often be cleared for use by other data quantities more recently needed by the instruction processor. For store-in cache memories, this means that those quantities modified by the instruction processor must first be rewritten to system memory before the corresponding location is available to store newly requested data. This "flushing" process tends to delay the availability of the newly requested data. Newer Input/Output interface protocols, such as InfiniBand, require the use of queue structures in main system memory to hold work request entries and a Doorbell type interface to inform the hardware that a new entry has been added. For best performance both the queue data and the Doorbell are to be located in the virtual address space of the application. In a typical system, a single hardware unit will support many applications, each with multiple work queues.
Current state of the art for hardware Doorbells requires a single memory mapped register allocated on a software page boundary (typically 4k bytes) so the Operating System can manage the location in its normal virtual-to-physical address translation mechanism. This wastes most of the page space reserved for each Doorbell and, when multiple queues are in use, requires a very large memory mapped space to be assigned to the hardware. A lesser used option is not to use a Doorbell but to require the hardware to poll each queue for flags indicating added entries. This requires additional memory bandwidth for the polling, and the time between successive investigations of a single queue increases with the number of queues enabled.
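The cost of both prior art options can be illustrated with simple arithmetic. The following is a minimal sketch only: the 4k byte page size is taken from the text above, while the doorbell register width, the per-queue polling time, and all function names are assumptions introduced for illustration.

```python
PAGE_BYTES = 4096      # typical software page size cited in the text
DOORBELL_BYTES = 8     # assumed register width; not stated in the source


def doorbell_footprint(num_queues):
    """Memory mapped space consumed when each Doorbell occupies its own page."""
    mapped = num_queues * PAGE_BYTES
    used = num_queues * DOORBELL_BYTES
    return {
        "mapped_bytes": mapped,
        "used_bytes": used,
        "wasted_fraction": 1 - used / mapped,
    }


def polling_revisit_interval(num_queues, poll_time_us):
    """Round-robin polling: time between successive visits to one queue.

    poll_time_us is a hypothetical cost of inspecting a single queue.
    """
    return num_queues * poll_time_us
```

Under these assumptions, 1000 queues reserve 4,096,000 bytes of memory mapped space, of which only 8000 bytes hold Doorbell registers, and the interval between visits to any one polled queue grows linearly with the number of queues enabled.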
The present invention overcomes the problems found in the prior art by providing a method of and apparatus for the cache memory coherency hardware to assist in generating the Doorbell type indication within a server platform.
The preferred mode of the present invention includes up to four main memory storage units. Each is coupled directly to each of up to four "pods". Each pod contains a level three cache memory coupled to each of the main memory storage units. Each pod may also accommodate up to two input/output modules.
Each pod may contain up to two sub-pods, wherein each sub-pod may contain up to two instruction processors. Each instruction processor has two separate level one cache memories (one for instructions and one for operands) coupled through a dedicated system controller, having a second level cache memory, to the level three cache memory of the pod.
Each instruction processor has a dedicated system controller associated therewith. A separate dayclock is located within each system controller.
Unlike many prior art systems, both level one and level two cache memories are dedicated to an instruction processor within the preferred mode of the present invention. The level one cache memories are of two types. Each instruction processor has an instruction cache memory and an operand cache memory. The instruction cache memory is a read-only cache memory primarily having sequential access. The level one operand cache memory has read/write capability. In the read mode, it functions much as the level one instruction cache memory. In the write mode, it is a semi-store-in cache memory, because the level two cache memory is also dedicated to the instruction processor.
In accordance with the present invention, the level two cache memory is of the store-in type. Therefore, the most current value of an operand which is modified by the corresponding instruction processor is first located within the level two cache memory. When the replacement algorithm for the level two cache memory determines that the location of that operand must be made available for newly requested data, that operand must be "flushed" into the lower level memory to avoid a loss of the most current value.
Waiting for flushing of the old data before requesting the new data induces unacceptable latency. Therefore, according to the present invention, a flush buffer is provided for temporary storage of the old data during the flushing process. Though this temporary storage appears at first to be a mere extension to the level two storage capacity, it greatly enhances efficiency because the flush process really does not need to utilize the level two cache memory.
The old data is moved from the level two cache memory to the flush buffer as soon as the replacement algorithm has determined which data to move, and the newly requested data is requested from the lower level memory. The flush process subsequently occurs from the flush buffer to the lower level of memory without further reference to the level two cache. Furthermore, locations within the level two cache memory are made available for the newly requested data well before that data has been made available from the lower level memory.
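The ordering described above (old data moved to the flush buffer, new data requested, write-back completed later) can be sketched in software. This is a toy model under stated assumptions, not the hardware of the preferred mode: the class name, the oldest-first replacement policy, and the event trace are hypothetical, and a dictionary stands in for the lower level memory.

```python
from collections import OrderedDict


class L2Cache:
    """Toy store-in cache with a flush buffer (illustrative model only)."""

    def __init__(self, capacity, lower_memory):
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> (data, dirty), oldest first
        self.flush_buffer = {}       # address -> data awaiting write-back
        self.lower = lower_memory    # dict standing in for lower level memory
        self.events = []             # ordered trace of what happened

    def _evict_if_full(self):
        if len(self.lines) < self.capacity:
            return
        victim, (data, dirty) = self.lines.popitem(last=False)
        if dirty:
            # Old data leaves the cache at once, freeing the location.
            self.flush_buffer[victim] = data
            self.events.append(("to_flush_buffer", victim))

    def write(self, addr, data):
        if addr not in self.lines:
            self._evict_if_full()
        self.lines[addr] = (data, True)
        self.lines.move_to_end(addr)

    def read(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return self.lines[addr][0]
        self._evict_if_full()
        # The request for new data is issued before the flush completes.
        self.events.append(("request_new", addr))
        if addr in self.flush_buffer:
            data = self.flush_buffer[addr]   # newest value is still pending
        else:
            data = self.lower[addr]
        self.lines[addr] = (data, False)
        return data

    def drain_flush_buffer(self):
        """Write-back proceeds without further reference to the cache lines."""
        for addr, data in self.flush_buffer.items():
            self.lower[addr] = data
            self.events.append(("flushed", addr))
        self.flush_buffer.clear()
```

In this sketch the event trace for a miss on a full cache records the move to the flush buffer, then the request for new data, and only later the write-back, which mirrors the sequence described in the text.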
In accordance with the preferred mode of the present invention, the hardware that handles the work queue is called a host channel adaptor (HCA), which is able to handle thousands of work queues at the same time. As the external interface speed increases, the HCA may be integrated more closely into the system's memory controller/crossbar structure. An example of an HCA is an InfiniBand Host Channel Adaptor.
The use of Doorbells to alert the HCA of an entry on a queue is a commonly used procedure. Currently, the use of a memory mapped register requires the reservation of a full memory page in order to assign the virtual address. With hundreds or thousands of queue pairs planned, this can result in a large number of mostly wasted pages, in addition to managerial frustration. When the HCA is fully integrated into the chip set, several alternatives are possible. A preferred option is to utilize the system coherency protocol, where the system cache uses a Modified/Exclusive/Shared/Invalid (MESI) type protocol for coherency.
Because the HCA maintains a copy of a cache line, any time a processor updates (requests ownership of) the cache line, the HCA is informed via a snoop/purge operation. The hardware uses this as an internal signal that the queue has been updated. The cache line words can also include an Offset pointer to the next entry in the queue or some other indicator. By again obtaining a copy of the cache line, the data is returned and the alert is re-enabled. The software may have written multiple entries on the queue before the new copy was requested due to HCA workload. The other information in the cache line may contain counters or other data that may be useful in processing the queue entries. A control type cache line is assigned to each queue.
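The alert and re-enable cycle described above can be modeled in a few lines. The following is a sketch under stated assumptions: the class and function names are hypothetical, the coherency protocol is reduced to a set of agents holding a copy, and the control cache line is assumed to carry only an Offset pointer.

```python
class QueueHeaderLine:
    """Control cache line for one work queue (fields are illustrative)."""

    def __init__(self):
        self.next_offset = 0    # Offset pointer to the next entry in the queue
        self.sharers = set()    # agents currently holding a copy of the line


class HCA:
    """Toy host channel adaptor that treats a snoop/purge as a Doorbell."""

    def __init__(self, line):
        self.line = line
        self.consumed = 0       # entries already processed
        self.alerts = 0         # number of Doorbell indications seen
        self.armed = False

    def fetch_copy(self):
        """Obtain a copy of the line; this returns data and re-enables the alert."""
        self.line.sharers.add(self)
        self.armed = True
        return self.line.next_offset

    def snoop_purge(self):
        """Called when a processor requests ownership of the line."""
        if self.armed:
            self.armed = False
            self.alerts += 1

    def service(self):
        """Re-read the line and report how many new entries were posted."""
        new_tail = self.fetch_copy()
        added = new_tail - self.consumed
        self.consumed = new_tail
        return added


def processor_post(line, n_entries=1):
    """Software adds entries: taking ownership purges every other copy."""
    for agent in list(line.sharers):
        agent.snoop_purge()
    line.sharers.clear()
    line.next_offset += n_entries
```

In this model, three posts made before the HCA re-reads the line produce a single alert, and the following `service()` call reports all three entries, which corresponds to the case where software writes multiple entries before the new copy is requested.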
Any cache line may be used, so the software only needs to pin the user's normal page, register the virtual address with the HCA, and register a cache line within the addressed page as the header. The HCA will then request a copy and wait for the snoop operation. Because the logic is much like the normal cache logic in the HCA hardware, this function is easily integrated into the cache controller logic. Many cache lines can be maintained as copies. A platform that has a directory/snoop filter that can cover the full size of outstanding cached lines eliminates the unnecessary snoop/purge operations caused by filter space age out. Once the cache line is deregistered, the HCA will ignore future snoops (except to honor the protocol).
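The registration steps above can be sketched as a small lookup structure. This is an illustration only: the 64 byte cache line size, the class and method names, and the use of a single address space (no virtual-to-physical translation is modeled) are all assumptions not taken from the text.

```python
PAGE_BYTES = 4096        # typical software page size
CACHE_LINE_BYTES = 64    # assumed cache line size; not stated in the source


class DoorbellRegistry:
    """Toy table of header cache lines the HCA is watching."""

    def __init__(self):
        self.by_line = {}    # header line address -> queue identifier

    def register(self, page_addr, header_offset, queue_id):
        """Register one cache line within a pinned page as a queue header."""
        if page_addr % PAGE_BYTES != 0:
            raise ValueError("page address must be page aligned")
        if not 0 <= header_offset < PAGE_BYTES:
            raise ValueError("header must lie within the page")
        if header_offset % CACHE_LINE_BYTES != 0:
            raise ValueError("header must be cache line aligned")
        line_addr = page_addr + header_offset
        self.by_line[line_addr] = queue_id
        return line_addr

    def deregister(self, line_addr):
        """After this, snoops for the line no longer raise an alert."""
        self.by_line.pop(line_addr, None)

    def on_snoop(self, addr):
        """Return the queue to service, or None if the snoop is to be ignored."""
        line_addr = addr - (addr % CACHE_LINE_BYTES)
        return self.by_line.get(line_addr)
```

Because any cache line in the page may serve as the header, many queues can share one page in this sketch, in contrast to the one page per Doorbell arrangement of the memory mapped register approach.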