This invention is related to computer systems and particularly to a high speed remote storage controller.
Today's e-business environment places great demands on the computer systems that drive its infrastructure. This is especially true in the areas of system performance and availability due in large part to the increasing amount of data sharing and transaction processing inherent in large system applications. Another aspect of the e-business infrastructure is the unpredictability of the workloads, which mandates that the underlying computer systems be highly scalable. However, the importance of additional performance and scalability must always be tempered by the cost of the systems.
Historically, system architects have used various means to achieve high performance in large tightly coupled symmetrical multiprocessor (SMP) computer systems. They range from coupling individual processors or processor clusters via a single shared system bus, to coupling processors together in a cluster, whereby the clusters communicate using a cluster-to-cluster interface, to a centrally interconnected network where parallel systems built around a large number (e.g., 32 to 1024) of processors are interconnected via a central switch (e.g., a crossbar switch).
The shared bus method usually provides the most cost efficient system design since a single bus protocol can service multiple types of resources. Furthermore, additional processors, clusters or peripheral devices can be attached economically to the bus to grow the system. However, in large systems the congestion on the system bus coupled with the arbitration overhead tends to degrade overall system performance and yield low SMP efficiency. These problems can be formidable for symmetric multiprocessor systems employing numerous processors, especially if they are running at frequencies that are two to four times faster than the supporting memory subsystem.
The centrally interconnected system usually offers the advantage of equal latency to shared resources for all processors in the system. In an ideal system, equal latency allows multiple applications, or parallel threads within an application, to be distributed among the available processors without any foreknowledge of the system structure or memory hierarchy. These types of systems are generally implemented using one or more large crossbar switches to route data between the processors and memory. The underlying design often translates into large pin packaging requirements and the need for expensive component packaging. In addition, it can be difficult to implement an effective shared cache structure.
The tightly coupled clustering method serves as the compromise solution. In this application, the term cluster refers to a collection of processors sharing a single main memory, and whereby any processor in the system can access any portion of the main memory, regardless of its affinity to a particular cluster. Unlike Non-Uniform Memory Access (NUMA) architectures, the clusters referred to in our examples utilize dedicated hardware to maintain data coherency between the memory and the hierarchical caches located within each cluster, thus presenting a unified single image to the software, void of any memory hierarchy or physical partitions such as memory bank interleaves. One advantage of these systems is that the tightly coupled nature of the processors within a cluster provides excellent performance when the data remains in close proximity to the processors that need it, as is the case when data resides in a cluster's shared cache or the memory bank interleaves attached to that cluster. In addition, it usually leads to more cost-efficient packaging when compared to the large N-way crossbar switches found in the central interconnection systems. However, the clustering method can lead to poor performance if processors frequently require data from other clusters, and the ensuing latency is significant, or the bandwidth is inadequate.
One of the ways to combat the performance problem is the use of large shared caches within each cluster. Shared caches are inherently more efficient in large data sharing applications such as those typical of the e-business environment. But even in the most efficient system, the need eventually arises to transfer data across clusters. Therefore, system performance in these types of computer structures can be influenced by the latency involved with cross cluster data transfers. Historically, system performance issues tended to focus on processor fetch operations and minimizing the associated latency of data fetches from the hierarchical caches and main memory.
However, in complex systems like the IBM e-server Z-Series, the fetch is typically just one piece contributing to the system performance. For example, a fetch may necessitate casting aged data out of a clustered cache to make room for the desired fetch data. In addition, one processor's fetch may be competing for the inter nodal data busses with work from the other processors and/or I/O adapters. These operations involve not only fetches for other processors, but cast outs of aged data from a cache on one cluster to main memory on the remote cluster or fetches and stores from the I/O adapters. The need to accommodate all these types of inter nodal operations demands a multitude of large data busses between the clusters. Unfortunately, packaging restrictions typically limit the amount of available bandwidth on the inter nodal data bus. Therefore, to truly maximize overall system throughput, performance improvements must be made to all types of inter nodal data transfers, not just processor fetches.
With the disparate rate of advance between next-generation processors and memory, components such as the system memory controller become increasingly valuable to overall system throughput. The inventions cited herein provide many improvements in the area of memory and the corresponding controllers; however, they fail, both independently and in conjunction with each other, to address all aspects found in the present invention.
U.S. Pat. No. 5,664,162, entitled Graphics Accelerator with Dual Memory Controller, focuses on performing memory accesses with respect to a graphics processor. This invention teaches improvements pertaining to address format translations, frame buffer remapping, object drawing and other tasks related to rendering graphical images using a computer system. U.S. Pat. No. 5,239,639, entitled Efficient Memory Controller with an Independent Clock, provides a means to synchronize the timing of a memory controller with a CPU, without requiring the memory controller and CPU to share the same operating frequency. U.S. Pat. No. 5,896,492, entitled Maintaining Data Coherency Between a Primary Memory Controller and a Backup Memory Controller, describes a fault tolerant memory controller to ensure data availability in the event of a memory controller failure.
U.S. Pat. No. 5,835,947, entitled Central Processing Unit and Method for Improving Instruction Cache Miss Latencies Using an Instruction Buffer Which Conditionally Stores Additional Addresses, U.S. Pat. No. 3,611,315, entitled Memory Controller System for Controlling a Buffer Memory, and U.S. Pat. No. 5,778,422, entitled Data Processing System Memory Controller that Selectively Caches Data Associated with Write Requests, all concentrate on prefetching instructions or caching data accesses into memory buffers to reduce latency on subsequent CPU fetches. Although the aforementioned inventions teach various improvements in memory controllers, they all fail to address performance issues associated with accessing a shared memory in a symmetric multiprocessing (SMP) computer system.
U.S. Pat. No. 5,752,066, entitled Data Processing System Utilizing Programmable Microprogram Memory Controller, describes a single system-level interface to be presented to the operating system and application programs by allowing a plurality of memory configurations to be reprogrammed via micro code. Unlike our invention, this one provides a means to enhance or alter the functionality of the memory controller without the need to change the hardware, whereas our invention focuses mainly on solving performance issues associated with concurrent memory accesses in an SMP computer system. One skilled in the art would appreciate how the two inventions address unrelated topics, yet could be combined with each other to offer additional improvements upon each invention.
Finally, U.S. Pat. No. 5,815,167, entitled Method and Apparatus for Providing Concurrent Access by a Plurality of Agents to a Shared Memory, focuses on providing simultaneous access to a shared main memory by a memory controller and a graphics controller. Said invention achieves this by providing a dual data path and partitioning the memory into a section for system access and a frame buffer for use by the graphics controller. This invention, like all those cited in the prior art, fails to provide a means of improving general data accesses to a unified main memory equally accessible by a plurality of central and I/O processing units in a symmetric multiprocessing computer system. Furthermore, these inventions fail to address aspects related to maintaining proper shared cache coherency in such an environment.
The present Remote Storage Controller performs various storage operations and associated cache management functions on behalf of a requesting controller located on a remote cluster. The techniques described herein enable a multitude of operations to occur in a concurrent and high speed manner using a minimum of external control signals.
The present invention describes a unified Remote Storage Controller (hereinafter referred to as RSAR) which handles all types of inter nodal storage operations. This controller employs an optimized cache coherency scheme and the principles described in U.S. Pat. No. 6,038,651 entitled SMP Clusters with Remote Resource Management for Distributing Work to Other Clusters while Reducing Bus Traffic to a Minimum. The Remote Storage Controller system enlists a single controller to perform remote cast outs, store requests from an I/O adapter, main storage padding operations, and main memory move page operations. Although the primary role of RSAR is to perform remote data storage operations to main memory, it also handles cross cluster invalidations associated with maintaining bi-nodal cache coherency.
The preferred embodiment is incorporated into a Symmetric Multiprocessing System comprising a plurality of Central Processors, each having a private L1 cache, a plurality of I/O Adapters, and a main memory wherein any Processor or I/O Adapter can access any portion of the memory. The Processors and I/O Adapters are divided equally into two clusters. In addition, the main memory is comprised of banks or interleaves, half of which are attached to each cluster.
Within each cluster there exists a System Controller which consists of a system coherency management unit, shared cluster cache, various controllers, multiport data switch, and discrete interfaces (or ports) to every Processor, I/O Adapter, and the main memory. The cache represented in the present embodiment is comprised of a plurality of banks or interleaves and the contents are managed by a 16-way associative directory. The System Controller depicted in FIG. 1 illustrates the major functional elements and will be described further in the detailed description of the preferred embodiment. However, a brief overview of the System Controller within a single cluster is beneficial in understanding the aspects of the present invention.
The primary function of the System Controller is to process data fetch and store requests coherently between the Processors and I/O Adapters and the system's main memory. Since the System Controller contains a shared cache, which is architecturally invisible to the software and operating system, the System Controller is also responsible for performing directory and cache accesses. All incoming requests enter a port on the System Controller, where they are received by a Central Processor (CFAR) or I/O Controller. These controllers generate requests into a Central Priority unit which arbitrates among them and chooses one of the requesters to enter into one of two multistage Pipelines based on the address. During each stage of the pipeline the requester accesses and/or reserves various resources such as the cache, the Local Cache Fetch/Store Controllers, the data path controls, data path fifo buffers, the Remote Cache Fetch/Store Controllers, etc.
As requests exit the pipeline, one of the Local Fetch/Store Controllers assumes responsibility for managing the operation through completion. Often this requires additional passes through the pipeline; therefore, a Local Fetch/Store Controller must also participate in Central Priority arbitration, and is also considered a requester. In the present embodiment, we include the Cache Controller and the Main Memory Controller as part of the Local Fetch/Store Controllers. Between them they contain all the resources (including data path elements such as fifo buffers and cross point switches) necessary to access data from the cache interleaves, process data accesses to main memory when cache misses occur, perform store operations into the cache interleaves, and cast out aged data (using a Least Recently Used method) from the cache into main memory in order to make room for incoming data from main memory accesses.
As stated above, the main memory banks are physically distributed between the two clusters of the bi-nodal system. However, the main memory appears as a single unified entity to any of the Processors or I/O Adapters located anywhere in the SMP system. Therefore, the present embodiment incorporates an additional set of controllers, known as Remote Fetch/Store Controllers. The System Controller keeps track of which main memory addresses are assigned to the memory banks on each cluster. Whenever data accesses (fetch requests) miss the cache on the local cluster (where the term local refers to the cluster to which the originating Processor or I/O Adapter is attached), the Local Fetch/Store Controller must interrogate the remote (or "other") cluster to see if the data resides in that cache. These remote interrogations are processed by the Remote Fetch Controllers, which make requests into Central Priority and access resources in a similar fashion to the Local Fetch/Store Controllers.
In addition, if the data access misses the remote cache, but the address denotes that it belongs to a memory bank attached to the remote cluster, the Remote Fetch/Store Controller also interacts with the Main Memory Controller to initiate main memory accesses. For operations which necessitate storing data into memory (such as casting aged data out of the cache), the address once again determines whether the Local Fetch/Store Controller can process the entire operation or if a remote store operation must be initiated across the bi-nodal interface. In this situation, the remote store operations are processed by the Remote Store Controller, which also interacts with the Main Memory Controller to store the data into the memory interleaves. As with the Local Fetch/Store Controllers, their remote counterparts also contain all the resources (including data paths, fifo buffers, and cross point switches) necessary to process inter-cluster operations.
The present invention also interacts with a remote management system for managing the resources comprising the aforementioned Remote Fetch/Store Controllers, and to distribute work to these Remote Fetch/Store Controllers, which, in turn, act as agents to perform the desired operation without requiring knowledge of the requester who initiated the work request. Work is distributed only when a remote resource is available for processing the work, without a need for constant communication between multiple clusters of symmetric multiprocessors.
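The fetch routing just described can be illustrated with a minimal sketch. It is not the hardware implementation: the `Cluster` class, the `home_cluster` mapping (a single address boundary), and the step strings are hypothetical names introduced only for illustration, and the actual address-to-bank assignment is not specified in this description.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: int
    # Line addresses currently held in this cluster's shared cache.
    cache: set = field(default_factory=set)


def home_cluster(line_addr: int, boundary: int) -> int:
    """Hypothetical mapping: addresses below `boundary` live in cluster 0's banks."""
    return 0 if line_addr < boundary else 1


def resolve_fetch(line_addr: int, local: Cluster, remote: Cluster, boundary: int) -> list:
    """Return the ordered steps taken to satisfy a fetch issued on `local`."""
    if line_addr in local.cache:
        return ["local cache hit"]
    steps = ["local cache miss: interrogate remote cache"]
    if line_addr in remote.cache:
        steps.append("remote cache hit: data returned across the interface")
    elif home_cluster(line_addr, boundary) == remote.cluster_id:
        steps.append("remote miss: remote controller accesses remote main memory")
    else:
        steps.append("remote miss: local controller accesses local main memory")
    return steps
```

In this sketch the remote cluster is interrogated on every local miss, and the target address is consulted only after both caches miss, which mirrors the order of events in the two preceding paragraphs.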
In a large system such as an IBM e-server Z-Series, shared access to the cluster cache is controlled by a centralized pipeline. All requests from processors (including I/O adapters) and remote fetch and store controllers (RFAR and RSAR) must initially arbitrate for priority to enter the central pipe and obtain directory information. Based on the directory state, additional pipe passes may be necessary. Once a requester enters the pipe, a series of interlocks ensures that the desired line can't be stolen out from under the requester. Historically, these interlocks are based on a partial line address corresponding to the address of the directory row (a.k.a. the congruence class), as opposed to a full line address. In cases where a requester must make multiple pipe passes, not only is the desired line locked until the operation completes, but so are all other lines in that congruence class. If another requester desires a different line in the same congruence class, it must wait on the first operation.
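The effect of interlocking on a partial address can be sketched as follows. This is an illustrative model only: the number of directory rows (`NUM_ROWS`) and the `ClassInterlock` class are assumptions made for the example, while the 256 byte line size is taken from the description of the cache line elsewhere in this document.

```python
LINE_SIZE = 256   # bytes per cache line, per this description
NUM_ROWS = 1024   # hypothetical number of directory rows (congruence classes)


def congruence_class(addr: int) -> int:
    """Partial line address used to select the directory row."""
    return (addr // LINE_SIZE) % NUM_ROWS


class ClassInterlock:
    """Locks a whole congruence class rather than a single line."""

    def __init__(self):
        self._holder = {}  # congruence class -> requester currently holding it

    def try_enter(self, requester: str, addr: int) -> bool:
        cc = congruence_class(addr)
        holder = self._holder.get(cc)
        if holder is not None and holder != requester:
            return False  # a different line in the same class still blocks
        self._holder[cc] = requester
        return True

    def release(self, requester: str, addr: int) -> None:
        cc = congruence_class(addr)
        if self._holder.get(cc) == requester:
            del self._holder[cc]
```

Two distinct lines whose addresses differ by a multiple of `LINE_SIZE * NUM_ROWS` map to the same class, so the second requester must wait even though it wants a different line.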
The present invention employs an asynchronous remote cast out method which enables inter nodal cast outs to completely bypass the central pipeline on the remote cluster and make an immediate request to store the data to main memory. This is possible because our invention benefits from two design advances in the System Controller. First, the System Controller (SC) incorporates a strong store ordering scheme in which a line can only exist in a changed state in one of the nodes at any time. Essentially, if a central processor (CP) wants to update a line, it must request exclusive ownership, which mandates that all other requesters must relinquish ownership. At this point, the CP can change the line and the resulting update only exists in the cluster "local" to the updating CP. Subsequently, if another requester on the remote cluster desires ownership of that line, the cache management scheme requires that the line be transferred from the local to the remote cache as part of the fetch operation. Thus, the final state shows the line invalidated in the local cluster, and valid and changed in the remote cluster. On the other hand, if a processor on the remote cluster desires read-only access, a copy of the line will be sent to the other cluster, but remains changed only on the local side. Finally, if the data exists read-only in multiple nodes, has the changed line status active on a remote node, and is requested exclusively by a CP on the local node, the cache coherency scheme of the preferred embodiment results in the remote copy of the data being invalidated and the changed line status being transferred to the local node.
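The ownership rules in the preceding paragraph can be summarized with a small two-node model. This is a sketch under stated assumptions, not the directory format of the preferred embodiment: the `LineState` fields and function names are hypothetical, and the demotion of the owner from exclusive on a read-only fetch is an assumption not stated in the text.

```python
from dataclasses import dataclass


@dataclass
class LineState:
    valid: bool = False
    exclusive: bool = False
    changed: bool = False


def fetch_exclusive(requester: LineState, other: LineState) -> None:
    """Requester gains exclusive ownership; the other node relinquishes its copy."""
    # The changed status travels with the line to the requesting node.
    requester.changed = requester.changed or other.changed
    requester.valid = True
    requester.exclusive = True
    other.valid = other.exclusive = other.changed = False


def fetch_read_only(requester: LineState, other: LineState) -> None:
    """Requester receives a copy; the changed status stays where it already was."""
    requester.valid = True
    requester.exclusive = False
    other.exclusive = False  # assumption: the owner is demoted to shared


def store(owner: LineState) -> None:
    """A CP may update a line only while holding it exclusively."""
    if not (owner.valid and owner.exclusive):
        raise PermissionError("exclusive ownership required before updating a line")
    owner.changed = True
```

Every path through these functions leaves the changed flag set in at most one node, which is the property that allows a remote cast out to skip the directory lookup on the receiving cluster.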
The second aspect of the asynchronous cast out is that the SC utilizes the concept of a high speed remote interface controller which dispatches work to fetch (RFAR) and store (RSAR) controllers on the remote side on behalf of a sister controller on the local side (LFAR and LSAR). The LFAR and LSAR controllers are the "masters" of the line fetch and store operations, and in those cases where the data resides on the local side, they are the only controllers involved. But for those scenarios where the data must be acquired from, or stored to, the remote side, then the work is passed to a matching RFAR or RSAR on the other cluster. In the case of a cast out operation, LSAR, on the cluster where the requester resides, is the master controller. It has the responsibility to analyze the directory state and determine if the line needs to be cast out locally or remotely. Since the data can only be changed in one cache, a cast out operation will always begin with an LSAR attached to that cluster. Therefore, the only role of RSAR is to deliver data destined for the other cluster to main memory. RSAR decodes the work request dispatched to it by the high speed remote interface controller and if it's a cast out, the data is stored in a buffer while an immediate request is made to main memory. As previously stated, this operation bypasses the central pipeline which means the data isn't held up waiting for an interlock with another request for the same congruence class to clear.
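The decode step performed by RSAR can be pictured with the following sketch. The class name, the command strings, and the two request lists are hypothetical stand-ins chosen for illustration; the actual command encoding is not given in this description.

```python
from collections import deque


class RemoteStoreController:
    """Toy model of RSAR's decode step for incoming work requests."""

    def __init__(self):
        self.buffer = None
        # Requests sent straight to the main memory controller.
        self.memory_requests = []
        # Requests that must arbitrate for the central pipeline.
        self.pipeline_requests = deque()

    def dispatch(self, command: str, addr: int, data: bytes) -> str:
        self.buffer = data  # data is held in a buffer in either case
        if command == "cast_out":
            # Asynchronous path: no directory access, no congruence class interlock.
            self.memory_requests.append((addr, data))
            return "memory"
        # Synchronous path: directory analysis is required first.
        self.pipeline_requests.append((command, addr))
        return "pipeline"
```

Only the cast out takes the direct path to memory; any other command in this sketch falls through to pipeline arbitration.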
Another aspect of the cache management scheme employed in the present invention is the ability to track line change status on a half line basis. In other words, if a line of data (256 bytes) must be aged out of the cache to make room for a newly requested line, but the changed bytes all exist in either the lower half or upper half, then LSAR will arrange for a transfer of only the 128 bytes that contain the changes. Although RSAR bypasses the pipe in either type of cast out, it does differentiate between the two types to reduce the inter nodal and main memory bandwidth required. Lastly, the present implementation of the asynchronous cast out mechanism permits more efficient management of the main memory banks by reducing the latency from the time the cast out operation begins on the local side to the time the request arrives at the remote main memory controller. For example, data can be cast out to idle memory banks while other types of storage ops (e.g., I/O Stores) expend time negotiating pipe arbitration and multiple pipe passes.
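Half line change tracking can be illustrated with a short sketch. The function name and the two boolean flags are hypothetical; only the 256 byte line and 128 byte half line sizes come from the description above.

```python
LINE_SIZE = 256          # bytes per cache line
HALF = LINE_SIZE // 2    # 128 byte half line


def cast_out_payload(line: bytes, lower_changed: bool, upper_changed: bool):
    """Return (offset, data) to transfer, or None if the line is unchanged."""
    if len(line) != LINE_SIZE:
        raise ValueError("expected a full 256 byte line")
    if lower_changed and upper_changed:
        return 0, line               # full line cast out
    if lower_changed:
        return 0, line[:HALF]        # lower 128 bytes only
    if upper_changed:
        return HALF, line[HALF:]     # upper 128 bytes only
    return None                      # nothing needs to be written back
```

When only one half is changed, the transfer occupies the data bus for half as many bytes, which is the bandwidth saving the paragraph describes.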
Our invention enables the same RSAR that handles asynchronous cast outs to operate in a traditional synchronous fashion (like LSAR, LFAR, RFAR, etc.) to handle other types of store operations which necessitate performing directory accesses, updates, and interlocks. The various operations range from storage padding, to main memory move page store ops, to the I/O Adapter storing into main memory. In the first two cases, the data is always sent to main memory. If the target address is the remote cluster, then LSAR immediately transmits the data across the clusters with the appropriate command. Unlike remote cast outs, the data may exist in the remote cache, so our invention must enter the central pipeline for purposes of analyzing the directory state. On a miss, an immediate request can be sent to the memory controller. If the data hits in a "read-only" state, RSAR will invalidate the directory entry and broadcast cross-invalidates (XIS) to the processors. If the data hits with "exclusive" ownership, RSAR must coordinate the completion of any pending CP Stores with the required invalidations of said CP stores before it can permit the memory store to complete. In cases where the storage padding or move page store op targets the local cluster, the possibility still exists for the data to reside in the remote cache. In this situation, LSAR sends only an invalidation command but leaves the data in the local cluster. Once again, RSAR performs the necessary directory access and/or updates, but doesn't need to perform any transfer to main memory.
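The directory-dependent behavior for storage padding and move page store operations can be sketched as a simple decision function. The `DirState` enumeration, the `data_sent` flag, and the action strings are hypothetical names used only to make the sequence concrete; they do not correspond to actual signal names.

```python
from enum import Enum


class DirState(Enum):
    MISS = "miss"
    READ_ONLY = "read-only"
    EXCLUSIVE = "exclusive"


def sync_store_actions(state: DirState, data_sent: bool) -> list:
    """Ordered RSAR actions for a storage pad or move page store op.

    `data_sent` is True when the target address belongs to RSAR's own cluster,
    so LSAR shipped the data; False when LSAR sent only an invalidation command.
    """
    actions = ["enter central pipeline", "analyze directory state"]
    if state is DirState.READ_ONLY:
        actions += ["invalidate directory entry", "broadcast cross-invalidates"]
    elif state is DirState.EXCLUSIVE:
        actions.append("coordinate pending CP stores and required invalidations")
    if data_sent:
        actions.append("store data to main memory")
    return actions
```

Unlike the asynchronous cast out, every case here begins with a pipeline pass, because the line may be present in the remote cache.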
The most complex type of synchronous operation that the present invention incorporates is stores from the I/O Adapter into main memory. Unlike the storage pad and move page store ops, I/O stores will overwrite cache data in cases where data hits in either cluster. Furthermore, the operation is complicated by the fact that an I/O Adapter can issue a store at any time without first requesting exclusivity to the line. Thus, the data can exist in almost any directory state at the start of the operation. When an I/O store is first received by the SC, LSAR on the local side performs a preliminary directory analysis. In cases where the I/O store hits the local cache, the data will be stored into the local cache. Thus, RSAR doesn't need to participate in any data transfer. If the data hits read only in both caches, LFAR must send a read-only invalidation to the other cluster, and although the present invention could easily handle that request, for reasons of simplicity the SC encompassing the present embodiment uses the Remote Fetch Controller (RFAR) to perform these read-only invalidations.
I/O stores that miss the local cache always involve the present invention, and the method for handling the store is governed by the target address. If the I/O store targets the remote side, LSAR will immediately dispatch the data with the command. RSAR will perform the same directory analysis and update actions as with the storage pad and move page store operations. In the event of a hit in the remote L2 cache, RSAR will transfer the I/O store data into the remote cache, instead of requiring the data to be cast out to main memory. For I/O stores that target the local side, LSAR sends over a special query command. This command serves two purposes:
1. It allows RSAR to interrogate the remote directory and determine if the data resides in the remote cache
2. In the case where the data hits, it serves as a means for holding RSAR valid while LSAR subsequently transfers the data.
In the aforementioned cases, RSAR uses an innovative technique of "locking" the resource during the query in anticipation of the need to subsequently transfer the data. If it turns out to be a case where no directory update needs to occur, or the directory only needs to be invalidated, RSAR will release itself once it returns the final response. However, if it is a case where the data needs to be transferred between clusters, RSAR will remain valid after the query response is returned. LSAR will ensure that the next command sent to that RSAR is the I/O Store data transfer. One advantage to this method is it only requires the use of a single LSAR/RSAR pair to handle all aspects of the I/O store. Another advantage is the prevention of deadlocks or changes in directory state which can result in leaving the line "unlocked" during the time between the query response and the reception of the I/O Store data transfer. One final advantage of the present invention is the ability to handle 64 byte and 128 byte I/O Stores. Like the asynchronous cast outs, this enables better inter nodal bandwidth by only tying up the data busses for half the transfer time if the I/O adapter only needs to update 64 bytes of storage.
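The query-then-hold behavior can be sketched as a small state machine. This is an illustration under assumptions: the `QueryRSAR` class, its method names, and the `required_action` table (which stands in for the real directory analysis) are hypothetical, and the sketch does not model the 64 byte versus 128 byte distinction.

```python
class QueryRSAR:
    """Toy model of RSAR holding itself valid between a query and the data transfer."""

    def __init__(self, required_action: dict):
        # Hypothetical stand-in for directory analysis:
        # line address -> "invalidate" or "transfer"; absent means "none".
        self.required_action = required_action
        self.valid = False      # True while the controller is reserved
        self.held_addr = None
        self.cache = {}         # remote cache contents written by I/O stores

    def query(self, addr: int) -> str:
        self.valid = True  # resource is locked for the duration of the query
        action = self.required_action.get(addr, "none")
        if action == "transfer":
            self.held_addr = addr   # remain valid, awaiting the I/O store data
        else:
            self.valid = False      # release once the final response is returned
        return action

    def io_store_data(self, addr: int, data: bytes) -> None:
        if not self.valid or self.held_addr != addr:
            raise RuntimeError("RSAR is not holding this line")
        self.cache[addr] = data
        self.valid = False
        self.held_addr = None

    def accepts_new_work(self) -> bool:
        return not self.valid
```

Because the controller refuses new work while it is held, the line cannot change state between the query response and the arrival of the data, and only one LSAR/RSAR pair is involved.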
Our invention offers several improvements over the remote store controller implemented in the prior generation S/390 G5 and G6 Enterprise Servers. As previously stated, the present invention exploits the cache coherency scheme to permit asynchronous cast outs to main memory. Previous RSARs utilized the centralized pipeline for all data stores to main memory, which not only delayed the initiation of the store request, but also introduced the potential for the operation to be rejected out of the pipeline, thereby necessitating a recycling of the operation.
Lastly, the prior design point used a complex mechanism for handling I/O stores. In cases where the I/O store targeted the local cluster and missed the local cache, a special "force cast out" command would be sent to the remote RSAR. This command required the RSAR to query the remote directory and if the data was resident, it would enlist an LSAR on the remote cluster to initiate an immediate cast out of the data. That LSAR would, in turn, send a remote cast out operation back to the local cluster, thereby necessitating an RSAR on the local side. In certain situations, the local RSAR could be busy, thus delaying the cast out and impeding the entire I/O store. In the worst case, the local RSAR could be busy processing a different operation which has an address compare against the I/O store, thus creating a deadlock. The prior systems contained a great deal of logic to detect these cases and break the deadlocks, which led to increased design complexity. It also had the drawback of requiring 4 resources (the local LSAR, remote RSAR, remote LSAR and local RSAR) in order to complete I/O store scenarios requiring forced cast outs. The present invention offers greater design simplicity in addition to a more efficient approach to handling I/O stores.
Although the present invention is being described in association with the present preferred embodiment, one skilled in the art will appreciate that the concepts disclosed herein are applicable to systems comprising more than two clusters, and utilizing Storage Clusters differing from our present embodiment. Additionally, the present invention contemplates alternate System Controller embodiments with a different number and configuration of functional units, including, but not limited to, the cache structure, the main memory organization, the number and size of data path resources (such as buffers, control busses, etc.), the composition of the various controllers, and the number and size of the Pipelines.
These and other improvements are set forth in the following detailed description. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.