1. Field of the Invention
The present invention relates to computing systems, and more particularly, to improving performance of clustered systems.
2. Background
Conventional computer systems typically include several functional components. These components may include a central processing unit (CPU), main memory, input/output (“I/O”) devices, and streaming storage devices (for example, tape drives), each referred to herein as a “storage device”.
In conventional systems, the main memory is coupled to the CPU via a system bus or a local memory bus. The main memory is used to provide the CPU access to data and/or program information that is stored in main memory at execution time. Typically, the main memory is composed of random access memory (RAM) circuits. A computer system with the CPU and main memory is often referred to as a host system.
The storage device is coupled to the host system via a controller that handles complex details of interfacing the storage device to the host system. Communication between the host system and the controller is usually provided using one of a variety of standard I/O bus interfaces.
Typically, when data is read from a storage device, a host system sends a read command to the controller, which stores the read command in a buffer memory. Data is then read from the storage device and stored in the buffer memory.
Commercial success of server and data storage systems triggered the evolution of “clustered systems” that use standard microprocessors, memory devices and operating systems. A cluster is a parallel or distributed system that includes a collection of interconnected computer systems or servers that operate as a single unified unit. Members of a cluster may be referred to as nodes or systems.
Clustering may be used for parallel processing or computing (similar to massively parallel processors (“MPP”)) to simultaneously use two or more CPUs to execute an application or program.
Clustered systems may be used in storage area networks (“SANs”) using standard technologies (for example, the Fibre Channel standard based technology, incorporated herein by reference in its entirety). A SAN allows storage systems to be shared among multiple clusters or servers. Storage devices may be configured as a redundant array of inexpensive disks (“RAID”).
FIG. 1A shows a block diagram of a network 10 using a cluster 12 with servers 14 and 16. Cluster 12 communicates with the network via interface 18. A switch 20 facilitates communication between network 10 and SAN 22, which includes storage 30, 32 and 34 with RAID controllers 24, 26 and 28.
In clustered systems, processors use one or more caches (used interchangeably with memory devices) to temporarily store values that may be accessed by the processors. A clustered system uses a coherent memory system to maintain order for plural processors to perform write/read operations.
There are a number of protocols that may be used for achieving coherency for these multi-processor operations, for example, the MESI protocol, incorporated herein by reference in its entirety. The protocols allow only one processor to write to a given memory location at any given time.
To implement coherency in a clustered system, processors communicate over a command interconnect system (for example, a PCI bus). The processors use the bus to send a message either to read or write data to memory locations. When an operation is placed on the interconnect system by a processor, other units monitor (“snoop”) the operation and decide if their cache state allows the requested operation to proceed.
FIG. 3 shows a state diagram for the “MESI” protocol with four states:
Invalid (i.e., the cache entry is not valid);
Exclusive (memory is valid and only one cache has a copy of the data);
Shared (more than one cache holds a copy of the data and memory is valid); and
Modified (data has been modified and the memory copy is invalid).
Under the MESI protocol, a cache that reads a copy of a modified value switches from the Invalid state to the Shared state, and the modified response is picked up by a memory controller to allow the data to be written.
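The four states listed above may be illustrated with a minimal sketch of a per-cache-line transition function. The sketch follows the commonly described form of the MESI protocol and is not taken from FIG. 3; the function names (on_local_read, on_local_write, on_snooped_read, on_snooped_write) are hypothetical and are chosen only for illustration.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_local_read(state, others_have_copy):
    """State of a cache line after its own processor reads it."""
    if state is State.INVALID:
        # A read miss: the line becomes Shared if another cache holds it,
        # otherwise Exclusive.
        return State.SHARED if others_have_copy else State.EXCLUSIVE
    return state  # read hits in M, E or S leave the state unchanged


def on_local_write(state):
    """State of a cache line after its own processor writes it."""
    # Only one writer is allowed; other copies are invalidated by snooping.
    return State.MODIFIED


def on_snooped_read(state):
    """State of a cache line after another cache is seen reading it."""
    if state in (State.MODIFIED, State.EXCLUSIVE):
        # A Modified holder supplies the data so memory can be updated.
        return State.SHARED
    return state


def on_snooped_write(state):
    """State of a cache line after another cache is seen writing it."""
    return State.INVALID
```

For example, when one cache holds a modified value and a second cache reads it, on_local_read(State.INVALID, True) places the reader in the Shared state and on_snooped_read(State.MODIFIED) moves the original holder to the Shared state as well.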
To improve the performance of clustered systems, network processors are used with host processors that allow user-level network (“ULN”) access. FIG. 1B shows a block diagram using a network interface card (“NIC”) in a clustered system 12. System 12 uses a host processor 101 that is coupled to a memory controller 102 allowing access to a computer bus 104. This allows host processor 101 to communicate with host memory 103 and with network 108 via NIC 100.
NIC 100 includes a network interface (“NI”) processor 105 and NI memory 106. NI processor 105 communicates with a controller 107 via bus 110. Controller 107 has a micro-controller, an error correction code engine, a data flow controller, and a transmit and receive unit to allow communication to and from system 12 via network 108, as described below.
It is noteworthy that network 108 may be based on Ethernet, Fibre Channel or any other standard or proprietary protocol.
Clustered system performance is affected by software/hardware overheads. Communication using NIC 100 requires additional overhead that affects overall performance, as described below.
In a ULN architecture (FIG. 1B) each process uses a trusted entity at channel setup to provide direct and protected access to network 108. Overloading the virtual memory mechanisms of the host operating system (“OS”) typically provides the protection. By selectively mapping the message area and endpoints, processes are protected from accidental and/or malicious hampering of their network access paths.
Typically, each endpoint in a ULN has a pair of queues (a Send Queue to send messages and a Receive Queue to receive messages) and a “descriptor”. The descriptor includes information for sending/receiving messages. To send a message, the user process (running on host processor 101) selects a memory region through the OS and composes a message in the message area. A descriptor for the message is posted in the send queue. When NI processor 105 “views” the posted message, it programs its direct memory access (“DMA”) engine and the messages are copied to NI buffers (for example, NI memory 106). The notification of pending messages to and from host 101 and NIC 100 may be accomplished using a “doorbell” mechanism, which typically involves conditional busy-waiting. Similar actions are performed on the receive side.
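The send path described above may be sketched as follows. This is an illustrative model only: the names Descriptor, Endpoint, post_send and ni_service are hypothetical, the descriptor fields (offset and length) are assumed, and a plain memory copy stands in for the DMA transfer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    offset: int  # assumed field: where the message starts in the message area
    length: int  # assumed field: message size in bytes


class Endpoint:
    """One ULN endpoint: a message area, a pair of queues and a doorbell."""

    def __init__(self, area_size):
        self.message_area = bytearray(area_size)
        self.send_queue = deque()
        self.receive_queue = deque()
        self.doorbell = False

    def post_send(self, payload, offset=0):
        """Host side: compose the message, post a descriptor, ring the doorbell."""
        end = offset + len(payload)
        if end > len(self.message_area):
            raise ValueError("message does not fit in the message area")
        self.message_area[offset:end] = payload
        self.send_queue.append(Descriptor(offset, len(payload)))
        self.doorbell = True


def ni_service(endpoint, ni_buffers):
    """NI side: copy each posted message into NI buffers (stand-in for DMA)."""
    copied = 0
    while endpoint.send_queue:
        d = endpoint.send_queue.popleft()
        data = endpoint.message_area[d.offset:d.offset + d.length]
        ni_buffers.append(bytes(data))
        copied += 1
    endpoint.doorbell = False  # no posted messages remain
    return copied
```

In this model the host calls post_send and the NI side must repeatedly check the doorbell flag before calling ni_service, which is the conditional busy-waiting referred to above.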
FIG. 2 shows a top-level flow diagram for communication between host processor 101 and NIC 100. In step S200, host processor 101 allocates buffer space in NI memory 106 and copies data. In step S201, host processor 101 composes a message for NIC 100 and notifies NIC 100. At this stage NI processor 105 takes control of the “message” send operation.
In step S202, NI processor 105 processes the message. In step S203, NI processor 105 notifies host processor 101 that the message has been sent out (or injected) to network 108. Once again, a control transfer occurs, this time from NI processor 105 to host processor 101. Thereafter, host processor 101 releases the message buffers. Typically, message buffers in NIC 100 are preserved until an acknowledgement of successful reception is received from the message receiver.
As discussed above, some type of control transfer occurs between host processor 101 and NI processor 105, and it is not easy to hide the overhead of this control information exchange in common messaging paths. In conventional systems, host processor 101 polls NIC 100 for confirmation information and NI processor 105 polls for pending requests from host processor 101. This “busy-wait” process consumes the computing power of both host processor 101 and NI processor 105, when both processors could be doing other work.
The polling also requires memory bus transactions and is often performed at pre-determined intervals. Polling transactions also reduce the overall memory bus bandwidth.
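The cost of polling at pre-determined intervals may be illustrated with a small model. The function busy_wait_polls and its parameters are hypothetical, and the time units are arbitrary; the sketch only counts how many polls (each assumed to be one memory bus transaction) are issued before a completion is observed, and how long the completion goes unnoticed.

```python
def busy_wait_polls(completion_time, poll_interval):
    """Return (polls, latency) for a poller that checks at fixed intervals.

    polls   -- number of polls issued, including the one that succeeds
    latency -- time between the completion and the poll that observes it
    """
    if poll_interval <= 0:
        raise ValueError("poll_interval must be positive")
    polls = 0
    now = 0
    while True:
        polls += 1  # each poll is assumed to cost one bus transaction
        if now >= completion_time:
            return polls, now - completion_time
        now += poll_interval


# A short interval issues many polls; a long interval delays detection.
frequent = busy_wait_polls(100, 10)
sparse = busy_wait_polls(100, 30)
```

Under these assumptions a shorter interval consumes more bus transactions, while a longer interval increases the delay before the waiting processor notices the event.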
Therefore, there is a need for a method and system to efficiently synchronize host and NIC communication.