This invention generally deals with increasing the efficiency of message communications occurring at high volume and high speed among nodes in a network, in which the nodes may be central electronic computer complexes (CECs). The invention segments the transmission of packets in messages, wherein the segments are transmitted as high speed bursts of digital packets on a link in a network. This invention can significantly reduce cache castout thrashing in port packet control caches. In networks containing a common link switch to enable single port per node operation, this invention can increase overall network communication speed by maintaining the contiguity of transmitted segments within a network link switch. Conventional network link switches lose this contiguity when forwarding switch-received packets to destination nodes.
Communication networks contain N number of nodes in which each node may be a computer system, often called a Central Electronic Complex (CEC). Messages are communicated on links between the nodes of the network, and any node in the network may both send and receive messages. A node may be considered a message sender when it generates and sends a message, generally starting with a command. A node may be considered a message receiver if it receives the message. The command part of a message is followed by a response part of the message for informing the message sender of the status of the message received at the message receiver. A data part of the message is optional, and the data part may be included between the command part and the response part. The data part may be read data or write data, which are transmitted in either direction between the message sender and message receiver.
Each message is transmitted as a sequence of packets on one or more links connected between the message sender and message receiver in the network. Each packet header contains a source node ID and a destination node ID. Generally, each message starts with one or more command packets, which travel on the links in the direction from the message sender (generating the message) to the message receiver (receiving the message). After the command part of the message is transmitted, it is followed by any optional data part of the message as a sequence of one or more data packets, which may travel in either direction on the network links according to whether "read" data or "write" data is indicated in the command part of the message. "Write data" travels from the message sender (commanding node) to the message receiver (commanded node). "Read data" travels in the opposite direction from the message receiver to the message sender. The message ends when its response part is sent by the message receiver to the message sender. The response part of the message follows any optional data part of the message, but the response part follows the command part if the message has no data part. Thus, the response part is transmitted on the links in the opposite direction from the command part.
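By way of illustration only, the ordering and direction of the message parts described above may be sketched as follows. The function name message_parts and the labels "sender" and "receiver" are hypothetical and are chosen only for this sketch; they do not appear in the specification.

```python
from typing import List, Optional, Tuple


def message_parts(data_kind: Optional[str] = None) -> List[Tuple[str, str, str]]:
    """Return (part, from_node, to_node) tuples in transmission order.

    data_kind is None (no data part), "write", or "read".
    "sender" is the message sender; "receiver" is the message receiver.
    """
    # The command part always travels from the message sender.
    parts = [("command", "sender", "receiver")]
    if data_kind == "write":
        # Write data travels in the same direction as the command.
        parts.append(("data", "sender", "receiver"))
    elif data_kind == "read":
        # Read data travels in the opposite direction.
        parts.append(("data", "receiver", "sender"))
    elif data_kind is not None:
        raise ValueError("data_kind must be None, 'read', or 'write'")
    # The response part always ends the message and travels back to the sender.
    parts.append(("response", "receiver", "sender"))
    return parts
```

In this sketch a message without a data part yields only a command part followed by a response part, which matches the case in which the response part directly follows the command part.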
A link switch may or may not be used in a network to connect the nodes in a network. A link switch may contain a plurality of switch receivers and switch transmitters, which may be respectively connected to bi-directional communication links to/from the nodes in the network, such as when a respective switch receiver and switch transmitter pair are connected to the link to one of the nodes in the network. Each of the receiver/transmitter pairs may be permanently assigned to a link connected node, the receiver receiving packets from the node when the node is acting as a source node and the transmitter sending packets to the node when the node is acting as a destination node. Each node has a unique identifier (ID) in the network, and each packet has a header containing the source node ID (source ID) and destination node ID (destination ID) of its message.
In a network switch, each of a plurality of switch receivers may be concurrently receiving packets from different source nodes, and each of the switch transmitters may be concurrently sending packets to different destination nodes. Thus, each receiver always receives packets from the same source node (to which it is connected), so that all packets received by each receiver have the same source ID, but may have different destination node IDs.
Further, each transmitter in the switch searches the headers of newly received packets at all of the switch receivers looking for a packet header having a destination ID matching the destination ID assigned to the respective transmitter. Then the packet is forwarded from the receiver to the transmitter having the destination ID in a received packet, and the transmitter sends the packet from the switch to the identified destination node.
During a receiver search, a transmitter may find multiple concurrently received packets at different receivers matching the transmitter's assigned destination ID, in which all such concurrently received packets have different source IDs, but all have the same destination ID which identifies the node connected to the transmitter. The transmitter may use a packet priority control to determine which of these concurrently received packets from different nodes should be selected next and sent to the transmitter's assigned destination node. Generally in the prior art, the switch priority control uses a round-robin selection among the receivers having concurrently received packets, so that the concurrently received packets are sent sequentially by the transmitter to its connected destination node, with the result that the sequence of link-communicated packets arriving at the destination node is interleaved among different messages from different source nodes.
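The interleaving produced by a prior art round-robin selection may be illustrated by the following minimal sketch. It assumes, for illustration only, that each receiver holds a queue of packets waiting for the same transmitter; the function name round_robin_forward and the packet labels are hypothetical.

```python
from collections import deque
from typing import List, Sequence


def round_robin_forward(receiver_queues: Sequence[Sequence[str]]) -> List[str]:
    """Return the order in which one transmitter sends waiting packets.

    receiver_queues[i] lists, in arrival order, the packets waiting at
    receiver i for this transmitter's destination node.
    """
    queues = [deque(q) for q in receiver_queues]
    sent: List[str] = []
    # Keep scanning the receivers in a fixed order until all are drained.
    while any(queues):
        for q in queues:
            if q:
                # Take one packet per receiver per scan.
                sent.append(q.popleft())
    return sent
```

With two receivers holding the packets A1, A2, A3 and B1, B2 respectively, this sketch sends A1, B1, A2, B2, A3, so that neither sequence reaches the destination node as a contiguous burst.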
When sequences of packets are provided to a link, they comprise a burst of signals. If these packets are received by a network link switch, the speed of the packets in a given message may be slowed by the internal priority schemes used in prior art switches. This signal slow-down may be caused by a "fairness" priority protocol often used in link switches to select among concurrently received packets for transmission to the packet's indicated destination node. Generally, the prior art "fairness" priority either selects, for next transmission to an assigned destination node, the oldest waiting packet currently received by the switch, or polls the receivers in the switch in a predetermined order (such as "round-robin") and selects the first receiver found to have a waiting packet.
These prior types of "fairness" prioritization schemes in a link switch tend to lose the transmission continuity of fast transmitted bursts of packets. A sequence of packets may be considered a burst when they are transmitted very fast on a link without any significant pause occurring between the packets in a burst. A loss in packet transmission continuity within a burst (such as due to packet interleaving by switch prioritization) may result in a slow-down in the ultimate speed of packet communication seen by the destination nodes. This slow-down therefore may be caused by the prior art "fairness" prioritization selection process when it selects for next transmission by a transmitter the longest waiting packet in another message to the same destination node, or selects the next transmitted packet by a round-robin selection among the receivers.
Hence, while the prior art "fairness" prioritization schemes may appear to be the desirable thing to do, they have the unobvious effect of interrupting the bursts of packets being transmitted on the links, and the interruptions may occur anywhere during the packet bursts. This is because prior art link switches do not detect the burst characteristics among sequences of received packets; preserving such bursts would violate their "fairness" prioritization schemes for selecting a longest waiting packet in another message to the same destination node, or a packet at a next receiver scanned in a round-robin manner.
In each message transmitted on a link, there are various types of pauses and interruptions occurring between some of the packets transmitted within the message, such as the interruption between the end of the command part of each message and a following read data part sent on the link in the opposite direction from the command part. Further, a controlled interruption in each message occurs at the required response part of the message, when it follows either the command part or a write data part, which are transmitted in the opposite transmission direction between the message sender and message receiver. The bursts of packets transmitted between these pauses and interruptions are called "segments" in this specification.
The parts of each message of concern to this invention are the parts seen and stored by the destination node in a memory of the destination node. The link may use transmission protocols which involve link response signals at the end of each packet to confirm successful reception from the link. These link controls are not part of any message with which this invention is concerned, because such signals are thrown away as soon as their purpose of confirming successful link transmission of each packet is met. If such link protocol signals result in slowing down a transmission, they may have the effect of interrupting a planned burst of packets, which may effectively divide a planned burst into two or more unplanned "segments" in a message being handled in a network switch designed according to this invention.
Thus, the subject invention utilizes the packet prioritization selection characteristics in a network link switch containing the invention described and claimed in patent application Ser. No. 09/439,012, which requires the switch to recognize both planned and unplanned "segments" occurring in the transmission of packets in each message, and does not use the prior art "fairness" priority controls previously used by network-link switches to control the switch's re-transmission of packets.
Independent of whether or not a network link switch is being used in a network, this invention provides in each nodal port in a network special castout controls for use in a control cache provided in each port of the network. The control cache is provided at each node of the network, whether the node has a single port or multiple ports, but these castout controls are particularly effective with ports having very high traffic such as is more likely to be found with nodes having single ports connected by a network link switch. These castout controls are used with inbound segments of packets in messages being sent and received by the port.
The control caches of this invention are dynamic caches, in that they only store valid contents for messages in transmission. That is, whenever a message transmission is completed, all cache contents for that message are castout to the nodal memory, and the cache space occupied by these contents is made available for use in the transmission of another message. Each cache entry is usable by controls for a different message being transmitted. The number of cache entries in any cache is limited in number, and when all cache entries are full, the contents of an entry must be selected for castout to locations in the destination node's memory.
The control cache used by this invention should not be confused with conventional caches which store data or instructions. The control cache of this invention only stores control information which is used in controlling the flow of message data between a link buffer and a nodal memory. That is, the message data never gets into the cache, wherein the control cache only stores control information, such as a list of nodal memory addresses for storing payloads of segments of transmitted packets moved to or from a link buffer in the local port and a nodal memory which is not in a port.
Bursts of digital signals are transmitted in sequences of packets between a source node and a destination node, and these sequences pass through a network switch only when the network has a switch. The sequence of packets comprising each segment is set up at the segment's source port, which is in the source node's memory prior to transmission. After packet setup, the segment of packets is transmitted at the maximum link speed as a sequence of digital signals from the source node to the destination node. Each burst may involve a few packets or may involve a large number of packets, and a burst must end when a segment boundary is reached within the message, such as the end of the command part of the message or the end of the data part or the response part of the message. A segment may end when the source node reaches a transmission point in a sequence of packets at which the transmitting node port must momentarily stop transmitting the sequence to perform a housekeeping operation before it can continue the transmission. For example, a momentary pause may be caused by line fetches for memory accesses, or interruptions may be caused by a page fault for a disk access to obtain data to maintain a data packet transfer on the link. Also, source computer task switching may occur during transmission of a sequence of data packets and cause a temporary interruption. Thus, any of these pauses and interruptions within a message being transmitted on a link ends a segment (a high-speed burst being transmitted). If the packets of these segments go through a network switch, the switch needs to have special controls to recognize and maintain the segment "bursts".
When a network switch is used to allow the use of single port nodes, the great advantage of easy scalability of the network size is obtained. The scalability advantage in the use of a network link switch may be shown by comparing a switched network containing N number of nodes with an unswitched network containing an equal number of nodes. It is known that N number of nodes in a network may be connected via links in any of several different ways. One way is to use non-shared bi-directional links, in which the non-shared links respectively connect different pairs of the nodes in a switchless network. Simultaneous communication of messages is enabled by the non-shared links between the different pairs of nodes in the network on the different links. This switchless network has the great disadvantage of lacking ease of scalability in the size of the network if it is later decided that one or more nodes should be added in the network to its N number of nodes.
This scalability difference may be shown as follows: A switchless network requires N(N-1)/2 non-shared links in a network having N number of nodes. Then, each node in the switchless network is required to have N-1 ports that respectively connect to the unshared links in the network. The significant disadvantage in its network scalability is primarily caused by the (N-1) number of ports required in each node of the network, since the number of ports must be changed in every previously existing node in the network when the number N of nodes is increased in the network. This can only be done with great difficulty and expense.
The switched network provides a solution to the scalability problem when it connects all nodes through a single link switch, because then each node only need use a single port to send/receive all of its messages to/from all other nodes in the network through the link switch. However, the single transmission port of each node in a switched network must operate at a much faster transmission speed than each port in a switchless network when communicating the same number of messages in a network, because each of the single ports is required to handle, on average, N times the number of messages per port in a switchless network. This increased message speed and traffic for each port in the switched network requires each port to operate at a communication rate that is N times faster than each port in a switchless network. Thus, the faster link transfer rates required in switched networks may strain the ability of the single nodal ports to handle the greatly increased message transmission rates and added volume of messages, which indicates the added efficiency provided by this invention is particularly useful in the single ported nodes of switched networks.
FIG. 1 shows an example of a switchless network having four nodes (i.e. four computer systems) 101, 102, 103, 104 which are fully interconnected by links 111, 112, 113, 114, 115, 116 without using any link switch. Each port connects its node to only a single other node in the network, so that each node requires multiple ports to connect multiple other nodes in a network. Full connectivity to all nodes in the network of FIG. 1 is obtained through three ports at each node. For example, node 1 has the three ports 121, 122, 123, and a corresponding three ports are found likewise in each of the other three nodes 2, 3 and 4 in the network. In the switchless network configuration of FIG. 1, each port can only communicate to one other node.
N nodes are found in a network of the type shown in FIG. 1, and the N nodes require N*(N-1)/2 links, in which each node requires N-1 ports connected to N-1 links. Thus, the 6 links in FIG. 1 connect the 4 nodes by each node having 3 ports connected to 3 of the 6 links. As the number of nodes, N, increases in a network, the number of links grows as the square of N. For example, a network of 16 nodes would require 120 links, and each node would require 15 ports. The switch-free network arrangement in FIG. 1 clearly becomes more difficult to implement as the N number of nodes in the network increases, due to an N squared increase in the number of links required, and a linear increase in the required number of ports per node.
For these reasons, this invention prefers the link-switched environment in a network of the type shown in FIG. 2 to overcome the scalability problem encountered by the switchless network of the type shown in FIG. 1. FIG. 2 has a communication link switch 201 connected between four nodes 211, 212, 213, 214, each node being a computer system, which may be the computer type provided for each node in FIG. 1, and the computer system of each node may have a single shared memory and any number of central processors.
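The link and port counts stated above for a switchless network may be checked with the following short sketch. The function name switchless_requirements is hypothetical and is used only for this illustration.

```python
from typing import Dict


def switchless_requirements(n: int) -> Dict[str, int]:
    """Return the link and per-node port counts for a fully connected
    switchless network of n nodes."""
    if n < 2:
        raise ValueError("a network needs at least two nodes")
    return {
        # One non-shared link per pair of nodes: N*(N-1)/2.
        "links": n * (n - 1) // 2,
        # Each node needs one port per other node: N-1.
        "ports_per_node": n - 1,
    }
```

For n = 4 the sketch gives 6 links and 3 ports per node, and for n = 16 it gives 120 links and 15 ports per node, in agreement with the figures given above.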
Hence in FIG. 2, only one port is required per node regardless of the number N of nodes in the network. Then the total number of ports and links in the network may be equal to the number N of nodes in the network. Thus, N number of links may connect N number of nodes in the switched network of FIG. 2.
Also, the port control provided for each single port per node in FIG. 2 is significantly different from the port control provided for each of the multiple ports per node in the network configuration of FIG. 1. Each of the four links 221, 222, 223, 224 shown in FIG. 2 is connected to the same link switch 201.
Accordingly, the number of links in a switched network of FIG. 2 increases linearly with an increase in the number N of nodes in the network. Also, N is the total number of links in the network. Hence in FIG. 2, each node requires only one port 231, regardless of the total number N of nodes and the total number N of links in the network. In the detailed embodiment described herein, the link switch contains N number of receivers and N number of transmitters, and each node in the network is uniquely connected to one receiver and one transmitter in the switch.
While the switched network of FIG. 2 reduces the number of hardware links and ports to one per node, the complexity of the network is increased in several ways. First, a hardware link switch 201 contains novel internal packet-priority-selection controls. Second, novel castout controls are provided for each single port per node to enable the port to recognize segment characteristics occurring in its communications with all other N-1 nodes in the network (e.g. the three other nodes in FIG. 2). A consequence in switched networks of the type in FIG. 2 is that the one port per node is required to handle an average of N-1 times the amount of message state information, when compared to the amount of message traffic handled by each port in the switchless network shown in FIG. 1. Nevertheless, a significant cost improvement is obtained by the switched network in FIG. 2 over the network in FIG. 1 for networks having a large number of nodes, because hardware ports and their installation involve much more expense than the added speed and storage required in the ports of the network in FIG. 2.
For all of these reasons, this invention prefers the network of FIG. 2, primarily due to the comparative reduction in the required number of ports per node as the number of nodes is increased in a network. The number of ports in the network of FIG. 2 increases linearly as the number of nodes increases, compared to the nonlinear increase (by the square of N) in the switchless network of FIG. 1 having N*(N-1) ports. Then, the hardware cost savings of the network in FIG. 2 varies with: {N*(N-1) ports - N ports - link switch}, and these savings are significant for networks having a large number N of nodes.
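The port-count portion of this comparison may be illustrated by the following sketch, which counts ports only and does not model the cost of the link switch itself. The function names are hypothetical.

```python
def total_ports_switchless(n: int) -> int:
    """Total ports in a switchless network: N nodes, each with N-1 ports."""
    return n * (n - 1)


def total_ports_switched(n: int) -> int:
    """Total nodal ports in a switched network: one port per node."""
    return n


def ports_saved(n: int) -> int:
    """Nodal ports saved by the switched network: N*(N-1) - N.

    The cost of the link switch, which offsets part of this saving,
    is not modeled in this sketch.
    """
    return total_ports_switchless(n) - total_ports_switched(n)
```

For the four-node networks of FIG. 1 and FIG. 2 the sketch gives 12 ports against 4 ports, a difference of 8 ports; for a network of 16 nodes the difference is 224 ports.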
This invention defines and handles segments in its transmitted messages to place the occurrence of pauses and interruptions occurring during any message between the defined segments in the message. This allows each segment to be transmitted as a high speed signal burst. Each segment is handled as a transmission unit at the source node, at the destination node, and in any network switch. The segments are detected in the network switch as its packets are received, and the switch can interleave the detected segments of concurrent messages having the same destination node while maintaining forwarding speed for the packets in each message. Unexpected pauses and interruptions exceeding a time-out period occurring within any transmitted segment are handled efficiently in the switch. At the destination node of each transmitted packet, this invention enables a port to detect the segments (i.e. in commands, data, and responses in each received message), and a port cache controls the assembly of the received messages while reducing cache castout thrashing to enable maximum reception speed for the messages.
The node transmitting packets is the "source node," and the node receiving packets is the "destination node." The source node and destination node IDs are contained in each packet header. The "message sender" is the node that transmits the command packets and optional "write" data packets. The "message receiver" is the node that transmits any optional "read" data packets followed by the response packets.
It is an object of this invention to reduce castout thrashing of messages controlled in nodal caches for retaining the messages in a nodal main memory. The castout reduction enables an increase in the rate and number of messages which may be handled at each nodal port. Castout thrashing occurs when incomplete messages are castout from the portal cache to make space for a new message, and the castout message will later have to be re-fetched into the cache to receive more packets for its completion.
It is another object of this invention to increase the speed of message communications in a network of nodes using portal caches to assemble link-communicated messages by using novel priority control processes in each nodal port using a port cache for controlling the assembly of received messages. If a network link switch is used in the network, the switch must maintain the segments as they pass through the switch in the manner taught in concurrently-filed patent application Ser. No. 09/439,012. In that specification, a new prioritization method is used in network switches that prioritizes internal switch selection of switch received packets for transmission to the destination nodes among packets currently received by the network switch. This network switch prioritization enables the switch to avoid a reduction in the transfer rate of packets through the switch as occurs in prior switches using prior prioritization methods. The switch prioritization sends the newest (i.e. most recently received) packet to the switch transmitter connected to the destination node identified in the packet, regardless of whether the packet is selected out-of-order relative to other prioritization protocols such as FIFO, LIFO, round-robin, "fairness", etc. Its "newness protocol" enables messages communicated on links in the network to be assembled in portal caches at a reduced castout rate to improve the message handling efficiency in the network. A reduced castout rate reduces "castout thrashing" and thus the number of cache fetch operations used for messages being assembled in the port. "Castout thrashing" is caused by unnecessary castouts causing associated re-fetch operations.
The packets are transmitted in segments on links from a source node to a link-switched destination node. Each segment is transmitted by the source node to a connected link as a burst of digital signals.
Each transmitted message is initiated by a command provided to the connected link by the message sender. The command part of a message is a sequence of packets comprising the first segment of each message. Each segment is a burst of signals comprising one or more packets sent on the link to a network switch. The last packet in each segment is indicated by either: a last-packet indicator in each packet of the segment, or by a last-packet count in the first packet of each segment.
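The two ways of marking the last packet of a segment may be sketched as follows. The packet representation (a payload paired with either a last-packet flag or a packet count) and the function names are assumptions made only for this illustration; the specification does not define a packet layout.

```python
from typing import List, Optional, Sequence, Tuple


def split_by_flag(packets: Sequence[Tuple[str, bool]]) -> List[List[str]]:
    """Split a packet stream into segments using a last-packet
    indicator carried in every packet: (payload, is_last)."""
    segments: List[List[str]] = []
    current: List[str] = []
    for payload, is_last in packets:
        current.append(payload)
        if is_last:
            segments.append(current)
            current = []
    if current:
        raise ValueError("stream ended in the middle of a segment")
    return segments


def split_by_count(packets: Sequence[Tuple[str, Optional[int]]]) -> List[List[str]]:
    """Split a packet stream into segments using a packet count carried
    in the first packet of each segment: (payload, count or None)."""
    segments: List[List[str]] = []
    i = 0
    while i < len(packets):
        _, count = packets[i]
        if count is None or count < 1:
            raise ValueError("first packet of a segment must carry a count")
        if i + count > len(packets):
            raise ValueError("stream ended in the middle of a segment")
        segments.append([payload for payload, _ in packets[i:i + count]])
        i += count
    return segments
```

Both functions recover the same segment boundaries from the same stream; they differ only in where the boundary information is carried.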
Each packet in the same segment has the same source node indication and the same destination node indication. Each transmitter in the switch stores the source identifier of its last transmitted packet, and the stored source identifier is reset to a null value when the last packet of the current segment is transmitted.
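A minimal sketch of a switch transmitter that uses the stored source identifier to keep a segment contiguous is given below. It is an illustration under stated assumptions and not the priority control of the referenced application: the choice of the lowest waiting source ID when no segment is in progress, the decision to wait when the current segment has no packet available, and the names Packet and Transmitter are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class Packet:
    source_id: int
    dest_id: int
    last_in_segment: bool
    payload: str = ""


class Transmitter:
    """One switch transmitter, assigned to a single destination node."""

    def __init__(self, dest_id: int) -> None:
        self.dest_id = dest_id
        # Source ID of the last transmitted packet; None is the null value,
        # meaning no segment is currently in progress.
        self.current_source: Optional[int] = None

    def select(self, waiting: Dict[int, Deque[Packet]]) -> Optional[Packet]:
        """Pick the next packet to send.

        waiting maps a source ID to the packets received from that source
        for this transmitter's destination, in arrival order.
        """
        if self.current_source is not None:
            queue = waiting.get(self.current_source)
            if not queue:
                # Assumption: wait for the rest of the segment in progress.
                return None
            source = self.current_source
        else:
            # Assumption: with no segment in progress, take the lowest
            # source ID that has a packet waiting.
            candidates = [s for s, q in sorted(waiting.items()) if q]
            if not candidates:
                return None
            source = candidates[0]
        packet = waiting[source].popleft()
        # Reset to the null value after the last packet of the segment.
        self.current_source = None if packet.last_in_segment else packet.source_id
        return packet
```

When two sources each have a two-packet segment waiting, this sketch sends both packets of the first segment before either packet of the second, rather than alternating between the sources.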
When any packet is received by any network switch, the switch associates the packet with the switch transmitter assigned to the destination node identified in the packet. Each received packet is transferred by internal priority controls from its receiver to its assigned transmitter. Each of plural transmitters in a network switch may be operating independently of each other to transmit packets to connected destination nodes. Each switch transmitter enforces the special internal switch priorities to maintain segments passing through the switch, since otherwise the priority operations within the switch may unknowingly break up the received segments into smaller segments, resulting in a slowing of the transmission of a message.
This invention provides a new castout replacement process for use in communication caches of the nodes of a switched network for assembling received (inbound) messages. The new castout replacement protocol selects for castout the newest (i.e. most recently serviced) cache entry in the nodal communication cache when it is most likely to have the longest wait before it again needs to be in the cache for its next servicing when it will next receive a data segment or a response segment. The selection for castout of the newest (i.e. most recently serviced) cache entry is enabled by the new priorities provided in the network switch which focus the resources of the switch on the most recent (newest) message segment having a transmitted packet.
The sequence of packets comprising each transmitted message (commands, data, and responses) is segmented by the source node. The source node attempts to send the packets of each message as fast as it can by providing the least possible delay between its transmitted packets. However, significant delay cannot be avoided between some packets in a message. Whenever a significant delay is encountered, the source node ends the current segment to allow other messages to use resources which would otherwise not be available during that delay time. Delays occur in each message at its source node, for example, between the command segment and any first write data segment, between data segments when a new memory access needs to be made for a new data line, and between the last read or write data segment and the response segment. Thus, the data of a message may be divided into segments to allow immediate transmission of small chunks of data (typically a memory line). If all the data in a message had to be transmitted contiguously, all of the data would need to be fetched before any of it could be transmitted, and this would add latency to the operation of the source node.
The castout operation is to a Message Control Block (MCB) in an "MCB Table" in the main memory of the respective node. The MCB Table contains all MCBs for all messages sent and received by the respective node. The MCBs may be located in the MCB Table in any predetermined manner, such as by indexing the MCB slots therein according to the "source ID, message ID" found in each packet of each message received and sent by the node. The communication cache is located in the port of the node in a local memory of the port, which need not be part of the main memory of the respective node (i.e. a computer system).
This invention is preferably used in an environment that allows for the automatic expansion of the number of nodes in the network in which all network communication through a network switch uses only a single port in each node of the network. Network expansion only requires adjustment in the size of the MCB Table in the main memory of each node in the network. Node expansion in the network does not affect the hardware in any port used by the respective nodes in the network and need not affect the cache structure in each node, regardless of the number of nodes in the network. Expanding the number of nodes in a network requires that the MCB Table have a slot for each node in the network, and increasing the number of nodes in the network then requires another slot be added to the MCB Table for each added node.
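One possible way of indexing the MCB slots by the "source ID, message ID" pair is sketched below. The fixed limit messages_per_node and the row-major layout are assumptions introduced only for this illustration; the specification permits any predetermined manner of locating the MCBs.

```python
def mcb_slot(source_id: int, message_id: int, messages_per_node: int) -> int:
    """Return the index of the MCB slot for (source_id, message_id).

    Assumes, for illustration only, that node IDs start at 0 and that
    each source node may have at most messages_per_node messages with
    IDs 0 .. messages_per_node - 1.
    """
    if source_id < 0:
        raise ValueError("source_id must be non-negative")
    if not 0 <= message_id < messages_per_node:
        raise ValueError("message_id out of range")
    # One contiguous group of slots per source node.
    return source_id * messages_per_node + message_id
```

Under this layout, adding a node to the network appends one further group of slots to the table and leaves the slot indexes of the existing nodes unchanged.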
Each message control block contains a plurality of different segment type identifiers in a plurality of pointers (addresses) which locate areas in the computer memory for storing the commands, data, and responses in the payloads of the packets of a message received or sent by the port of each node.
This MCB structure is copied into any allocated empty MCB Entry in the associated port cache when the MCB is activated by the transmission of a message over the network to or from the respective node. The associated port cache may have a directory of tags, in which the tags are respectively associated with MCB Entries in the cache. Each tag includes a field for a source ID, a message ID, and an empty bit to indicate if an MCB Entry is empty or contains a valid MCB.
When any packet of an inbound message is being forwarded from the switch, any non-empty MCB Entry assigned to that message in the cache is found by comparing the source ID and message ID in each packet of the message with the non-empty tags, and a compare equal indicates that the tag having an associated MCB Entry in the cache is found. If no compare equal tag is found, a cache miss occurs, and any empty MCB Entry is assigned to the tag and the source ID and message ID in each packet of the message is written into the tag and its empty bit is set to the non-empty state. However, if no empty MCB Entry is found in the cache, the newest MCB Entry is castout to the MCB Table, its associated tag is set to the empty tag state, and this tag and its associated MCB Entry are reassigned to the packet being currently forwarded.
The link-switched network arrangement enables each node's single port to be easily adapted to expansion of the number of nodes in the network, which would not be possible in a non-switched network having N-1 ports per node, since increasing N would require more ports for every node in the network (a burdensome hardware increase).
Accordingly, this invention uses contiguous packet transmission within each segment of a message to control its segmented message communications. The replacement operations in the destination node's cache are also driven by the message segmentation to provide a more efficient method of handling the communication of messages between nodes in a network than the previously allowed switching between "packet transmissions". An improvement in communication efficiency results in the use of this invention.
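The tag compare, cache miss, and newest-entry castout steps described above may be sketched as follows. The class name PortControlCache, the use of a service counter to identify the newest (most recently serviced) entry, and the dictionary standing in for the MCB Table are assumptions made only for this illustration.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, int]  # (source ID, message ID)


class PortControlCache:
    """Sketch of a port control cache with newest-entry castout."""

    def __init__(self, num_entries: int, mcb_table: Dict[Key, dict]) -> None:
        # A tag of None stands for the empty bit being set.
        self.tags: List[Optional[Key]] = [None] * num_entries
        self.entries: List[Optional[dict]] = [None] * num_entries
        self.last_serviced: List[int] = [0] * num_entries
        self.clock = 0
        self.mcb_table = mcb_table  # stands in for the MCB Table in nodal memory
        self.castouts = 0

    def _castout(self, idx: int) -> None:
        """Write the entry back to the MCB Table and mark its tag empty."""
        self.mcb_table[self.tags[idx]] = self.entries[idx]
        self.tags[idx] = None
        self.entries[idx] = None
        self.castouts += 1

    def lookup(self, source_id: int, message_id: int) -> int:
        """Return the index of the MCB Entry for an inbound packet."""
        key = (source_id, message_id)
        self.clock += 1
        if key in self.tags:
            idx = self.tags.index(key)  # compare equal: cache hit
        else:
            if None in self.tags:
                idx = self.tags.index(None)  # use any empty entry
            else:
                # No empty entry: cast out the newest (most recently
                # serviced) entry and reuse its tag and entry.
                idx = max(range(len(self.tags)), key=lambda i: self.last_serviced[i])
                self._castout(idx)
            self.tags[idx] = key
            # Copy the MCB from the MCB Table into the cache entry.
            self.entries[idx] = dict(self.mcb_table.get(key, {}))
        self.last_serviced[idx] = self.clock
        return idx
```

In a two-entry cache that services messages (1, 10) and then (2, 20), the arrival of a packet for a third message (3, 30) casts out the entry for (2, 20), the one most recently serviced, and leaves the entry for (1, 10) in the cache.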