1. Field of the Invention
The invention relates to apparatus and an accompanying method for a communications protocol in generally a distributed, and particularly a highly parallel, multi-processing environment, for communicating data of arbitrarily varying strides between separate processors and generally without the need for intermediate data storage.
2. Description of the Prior Art
With the continual evolution and commercial availability of increasingly powerful, sophisticated and relatively inexpensive microprocessors, distributed, and particularly massively parallel, processing is being perceived in the art as an increasingly attractive vehicle for handling a wide spectrum of applications, such as transaction processing, heretofore processed through conventional mainframe computers.
In general, distributed processing involves extending a processing load across a number of separate processors, all collectively operating in a parallel or pipelined manner, with some type of interconnection scheme being used to couple all of the processors together in order to facilitate message passing and data sharing thereamong. In the past, distributed processing architectures, of which many variants exist, generally entailed use of a relatively small number of interconnected processors, typically two and often fewer than ten separate highly sophisticated central processing units as would be used in a traditional mainframe or super-mini-computer, in which these processors would be interconnected either directly through, e.g., an inter-processor bus, or indirectly through, e.g., a multi-ported shared memory, such as a shared direct access storage device (DASD), or other communication path. By contrast, in massively parallel processing systems, a relatively large number, often in the hundreds or even thousands, of separate, though relatively simple, microprocessor based processing elements are inter-connected through a communications fabric formed of a high speed packet network in which each such processing element appears as a separate node on the network. In operation, the fabric routes messages, typically in the form of data packets, from any one of these processing elements to another to provide communication therebetween. Each of these elements typically contains a separate microprocessor and its associated support circuitry, the latter being typified by, for example, random access memory (RAM), for program and data storage, and input/output (I/O) circuitry. Based upon the requirements of a particular system, each element may also contain read only memory (ROM), to store initialization ("boot") routines as well as configuration information, and/or other circuitry.
Each distributed processing element, particularly in a massively parallel processing system, also contains a communication sub-system that interfaces that element to the communications fabric. Within each element, this sub-system is formed of appropriate hardware circuitry, such as a communications interface within the I/O circuitry, and associated controlling software routines, the latter being invoked by an application executing within that one element in order to communicate with any other such processing element in the system.
Improved system performance, particularly with attendant decreases in circuit complexity, cost and/or system size, is a primary goal in the design of any distributed processing environment. In the context of massively parallel processing, performance improvements can result from, inter alia, decreasing message transit time through the communications fabric and decreasing processing time undertaken within each processing element and necessitated by overhead tasks, such as handling inter-processor communication. Eliminating unnecessary circuitry within each element will reduce elemental complexity and concomitantly the physical size and cost of both each element and advantageously the overall system. Attaining improvements of this sort is particularly important in massively parallel processing environments given the substantial number of inter-processor messages that simultaneously pass through the system at any one time as well as the sheer number of individual processing elements involved and their combined space requirements. Accordingly, the following discussion will address the need for these improvements as they arise in the context of massively parallel processing systems.
As one can appreciate, message passing forms an integral component of any massively parallel processing system. To yield proper system performance, the communications fabric must provide a requisite capacity, based upon the number of separate processing elements connected thereto, to simultaneously route, at any time and without contention, an anticipated peak load of inter-processor messages. In addition, each processing element itself must spend only a minimal amount of time in handling its overhead tasks, such as those, for example, required by the communication sub-system to transmit and receive messages through the communications fabric. Clearly, an insufficient communications throughput or a reduced amount of application processing time available at each processing element will adversely affect system throughput, possibly to a point of sharply reducing the attractiveness of using a massively parallel processing system in a given application. Fortunately, various different architectures for packet networks have been proposed in the art that, at least for the time being, should afford a sufficiently high transfer rate to handle the needed peak message traffic which is expected to occur in a typical massively parallel processing system. However, as will be seen below, the art has not adequately reduced overhead processing time associated with message passing undertaken by the communication sub-system. With this in mind, the bandwidth of a high speed communications fabric tends to be much less of a limitation on application throughput, and hence overall system performance, than does the overhead time required by each processing element to communicate a message.
Furthermore, reducing the physical size of each processing element is also of paramount importance, particularly where hundreds or thousands of separate processors are used in a single system, else that system would simply become too large, require an excessive amount of power and generally become quite impractical.
An underlying solution that would help attain this goal would be to transfer messages, particularly data, among processors in a manner that both is increasingly faster and uses much less memory than has been conventionally taught in the art. In this regard, system throughput can be significantly increased if each processing element could spend considerably less overhead time in transmitting and receiving messages. Furthermore, if intermediate memory requirements, as discussed below, associated with message passing could be reduced, even hopefully eliminated, then each processing element would require less circuitry and hence consume less physical space and entail less cost than heretofore needed.
In order to fully appreciate the significance of attaining this solution, one must first understand how messages, particularly data, are conventionally communicated between individual processing elements and the specific problems associated therewith.
Generally speaking, data takes many forms; however, physical memory takes only one form. In this regard, data can be structured in any of a wide variety of ways. For example, a data structure can be a simple list, i.e. a vector, of numbers with each number in the list simply following an immediately preceding number therein: one number directly after another. In this case, the data is organized in so-called "stride one", or a "linear" mapping, i.e. each data element successively and consecutively follows a previous one with no gaps therebetween. If data is organized in matrix form, stored column-wise and accessed along a column, then stride one data can be viewed as a column of a matrix with movement being from one data element to the next downward through each column. In essence, each successive element in the column can be reached by simply increasing an address for an immediately preceding element by "one". Linear mapping occurs within each column. If the matrix has "n" rows, then each row is said to be of constant stride "n". In this instance, to access successive elements in a row, a complete column of data, i.e. "n" data locations ("addresses"), must be skipped over to reach each successive data element. Here, the stride would be constant and regular at value "n". Matrices can also be sparse, i.e. a relatively small number of non-zero valued data elements is dispersed throughout a relatively large matrix with zero valued elements located everywhere else. An identity matrix having elements only along its diagonal is one such example. With sparse matrices, and depending upon the matrix location of each non-zero element, these elements (except in special cases such as an identity matrix, whose diagonal elements lie at a constant stride) will likely have varying (non-constant) and irregular strides, in that the address increment needed to access each successive element, as measured from the location of a prior element, can and often does vary on an irregular basis throughout the entire matrix.
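By way of illustration only, the stride concepts described above can be sketched as follows. The sketch assumes a hypothetical 3-row by 4-column matrix stored column-wise in a flat list that stands in for physical memory; the function names and the sparse addresses are illustrative assumptions and are not taken from any system discussed herein.

```python
# Hypothetical illustration: a 3-row by 4-column matrix stored column-wise
# in a flat, stride-one memory (a Python list stands in for physical memory).
n_rows, n_cols = 3, 4
memory = [10 * c + r for c in range(n_cols) for r in range(n_rows)]

def column_addresses(col):
    """Addresses of successive elements down one column (stride one)."""
    base = col * n_rows
    return [base + r for r in range(n_rows)]

def row_addresses(row):
    """Addresses of successive elements along one row (constant stride n_rows)."""
    return [row + c * n_rows for c in range(n_cols)]

def strides(addresses):
    """Address increment between each pair of successive elements."""
    return [b - a for a, b in zip(addresses, addresses[1:])]

# Assumed addresses of the non-zero elements of a hypothetical sparse matrix.
sparse_addresses = [1, 2, 7, 9, 10]

print(strides(column_addresses(1)))  # [1, 1]
print(strides(row_addresses(0)))     # [3, 3, 3]
print(strides(sparse_addresses))     # [1, 5, 2, 1]
```

The first two results show constant strides of one and "n", respectively, while the third shows an increment that varies irregularly from one element to the next.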
Furthermore, not only can data be organized to possess irregular strides, these strides can be arbitrarily complex. In that regard, the location of the data in the matrix can be a function of the data itself. In that scenario, the location of a successive data element may not be readily determined until an immediately prior data element is found and the function calculated. For simplicity, arbitrarily varying strides will hereinafter be generically defined to include complex data strides.
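A complex stride of this kind might be sketched as follows. The increment function shown is purely hypothetical and was chosen only so that the location of each successive element depends on the value of the element just found; the memory contents are likewise assumed.

```python
def next_increment(value):
    """Hypothetical function: the address increment depends on the data itself."""
    return (value % 4) + 1

def walk(memory, start, count):
    """Visit `count` elements, computing each next address from the prior value."""
    addresses = []
    address = start
    for _ in range(count):
        addresses.append(address)
        # The next location cannot be known until this element has been read.
        address = (address + next_increment(memory[address])) % len(memory)
    return addresses

memory = [7, 0, 3, 5, 2, 9, 4, 1, 6, 8, 11, 13]  # assumed contents
visited = walk(memory, 0, 5)
print(visited)
print([b - a for a, b in zip(visited, visited[1:])])
```

Because each address is computed from the previously fetched value, the sequence of addresses cannot be generated in advance of the accesses themselves.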
Physical memory is only organized in stride one; it takes no other form. Each memory location immediately follows a prior location and can be accessed by simply incrementing the memory address of the prior location by one. In this regard, memory location zero is reached by incrementing the address of the last memory location by one thereby causing the memory address to "wrap".
Within physical memory and owing to various operating constraints (none of which is particularly relevant here), data cannot always be stored simply in successive memory locations; in fact, individual items of data can be stored essentially anywhere in physical memory and are often scattered somewhat throughout a given area in memory. As such, data is often stored in the form of a linked list. In particular, each element in such a list typically contains two fields: a data field and an address pointer to the location in memory of the next successive element in the list. In a simple one-dimensional data structure, such as a vector, each item in the list contains one data field and a single address pointer to the next element in the list, and so forth until the last element is reached which contains a corresponding data element but with a "null" valued pointer. Alternatively, in the case of a two-dimensional matrix data structure, depending upon the ordering of the matrix elements, each list element may contain two pointers: one designating a list element that contains the next successive data element in a horizontal matrix direction and the other pointer designating a different list element that contains the next successive data element but in a vertical matrix direction. In this manner, a sparse matrix can be stored efficiently: to conserve memory space, only its non-zero elements need be stored in physical memory.
Unfortunately, inasmuch as individual list elements in a data structure can be scattered throughout memory, the stride associated with accessing each of these individual elements in that structure may, a priori, be unknown. Accordingly, this necessitates that in order to access a given list element in a structure, all prior elements must be accessed in order to obtain the pointer to the desired element. In contrast, where data is organized at a constant (regular) stride in memory, or preferably with stride one, direct memory access (DMA) can be used to provide highly efficient data transfer since constant address increments can be used to access each successive data element. However, as will be seen below, owing to the need to calculate a proper varying memory increment for each data access, use of DMA becomes very inefficient for transferring data stored with arbitrarily varying strides.
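The contrast between these two forms of access can be sketched as follows, under the assumption of a flat list standing in for physical memory, list elements held as (data, pointer) pairs, and a hypothetical "null" pointer value of -1. None of these names or values is drawn from any particular system.

```python
NULL = -1  # hypothetical "null" pointer value

# Linked list: (data, next_address) pairs scattered through memory.
linked_memory = [None] * 12
linked_memory[5] = (3.0, 2)
linked_memory[2] = (1.5, 9)
linked_memory[9] = (4.0, 6)
linked_memory[6] = (2.5, NULL)
head = 5

def list_element(memory, head, index):
    """Return (data, accesses); every prior element must be read first."""
    address = head
    accesses = 0
    for _ in range(index):
        _, address = memory[address]
        accesses += 1
        if address == NULL:
            raise IndexError("list is shorter than requested index")
    data, _ = memory[address]
    return data, accesses + 1

# Constant stride: the same four values stored every third location.
strided_memory = [3.0, 0, 0, 1.5, 0, 0, 4.0, 0, 0, 2.5, 0, 0]

def strided_element(memory, base, stride, index):
    """A single access suffices, since the address is computed directly."""
    return memory[base + index * stride]

print(list_element(linked_memory, head, 3))      # (2.5, 4)
print(strided_element(strided_memory, 0, 3, 3))  # 2.5
```

Reaching the fourth list element requires four memory accesses, whereas the constant stride case requires one access at an address formed from a fixed increment.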
Given the idiosyncrasies associated with storing data of arbitrarily varying strides in physical memory and accessing the data therefrom, complications and inefficiencies arise whenever that data is to be communicated from memory associated with one processing element to memory associated with another such element. Specifically, a data message by its very nature contains a linear succession of data values--one value follows the next; hence, a message only contains stride one data--regardless of the actual structure of the underlying data. Consequently, as conventionally taught in the art, each node in a packet communication system typically contains two separate I/O buffers: an output (transmission) buffer into which an outgoing message is fully assembled prior to transmission and an input (receiving) buffer into which an incoming message is completely built prior to its subsequent use. The use of such buffers, in the general context of packet communications, is typified by the systems described in U.S. Pat. Nos. 5,151,899 (issued to R. E. Thomas et al on Sep. 29, 1992); 4,858,112 (issued to B. G. Puerzer et al on Aug. 15, 1989) and 4,555,774 (issued to L. Bernstein on Nov. 26, 1985).
Hence, to transfer data of an arbitrarily varying stride, particularly from one processing element to another in a massively parallel processing system, the communication sub-systems, in the transmitting and receiving processing elements, are required, due to the need to route the data through stride one I/O buffers, to convert the data to and from stride one. In particular, consider the following example, where a data structure of, e.g., an arbitrarily varying stride x (where x is a varying integer), is to be transmitted by application program A executing at a source processing element to application program B executing at a destination processing element. As conventionally taught, in response to a command issued by application program A, a communication sub-system employed within the source processing element would successively access, from memory, and copy each and every item of the data structure that has been stored in a linked list through a so-called "gather" operation. Through this operation, successive linear locations in a stride one output buffer would be filled, on a one-for-one basis, with successive items of the data in this linked list. Once the entire data structure has been stored within this buffer, the communication sub-system would then append appropriate message header and trailer information to the buffer contents to form a complete packet and thereafter transmit the entire packet through the communications fabric to the destination processing element. In response to the incoming packet, the communication sub-system executing at the destination element would then serially fill an input buffer with the complete message as it is received. 
Once this buffer has captured the entire message, this sub-system would perform a so-called "scatter" operation to re-create the linked list in memory for subsequent use by application program B (though, due to memory constraints thereat, at typically different memory locations and often at a different arbitrarily varying stride than those used at the source processing element). Specifically, this "scatter" operation entails individually copying each and every data item from the input buffer and successively storing that item in the destination processing element memory such that the incoming data is distributed throughout the memory with stride y (with y being an arbitrarily varying integer generally not equaling x) as required by application program B.
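The conventional sequence described above, namely "gather", packet formation, reception and "scatter", can be sketched as follows. This is a minimal illustration only: the packet layout, the checksum, the addresses and all function names are hypothetical assumptions and do not represent the format used by any particular communications fabric.

```python
def gather(memory, addresses):
    """Copy scattered items, one for one, into a stride-one output buffer."""
    return [memory[a] for a in addresses]

def build_packet(destination, output_buffer):
    """Append an assumed header and trailer to the buffer contents."""
    return {"header": {"destination": destination, "length": len(output_buffer)},
            "payload": list(output_buffer),
            "trailer": {"checksum": sum(output_buffer) % 65536}}

def receive(packet):
    """Serially fill a stride-one input buffer with the incoming message."""
    input_buffer = []
    for item in packet["payload"]:
        input_buffer.append(item)
    return input_buffer

def scatter(memory, addresses, input_buffer):
    """Copy each buffered item to its own location in destination memory."""
    for address, item in zip(addresses, input_buffer):
        memory[address] = item

source_memory = [0] * 16
source_addresses = [1, 4, 5, 11]        # stride x varies: 3, 1, 6
for address, value in zip(source_addresses, [21, 22, 23, 24]):
    source_memory[address] = value

destination_memory = [0] * 16
destination_addresses = [2, 3, 9, 14]   # stride y varies: 1, 6, 5

output_buffer = gather(source_memory, source_addresses)
packet = build_packet("B", output_buffer)
input_buffer = receive(packet)
scatter(destination_memory, destination_addresses, input_buffer)
print([destination_memory[a] for a in destination_addresses])  # [21, 22, 23, 24]
```

In this sketch, every data item is copied once into the output buffer at the source and once out of the input buffer at the destination, in addition to the transfer itself.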
Unfortunately, processing time is consumed in performing "gather" and "scatter" operations. Now, regardless of where these overhead operations are actually performed within each processing element, i.e. in either the communications sub-systems (as discussed above) or by the application programs themselves executing in these elements, the amount of processing time required by these operations decreases the processing time that is otherwise available at that element, hence decreasing its application throughput.
In fact, the overhead associated with communicating large amounts of data through stride one I/O buffers tends to seriously degrade overall system performance. In particular, at a transmitting end, where data is organized in an arbitrarily varying, and particularly a complex, stride, a significant amount of overhead processing time can be consumed in just calculating the proper address increments to access each successive data item from memory. In the absence of sophisticated DMA circuitry (which, for reasons of simplifying circuitry and reducing cost, is generally not used in a massively parallel processing element), this overhead can be substantial for a large amount of data and thus inject substantial latency into the system. Furthermore, serious delays can occur on the receiving end owing to the finite size of the input buffer. In particular, if a large amount of data that exceeds the size of an input buffer is to be received, then a so-called "rolling window" technique must be used to transmit only as much data, in any one message, as will fill the input buffer. To prevent congestion and possible over-writing, the receiving processing element must utilize flow control, in conjunction with the transmitting element, in order for the receiving element to fully "scatter" its received data and thus fully empty its input buffer before receiving any further data from the transmitting element. Consequently, in practice, the limited size of the input buffer within each processing element also limits elemental and often system throughput.
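The effect of a finite input buffer can be sketched with a simplified, single-threaded model of the "rolling window" technique. The model assumes that the transmitting element sends no more than one buffer of data per message and does not advance until the receiving element has emptied its buffer; the function name and buffer size are hypothetical.

```python
def transfer_with_window(data, buffer_size):
    """Send `data` in messages no larger than the receiver's input buffer."""
    received = []
    messages = 0
    offset = 0
    while offset < len(data):
        # One message carries only as much data as will fill the input buffer.
        input_buffer = data[offset:offset + buffer_size]
        messages += 1
        # The receiver must "scatter" (empty) the buffer before more data is sent.
        received.extend(input_buffer)
        # Flow control: the sender advances its window only after that point.
        offset += len(input_buffer)
    return received, messages

data = list(range(10))
received, messages = transfer_with_window(data, buffer_size=4)
print(messages)  # 3
```

Ten items sent through a four-item input buffer require three separate messages, each of which must wait for the preceding buffer to be emptied.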
Thus, a need exists in the art for a protocol, particularly apparatus and an accompanying method therefor, for use in a distributed processing environment, with particular though not exclusive attractiveness in a massively parallel processing system, for efficiently handling inter-processor element transfers of data with arbitrarily varying strides. Such a protocol should preferably eliminate the need for routing incoming and outgoing message data through I/O buffers in each processing element. Removal of these buffers would advantageously eliminate the need to copy the data, both on transmission and reception, thereby significantly reducing processing time, i.e. overhead, required to facilitate message passing. This, in turn, would free processing time for each such element thereby increasing application throughput of that element and concomitantly of the entire system. In addition, by eliminating these buffers, each processing element would become simpler, and require less circuitry and cost than has been required heretofore. This, in turn, would reduce the cost of the entire processing system. Moreover, by removing these buffers, the physical size of each element could be advantageously reduced, which, given the sheer number of such elements used in a massively parallel processing system, could, among other benefits, advantageously and significantly reduce the size of the entire system.