1. Field of the Invention
This invention relates to a distributed computing architecture and, more particularly, to a distribution element for configuring the distributed computing architecture wherein the distribution element, through its plurality of independent input/output ports, implements lossless flow control without reservation and direct multicast transfer capability.
2. Description of the Background
The development of computing and communication technology in the last decade has resulted in very high speed individual processors and high speed data transfer media. The emerging new generation of distributed applications, however, still lacks a communication architecture that can efficiently exploit the high performance of data transfer and computing devices. In addition, any new architecture has to support the heterogeneity of communicating agents and satisfy their needs in a transparent way. The architecture should also provide a functional interface to the applications which will significantly reduce distributed software development cost, which is already surpassing the cost of hardware communication installations by at least an order of magnitude.
Currently, data transfer networks lack both performance and functionality. Network installations have transfer speeds of up to 10 Mb/s, and even lower effective bandwidth because of protocol complexity. This performance is one to two orders of magnitude lower than the requirements of emerging distributed applications. Current networks are based on a message passing paradigm, which is fundamentally different from the distributed program paradigm of shared objects. Current interconnection techniques do not support efficient sharing and coordination, nor do they provide simple integration of heterogeneous applications. The lack of these functionalities is compensated for by complex protocols executed by communicating nodes, imposing fundamental performance bottlenecks which cannot be solved within the framework of current architectural alternatives. This problem is exacerbated because the host computers are independent units, usually of different architecture, performance, and even operating systems. Also, the hosts are distributed. Moreover, the interconnection medium between hosts can be of any type and speed.
Unfortunately, the computing power offered by numerous distributed computers is not utilized well. Although many computers in the network can be idle, those users with high computing requirements may be eager for resources that exceed the capabilities of their individual workstations and yet be unable to use the power of the units attached to the network. This situation obtains because there are no networks that enable sharing of multiple computing resources efficiently and transparently to the application. If a user wants to exploit the power of the units in the network, the user has to create an application that can be scheduled on multiple units for concurrent execution. Such an application would have to incur the overhead of parallel execution in order to coordinate synchronization and communication between individual computing tasks. This coordination assumes the user's knowledge of the underlying architecture, which the user usually does not have and does not wish to obtain.
The objective of a distributed system design is to abstract the network so that it appears to the user as a single computer, although it enables use of computing units on the network based on the user's requirements and the availability of units. The multiplicity and distribution of units should be transparent to the user. This network abstraction has to be supported by the architecture of interconnects that comprises the network and by the network operating system.
Current distributed networks are based on devices that were designed specifically to connect two remote units. The networks are thus designed to explicitly enable well-defined communication between point A and point B, where all that point A was asking from the network was to send information to point B. The world of today's distributed applications is very different from this simple communication paradigm. The environment of distributed applications is based on a set of "shared objects", where the objects can be placed at a single site or multiple remote sites. Certain operations on objects performed by one unit can have as side effects communications with the other units in the distributed system. Very often the existence and the location of these units are not even known. The communication patterns between the units can be very diverse: one-to-many, many-to-one, one-to-one among many and so forth. The current networks simply do not provide adequate functions to fulfill requirements of distributed applications. These functions include: a different addressing mechanism that reflects the notion of sharing; reliable and lossless data transfer; direct support for different communication patterns; support for efficient change of communication patterns; and a synchronization mechanism that will efficiently control the sharing of objects.
In order to implement distributed applications on current network installations, the user community is attempting to devise arrangements to overcome this lack of functions by building network operating systems that bridge the gap between the requirements of distributed applications and the functions of current interconnects. Such arrangements provide applications with the necessary functions by abstracting the point-to-point paradigm of the networks into the shared object world seen by applications. The fundamental problem of this approach is that operating system involvement is very inefficient: not only are many frequently executed communication and synchronization functions executed by software, but all that the interconnects provide to the network operating system is still just unreliable and lossy point-to-point transfer.
Furthermore, the control of even the simplest point-to-point communication pattern is very complex in current networks. The flow control and routing are resolved by the high level communication protocols, rather than directly by distribution elements. Not only is protocol execution slow because it is performed by software, but it is mostly complex because current control methods allow spreading of the congestion and control information throughout the network and may involve all units in the system. Data losses are unavoidable, and their recovery slow and complex.
Functional requirements of distributed applications cannot be efficiently supported by distribution elements if losses are allowable in the network. For instance, the semantics of a shared object update operation are severely corrupted if data is lost on the way to one of the units but not on the way to the others. In order to preserve the semantics of such an operation in a lossy environment, either the effect of the operation must be canceled, or acknowledgments must be exchanged among all the units that share the object before the operation takes effect.
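By way of illustration only, the following minimal Python sketch shows how a single lost copy of a multicast update leaves the units sharing an object with inconsistent values. The names multicast_update, is_consistent, and lost_at are hypothetical and are not part of any architecture described herein; the sketch models each unit's copy of the shared object as a dictionary.

```python
def multicast_update(replicas, key, value, lost_at=()):
    """Apply an update to every unit's copy, except units whose copy was lost in transit."""
    for unit, store in replicas.items():
        if unit not in lost_at:
            store[key] = value


def is_consistent(replicas):
    """Return True when every unit holds an identical copy of the shared object."""
    stores = list(replicas.values())
    return all(store == stores[0] for store in stores)


if __name__ == "__main__":
    # Three hypothetical units sharing one object with a single field "x".
    replicas = {"A": {"x": 0}, "B": {"x": 0}, "C": {"x": 0}}

    multicast_update(replicas, "x", 1)               # lossless delivery
    print(is_consistent(replicas))

    multicast_update(replicas, "x", 2, lost_at={"B"})  # copy to unit B is lost
    print(is_consistent(replicas), replicas)
```

After the second update, unit B still holds the old value while units A and C hold the new one, which is the corruption of update semantics described above.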
The applications that require high communication network bandwidth belong to two classes. The first class relates to applications that involve the exchange of large amounts of data between geographically dispersed agents that can produce and/or consume data at the rate of hundreds of Mb/s. Computing, data acquisition, and display devices that operate at this data rate have become very common, as is best illustrated by the fact that high performance supercomputers, workstations, and peripherals are coming out with a HIPPI 800 Mb/s interface. The other class comprises computationally-intensive applications that were traditionally executed on very expensive parallel supercomputers, and whose computational requirements can now be satisfied with a large number of powerful workstations, which are available in many organizations, provided that they are interconnected with high bandwidth networks.
The recent advancement in optical amplifier technology has made the design of Gb/s fiber links feasible at reasonable cost. The improvement in semiconductor technology, where sub-nanosecond gate delays are already a reality, has made possible commercial designs of very fast transmitters and receivers with parallel-to-serial converters (and vice versa) that support rates of up to 1 Gb/s. The transfer rates across point-to-point links are much higher than the rates across busses realized with the same transmission medium, due to the simpler access protocol and electrical interfaces.
The speed of high performance network hosts has also dramatically improved in recent years. High performance supercomputers capable of performing 1 billion operations per second are already a reality. Supercomputers that produced 10 to 50 MFLOPS required approximately 10 Mb/s of I/O bandwidth, so it is expected that contemporary supercomputers will require I/O bandwidth at Gb/s rates.
With high speed transmission links and hosts capable of driving them at the Gb/s data rate, the speed at which communication protocols are executed in order to minimize packet loss and buffer overflow becomes a bottleneck of high speed distributed computing systems.
As already alluded to, current networks are based on an I/O network abstraction. Every object transferred through the network requires an explicit I/O request by the application. Since I/O operations are protected operations not directly accessible to the application programs, application I/O requests are translated into operating system calls. The operating system and the network communication control programs are responsible for bridging the gap between the application, which is essentially based on memory abstraction, and network communication, which is based on I/O abstraction. The transformation between the two abstractions is performed by complicated multilayered protocols that require a substantial amount of processing and copying of data. The protocol overhead is the reason why interprocess communication over the network has achieved only a small portion of the network bandwidth, even if the fastest processors are used for today's relatively slow (10 Mb/s) networks. As indicated above, the new data transmission technology offers transmission rates two orders of magnitude larger (approximately Gb/s), whereas processing speed has not improved by more than a factor of 10. Thus, it can be expected that protocol overhead will be an even larger bottleneck in the faster networks.
Why does the difference between the memory abstraction of the application and the I/O abstraction of network communication require such an extensive amount of processing and memory accesses? First, the necessity of operating system involvement in I/O transfer is time consuming due to context switching and data copying. Second, memory abstraction views data as a persistent and consistent global name space. By contrast, the I/O abstraction of current networks has an inherent notion of loss, so the application's requirement for persistence must be maintained through complex and time consuming loss detection and recovery mechanisms. Furthermore, addressing is based on the physical locations of destinations, and an application's view of a global name space is maintained through numerous translations of global names into physical network addresses. There is no provision for consistency by network protocols, and its support is left to the application or operating system.
Loss in current networks occurs for the following reason. Bandwidth for the required data transfer is not reserved on the whole path between source and destination, since doing so would be time consuming and would result in a large waste of communication bandwidth. Rather, data is partitioned into individual packets, and each packet is routed separately based on the physical address of the destination and on an assumption about the bandwidth availability of the links between neighboring nodes in the network. Since there is no explicit information about bandwidth availability, but only an assumption based on statistical reasoning, losses due to data congestion are unavoidable.
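By way of illustration only, the following minimal Python sketch models one congested link with a finite buffer at the downstream node. In the first mode the source forwards packets on the assumption that buffer space is available, and packets arriving at a full buffer are dropped. In the second mode the source has explicit knowledge that the buffer is full and holds the packet, which is a generic backpressure scheme assumed here for contrast only; it is not presented as the flow control mechanism of any particular distribution element. The function name simulate and all of its parameters are hypothetical.

```python
from collections import deque


def simulate(arrivals_per_tick, drains_per_tick, buffer_size, ticks, backpressure):
    """Return (delivered, dropped, outstanding) for one link with a finite buffer.

    outstanding counts packets still held at the source or queued in the buffer.
    """
    buffer = deque()
    delivered = 0
    dropped = 0
    pending = 0  # packets produced by the source but not yet placed in the buffer

    for _ in range(ticks):
        pending += arrivals_per_tick

        # The source forwards packets toward the downstream buffer.
        while pending > 0:
            if len(buffer) < buffer_size:
                buffer.append(1)
                pending -= 1
            elif backpressure:
                break            # buffer is known to be full: hold at the source
            else:
                dropped += 1     # availability was only assumed: packet is lost
                pending -= 1

        # The downstream node forwards as many packets as its outgoing link allows.
        for _ in range(min(drains_per_tick, len(buffer))):
            buffer.popleft()
            delivered += 1

    return delivered, dropped, pending + len(buffer)


if __name__ == "__main__":
    print(simulate(3, 2, 4, 10, backpressure=False))
    print(simulate(3, 2, 4, 10, backpressure=True))
```

With three packets arriving and two departing per tick through a four-packet buffer, ten ticks yield eight dropped packets in the first mode and none in the second; in the second mode the excess packets wait at the source instead of being lost.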
Thus, in order to simplify the interface to the application and to reduce protocol complexity, the architecture of communication devices should be based on memory abstraction. For that purpose, the architecture and its implementing device should provide means for lossless communication, routing based on the global name space, and mechanisms for maintaining consistency of the global name space.
The network control functions that have not been provided for effectively in the prior art and that, accordingly, must be addressed by any newly devised architecture and concomitant implementing element are now discussed in further detail.