In conventional computing, a single computer system is often utilized to perform operations on data. The operations may be performed by a single processor, or central processing unit (CPU) within the computer. The operations performed on the data may include numerical calculations, or database access, for example. The CPU may perform the operations under the control of a stored program containing executable code. The code may include a series of instructions that may be executed by the CPU that cause the computer to perform specified operations on the data. The capability of a computer in performing operations may variously be measured in units of millions of instructions per second (MIPS), or millions of operations per second (MOPS).
Historically, increases in computer performance have depended on improvements in integrated circuit technology, often described by “Moore's law”. Moore's law observes that the number of transistors on an integrated circuit increases at a predictable, and approximately constant, rate over time, which has historically translated into steady improvements in device speed. However, technology limitations may begin to limit the ability to maintain predictable speed improvements in integrated circuit devices.
Another approach to increasing computer performance implements changes in computer architecture, for example, the introduction of parallel processing. In a parallel processing approach, a computer system may utilize a plurality of CPUs that work together to perform operations on data. Parallel processing computers may offer computing performance that increases as the number of parallel processing CPUs is increased. However, the size and expense of parallel processing computer systems tend to make them special purpose computer systems. This may limit the range of applications in which the systems may be feasibly or economically utilized.
An alternative to large parallel processing computer systems is cluster computing. In cluster computing, a plurality of smaller computers, connected via a network, may work together to perform operations on data. Cluster computing systems may be implemented, for example, utilizing relatively low cost, general purpose, personal computers or servers. In a cluster computing environment, computers in the cluster may exchange information across a network in a manner similar to the way that parallel processing CPUs exchange information across an internal bus. Cluster computing systems may also scale to include networked supercomputers. The collaborative arrangement of computers working cooperatively to perform operations on data may be referred to as high performance computing (HPC).
Cluster computing offers the promise of systems with greatly increased computing performance relative to single processor computers by enabling a plurality of processors distributed across a network to work cooperatively to solve computationally intensive computing problems.
In many distributed computing systems, a computer may communicate information to each of the other computers in the computing cluster. One method for communicating the information may utilize multicasting. Some conventional distributed cluster computing systems implement multicasting in application gateway servers. The computer that originates the multicast, referred to as the source computer, may send information to the application gateway server. The set of computers that are to receive the information may be referred to as a multicast group. The application gateway server may store a copy of the received information and subsequently communicate the information to each of the computers in the multicast group. The application gateway server may communicate the information to the multicast group via a reliable communication protocol, for example, transmission control protocol (TCP). Upon receiving indications that each of the computers in the multicast group has received the information, the application gateway server may no longer be required to store the information, and the information may be released from storage at the application gateway server. In large cluster computing systems, the quantity of storage required at the application gateway server may impose a burden that may reduce the performance and/or cost effectiveness of the cluster computing system. This burden is imposed even when message storage is limited to non-persistent media, such as system memory.
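The store-and-forward behavior described above may be illustrated with a minimal Python sketch. It is not taken from any particular system: the class name `ApplicationGateway`, its methods, and the `send` callback are hypothetical names introduced only for illustration, and the reliable transport (for example, TCP) is assumed to be hidden behind `send`.

```python
from dataclasses import dataclass


@dataclass
class PendingMessage:
    payload: bytes
    unacked: set  # group members that have not yet confirmed receipt


class ApplicationGateway:
    """Hypothetical store-and-forward multicast gateway (illustrative only)."""

    def __init__(self, groups):
        self.groups = groups   # group name -> set of member names
        self.pending = {}      # message id -> PendingMessage
        self.next_id = 0

    def multicast(self, group, payload, send):
        """Store a copy of the payload, then send it to every group member."""
        msg_id = self.next_id
        self.next_id += 1
        members = set(self.groups[group])
        if not members:
            return msg_id      # nothing to deliver, so nothing to store
        self.pending[msg_id] = PendingMessage(payload, members)
        for member in sorted(members):
            send(member, msg_id, payload)  # assumed reliable transport
        return msg_id

    def acknowledge(self, msg_id, member):
        """Record a receipt; release the stored copy once all have confirmed."""
        message = self.pending[msg_id]
        message.unacked.discard(member)
        if not message.unacked:
            del self.pending[msg_id]

    def stored_bytes(self):
        """Total payload bytes the gateway is currently holding."""
        return sum(len(m.payload) for m in self.pending.values())
```

In this sketch the gateway holds every payload until the slowest member of the multicast group acknowledges it, so `stored_bytes()` grows with both the message rate and the group size. This is the storage burden noted above.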
In addition to distributing information among computers in a computing cluster, the processing tasks performed by each of the computers may be coordinated. Coordinating the tasks performed by each of the computers in a computing cluster is referred to as synchronization. Synchronization involves dividing a computing task into stages, referred to as epochs. The computers in the computing cluster may each perform different tasks in a portion of their respective epochs, and may operate on different portions of aggregated data in a given epoch. In some cases, however, the ability of a computer to begin a subsequent epoch is dependent upon another computer in the cluster having completed a prerequisite epoch. The computer may rely upon the results of data processed during the prerequisite epoch when performing further processing in the subsequent epoch.
In some conventional distributed cluster computing systems, the problem of synchronization may be addressed by utilizing semaphores, tokens, or other locking techniques that enable a computer to perform an operation when a precondition has been satisfied. The completion of a prerequisite epoch is an example of a precondition that may be required to be satisfied.
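A semaphore-based precondition of this kind may be sketched as follows. The sketch assumes two threads within one Python process standing in for two computers in a cluster; the function name `run_epochs` and the arithmetic performed in each epoch are hypothetical and chosen only to make the dependency visible.

```python
import threading


def run_epochs():
    """Run a dependent epoch only after its prerequisite epoch completes."""
    results = {}
    # Counter starts at 0, meaning the precondition is not yet satisfied.
    prerequisite_done = threading.Semaphore(0)

    def prerequisite_epoch():
        results["epoch1"] = sum(range(10))  # stand-in for real processing
        prerequisite_done.release()         # signal: precondition satisfied

    def subsequent_epoch():
        prerequisite_done.acquire()         # block until the signal arrives
        results["epoch2"] = results["epoch1"] * 2

    # Start the dependent worker first to show that it waits.
    waiter = threading.Thread(target=subsequent_epoch)
    worker = threading.Thread(target=prerequisite_epoch)
    waiter.start()
    worker.start()
    waiter.join()
    worker.join()
    return results
```

Calling `run_epochs()` returns `{"epoch1": 45, "epoch2": 90}` regardless of which thread is scheduled first, because the second epoch cannot read `results["epoch1"]` until the semaphore has been released. In an actual cluster the signal would have to cross the network, which a local semaphore does not model.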
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with some aspects of the present invention as set forth in the remainder of the present application with reference to the drawings.