There are a number of situations in which it is desired or required to synchronize the actions of multiple computers. This invention arises in the context of a continuous media server being developed by Microsoft® Corporation to independently supply multiple video streams from a cable system headend to set-top boxes in individual subscribers' homes. One architecture under investigation uses a plurality of cooperating data handlers to supply the video streams. Each data handler comprises a conventional computer with a high-capacity disk storage system.
Superior-quality performance is vital to a video service, and repairs and maintenance must be invisible to users. The Microsoft® architecture features optimal fault tolerance by eliminating all single points of failure and incorporating a repair strategy that minimizes support time. No single hardware failure can shut down the system, so viewers always receive uninterrupted service. In the event of a hardware component failure, the system automatically reconfigures itself to prevent any disruption in data distribution. For example, if one data handler fails, the system redistributes that data handler's responsibilities to other data handlers. Once repairs are made, the system automatically reintegrates the repaired data handler into operation while maintaining continual data flow to viewers.
This performance is achieved by distributing data and responsibilities among the plurality of data handlers. For instance, the data for a single video stream is distributed over all the data handlers. The data handlers cooperate to merge their various portions of data into a single continuous data stream. In many cases, it might seem desirable to have a single controller computer for synchronizing and directing the data handlers. However, this would result in a single, critical point of failure. Instead of using this approach, the Microsoft® architecture distributes control responsibilities among the data handlers themselves.
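By way of illustration only, the distribution of a single stream over the data handlers and its subsequent merging might be sketched as follows. The round-robin layout and the names `stripe` and `merge` are assumptions made for this example; the description above does not specify how segments are assigned to data handlers.

```python
from typing import List


def stripe(segments: List[bytes], num_handlers: int) -> List[List[bytes]]:
    """Assign consecutive segments to handlers in round-robin order.

    Round-robin placement is an assumption for this sketch only.
    """
    stores: List[List[bytes]] = [[] for _ in range(num_handlers)]
    for index, segment in enumerate(segments):
        stores[index % num_handlers].append(segment)
    return stores


def merge(stores: List[List[bytes]]) -> bytes:
    """Rebuild one continuous stream by visiting each handler in turn."""
    pieces: List[bytes] = []
    longest = max((len(store) for store in stores), default=0)
    for round_no in range(longest):
        for store in stores:
            # A handler may hold one segment fewer than its neighbors.
            if round_no < len(store):
                pieces.append(store[round_no])
    return b"".join(pieces)
```

Under this assumed layout no single data handler holds the whole stream, so each one must deliver its segment at the correct moment for the merged output to be continuous.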
The data handlers are logically connected in a ring topology for data communications with each other. When creating a continuous data stream, a first data handler supplies the first segment of data and then passes responsibility to a second data handler in the ring. The second data handler supplies its data and passes to the third, and so on. Messages are continually passed around the ring to coordinate tasks.
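The hand-off of responsibility around the ring may be illustrated with the following minimal sketch. The class `DataHandler`, the functions `link_ring` and `play_stream`, and the in-memory segment lists are hypothetical names introduced for this example; an actual system would pass messages between separate computers rather than follow object references.

```python
from typing import Dict, List, Sequence, Tuple


class DataHandler:
    """Hypothetical stand-in for one data handler in the logical ring."""

    def __init__(self, name: str, segments: Sequence[str]) -> None:
        self.name = name
        self.segments = list(segments)
        self.next: "DataHandler" = self  # successor in the ring


def link_ring(handlers: List[DataHandler]) -> None:
    """Connect the handlers so that the last one points back to the first."""
    for i, handler in enumerate(handlers):
        handler.next = handlers[(i + 1) % len(handlers)]


def play_stream(first: DataHandler, total_segments: int) -> List[Tuple[str, str]]:
    """Each handler supplies one segment, then passes responsibility onward."""
    output: List[Tuple[str, str]] = []
    position: Dict[str, int] = {}  # next unread segment for each handler
    current = first
    for _ in range(total_segments):
        pos = position.get(current.name, 0)
        output.append((current.name, current.segments[pos]))
        position[current.name] = pos + 1
        current = current.next  # hand off to the succeeding handler
    return output
```

In this sketch the only coordination is the hand-off itself, which reflects the absence of a central controller in the arrangement described above.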
A failure in an individual data handler is handled by bypassing it--the failed data handler is simply cut out of the communications ring and the two surrounding data handlers henceforth communicate directly with each other. Repaired data handlers simply insert themselves into the communications ring.
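The bypass and reinsertion operations may be illustrated with a doubly linked ring, as in the following sketch. The names `RingNode`, `insert_after`, `bypass`, and `ring_order` are assumptions made for this example, and the sketch omits failure detection, which the description above does not address.

```python
from typing import List


class RingNode:
    """Hypothetical ring member; a new node starts as a ring of one."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.prev: "RingNode" = self
        self.next: "RingNode" = self


def insert_after(existing: RingNode, new: RingNode) -> None:
    """Insert a new or repaired node between `existing` and its successor."""
    new.prev = existing
    new.next = existing.next
    existing.next.prev = new
    existing.next = new


def bypass(failed: RingNode) -> None:
    """Cut a failed node out so that its two neighbors communicate directly."""
    failed.prev.next = failed.next
    failed.next.prev = failed.prev
    failed.prev = failed.next = failed  # isolate the failed node


def ring_order(start: RingNode) -> List[str]:
    """List the node names in ring order, beginning at `start`."""
    names = [start.name]
    node = start.next
    while node is not start:
        names.append(node.name)
        node = node.next
    return names
```

For example, a ring of A, B, and C becomes A and C after `bypass` is applied to B, and a subsequent `insert_after(a, b)` restores the original order. Because membership can change in this manner at any time, no data handler can assume that a particular peer is present.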
In order to merge their data, the data handlers must be in very close time synchronization. The particular ring-type architecture of the system, however, makes such synchronization difficult. Many synchronization systems rely on a "master" time-keeper computer, to which all other computers in the system synchronize. In the dynamically configured system described above, however, having a dedicated master is not desirable since it would introduce a single critical point of failure. Furthermore, it is not desirable from a control standpoint to allow different data handlers to act as masters at different times, or to allow each data handler to act as a master to a succeeding data handler. The prior art apparently does not include an adequate method for local clock synchronization in a dynamically changing distributed computing environment such as described above.