In update and query systems requiring real-time response, the fastest access to data is guaranteed by holding the data in physical memory. Where that data is crucial to the operation of the querying device, redundancy is also implemented so that the failure of a single hardware element storing the data does not prevent subsequent queries from succeeding. The problem is compounded in systems that are highly distributed and where the storage of data is decentralized among peer devices.
Similar problems have been solved in a variety of ways. Many of these solutions rely on a master source for the stored data, or at least on an ability to re-fetch that data. Such is the case with network caching equipment: in the event of a cache failure, a backup cache simply re-fetches the data from the originating store. Commercial databases use replication schemes, journaling, and disk-based backups with periodic push/update techniques and write-through secondary servers. However, these approaches depend upon a fairly centralized storage system.
A traditional fault-tolerant system uses N+1 devices, where N devices carry the capacity and the +1 device is in a hot standby mode. When a failure occurs in one of the N devices, the +1 device takes over, but it must disrupt the system to learn the state of the device it is replacing, since it cannot know the state of every device in the system that it might have to replace. As a result, the industry has moved to a 1+1 scheme, in which every device has its own dedicated backup that maintains its partner's state, so that failover is seamless and does not disrupt the system. However, in this scheme half of the devices sit idle, and the total system requires 2N devices for implementation.
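The device counts described above can be made concrete with a short calculation. The following is only a minimal sketch: the function name redundancy_cost and the scheme labels are hypothetical, chosen for illustration, and it counts devices only, ignoring the takeover disruption of the N+1 scheme.

```python
def redundancy_cost(n, scheme):
    """Return (total_devices, idle_devices) for n capacity-carrying devices.

    Hypothetical helper for illustration only.
    """
    if scheme == "N+1":
        idle = 1   # one shared hot standby for all N devices
    elif scheme == "1+1":
        idle = n   # one dedicated backup per device, 2N in total
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return n + idle, idle


if __name__ == "__main__":
    for n in (4, 16):
        for scheme in ("N+1", "1+1"):
            total, idle = redundancy_cost(n, scheme)
            print(f"{scheme}, N={n}: {total} devices, {idle} idle ({idle / total:.0%})")
```

Under these assumptions the idle fraction of the N+1 scheme shrinks as N grows, while the 1+1 scheme always leaves half of the devices idle.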
Another scheme, RAID redundancy, does not require 2N devices but uses a centralized controller. Various categories of redundancy can be configured at the controller to mirror data between storage devices and to tolerate individual hardware failures. However, a failure of the controller would have a devastating effect on the network employing this scheme. Thus, it would be desirable to provide a scheme that avoids system disruptions in the event of a failure while reducing the number of devices that sit idle during normal operation.