Computer systems may operate in a distributed environment in which many different systems or nodes communicate across a network to perform common tasks. In particular, a distributed system is a collection of independent computers that appears to its users as a single coherent system; the differences between individual system members, as well as the system's internal organization, may be hidden. Furthermore, users and applications may interact with a distributed system in a consistent and uniform way, regardless of where and when the interaction takes place. These systems may also be substantially scalable because no single element may control the system's behavior. Further, distributed systems may be continuously available even though an individual member may become temporarily or permanently unusable. Users or applications of a distributed system may not notice or control system maintenance or system performance. In short, a distributed system may easily connect users with resources, may hide the fact that resources are distributed across a network, may be open, and may be scalable.
The Internet Domain Name System (DNS) is an example of a widely used distributed system. DNS is primarily used for looking up Internet host addresses and mail servers. The system is hierarchically organized as a rooted tree. Each node or leaf in the tree is associated with resource records that hold the information associated with the domain name. One of the functions of the DNS is associating Internet protocol addresses with domain names. The DNS is implemented as a hierarchical set of servers in which each domain or subdomain has one or more authoritative DNS servers that publish information about that domain and the name servers of any domains beneath it. To resolve a domain name, a local host need only have information concerning the topmost DNS server. The local host will query the topmost DNS server for the location of a particular domain. The topmost DNS server, in turn, will point the requesting entity to a lower-level server of the DNS, and this referral process repeats until the domain name is finally resolved. Therefore, the task of resolving Internet domain names may be distributed across several nodes of the DNS.
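The referral process described above may be illustrated with a minimal sketch. It is a toy, in-memory model and not a real DNS implementation; the server names, the SERVERS table, and the resolve function are hypothetical and chosen only for illustration, and the address comes from a range reserved for documentation.

```python
# Toy, in-memory model of hierarchical name resolution (not real DNS).
# Each hypothetical "server" either holds an address record for a name or
# refers the caller to the server responsible for a more specific suffix.
SERVERS = {
    "root": {"referrals": {"com": "com-server"}, "records": {}},
    "com-server": {"referrals": {"example.com": "example-server"}, "records": {}},
    "example-server": {"referrals": {}, "records": {"www.example.com": "192.0.2.10"}},
}


def resolve(name, start="root", max_referrals=10):
    """Follow referrals from the topmost server until a record is found.

    Returns (address or None, list of servers that were queried).
    """
    server = start
    path = [server]
    for _ in range(max_referrals):
        zone = SERVERS[server]
        if name in zone["records"]:
            return zone["records"][name], path
        # Pick the longest referral suffix that matches the queried name.
        matches = [s for s in zone["referrals"] if name == s or name.endswith("." + s)]
        if not matches:
            return None, path
        server = zone["referrals"][max(matches, key=len)]
        path.append(server)
    return None, path


address, path = resolve("www.example.com")
print(address, path)
```

In this sketch the requesting host knows only the topmost server, and each query either returns the record or points the host one level further down the tree, so the resolution work is spread across several servers.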
The Peer Name Resolution Protocol (PNRP) is another example of a distributed system. Peer-to-peer (P2P) distributed systems such as PNRP may be composed of many individual peer computers called nodes. PNRP may enable each node to dynamically publish and resolve names, register multiple names on a node, register multiple computers to a single name, and register names for applications.
Regardless of the distributed system, network variables that may be generally transparent to the individual user may determine how data propagates through the system and how each node processes the data. For example, a node “timeout” variable may define how long the node may try to connect to another node before treating the connection attempt as a success or a failure. Also, a system may share a common distributed hash table (DHT) of key/value pairs. Distributed hash tables may partition ownership of the keys among system nodes and may efficiently route messages or files to the unique owner of any given key. Each node may be associated with a “keyspace” or set of unique keys, and each node may store data associated with all keys in the node's keyspace. However, a system implementing a DHT may behave differently depending on the value of a network variable setting the size of the keyspace. For example, a network variable that assigns only one key to each node may result in very high data lookup costs, while a variable that assigns many keys to each node may require each node to store a large amount of data. For any distributed system using a DHT, the behavior of the overall system may be changed by adjusting the keyspace assignment variable either globally or locally.
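The partitioning of key ownership among nodes may be sketched as follows. This is a minimal illustration under stated assumptions and not any particular DHT protocol: the 16-bit identifier space, the use of SHA-256 to derive identifiers, the successor-style ownership rule, and the names to_id and ToyDHT are all hypothetical choices made for the example.

```python
import hashlib
from bisect import bisect_left

KEYSPACE_BITS = 16  # assumed size of the identifier space (2**16 identifiers)


def to_id(text):
    """Map a string to an identifier in the assumed keyspace."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (2 ** KEYSPACE_BITS)


class ToyDHT:
    """Single-process stand-in for a DHT; real nodes would be remote peers."""

    def __init__(self, node_names):
        # Sort nodes by identifier so ownership can be found by binary search.
        self.ring = sorted((to_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {n: {} for n in node_names}

    def owner(self, key):
        """Return the node whose id is the first at or after the key's id."""
        index = bisect_left(self.ids, to_id(key)) % len(self.ring)
        return self.ring[index][1]

    def put(self, key, value):
        self.store[self.owner(key)][key] = value

    def get(self, key):
        return self.store[self.owner(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
for i in range(100):
    dht.put(f"key-{i}", i)
# Per-node storage load: fewer nodes means each node's keyspace is larger.
print({name: len(items) for name, items in dht.store.items()})
```

Each key has exactly one owner, so a lookup only needs to reach that owner. Under these assumptions, changing the number of nodes changes how much of the identifier space, and therefore how much data, each node is responsible for.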
Finally, P2P systems may maintain a routing table at each node to direct messages traveling through the network. However, a node may not need to store a routing table describing every system node to achieve a suitable level of accurate message routing. A system allowing a very large routing table may route messages more efficiently because each node will have more information about the location of a destination node. However, maintaining the accuracy of a large routing table may be too cumbersome for the system. Each time the routing table must be refreshed, the table information must be transmitted to each participating node. As more nodes are represented in the routing table, more nodes must necessarily receive the new routing table information. Likewise, a system allowing a small routing table may be less cumbersome to maintain, but messages may take much longer to arrive at their destination node because each node has only a limited amount of information concerning the location of other nodes on the network. Further, this increase in transmission time may result in more lost or “timed out” messages, depending on the value of the previously described system timeout variable. As a message takes more time to propagate through the network, its transit time may exceed the value of the network timeout variable. Many other system variables that are transparent to the user may affect the overall performance of the distributed system.
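The interaction between routing table size and the timeout variable may be illustrated with a small sketch. It assumes a simplified ring of numbered nodes in which each routing table holds peers at power-of-two offsets, a fixed per-hop delay, and a table size of at least one entry; these assumptions, and the function names hops_to_deliver and is_delivered, are hypothetical and are not taken from any specific P2P system.

```python
def hops_to_deliver(source, destination, node_count, table_size):
    """Count greedy hops on a ring where each node knows `table_size` peers.

    Assumption: a node's routing table holds the peers at clockwise offsets
    1, 2, 4, ..., 2**(table_size - 1); table_size must be at least 1.
    """
    offsets = [2 ** i for i in range(table_size)]
    remaining = (destination - source) % node_count
    hops = 0
    while remaining > 0:
        # Take the largest known offset that does not overshoot the target.
        step = max(o for o in offsets if o <= remaining)
        remaining -= step
        hops += 1
    return hops


def is_delivered(hops, per_hop_seconds, timeout_seconds):
    """Treat a message as timed out if its total transit exceeds the timeout."""
    return hops * per_hop_seconds <= timeout_seconds


# Compare a small and a larger routing table on a 64-node ring.
for size in (1, 3, 6):
    hops = hops_to_deliver(0, 63, node_count=64, table_size=size)
    print(size, hops, is_delivered(hops, per_hop_seconds=0.05, timeout_seconds=1.0))
```

In this model a node that knows only its immediate neighbor needs 63 hops to reach the farthest node and exceeds the assumed one-second timeout, while a node with six table entries needs 6 hops. The sketch does not model the maintenance cost of refreshing larger tables, which the text identifies as the opposing pressure.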
As explained above, each network variable may individually affect the performance of the distributed system; however, each variable may also interact with other network variables. The variables may be changed, and the distributed system may be observed in operation or in simulation to determine the effect of the variable adjustment on system performance. The performance of a distributed system may be monitored or recorded by the method disclosed in U.S. application Ser. No. 11/089,894, “System and Method for Monitoring and Reacting to Peer-to-Peer Network Metrics” to Horton et al., the entire disclosure of which is hereby incorporated by reference. The distributed system performance may be generally described as a system “health index.”