The general advantages of distributed computing are well-known. As computing power becomes more widely available at lower prices, the most cost-effective approach to many problems may involve harnessing many connected processors together into one large system. Some computing problems, such as tracking retail sales and inventory, are inherently distributed. Distributing computing workloads may also improve reliability, since the failure of a single processor in a distributed system will not necessarily bring all work on a given problem to a halt.
A variety of tools are available for building distributed computing systems. Shared memory, remote procedure calls, "blackboards," event-driven modules, and other mechanisms allow communication between processes that are running in a cooperative manner in separate memory regions on one or more computers. These mechanisms, in combination with networking protocols, domain name systems, distributed operating systems, and distributed file systems, allow communication between processes running on separate computers and efficient use of resources by such processes. The Internet, local area networks, metropolitan area networks, wide area networks, wireless networks, satellite networks, optical networks, and other collections of connected and/or connectable computers provide processing power, memory, disk space, and other resources, including facilities for inter-process communication. A host of secret key, public key, and other cryptographic methods can be used to enhance the security of interprocess communications. Java, Ada, C++, assembly, and other programming languages or development environments support the creation, testing, and refinement of interrupt handlers, concurrent processes, threads, multiprocessing systems, exception handlers, and other concurrent and/or distributed programming constructs.
As a result, many different approaches to distributed computing have been tried, and even more have been proposed. Each distributed computing system, whether it has been implemented or not, embodies numerous design choices, making it one approach selected from an enormous universe of possibilities. Some of the most important design choices include deciding how the distributed processes communicate with one another and with users, how security constraints are defined and enforced, how and when processors and processes should be brought together and separated, how responsibility is divided between processes, how processes are updated to reflect new data or instructions, and how processes should detect and handle errors.
Each of these broad design questions leads to additional, more specific questions. For instance, determining how to match processors with processes typically involves (among other considerations) selection of a processor allocation algorithm. As explained in the text "Distributed Operating Systems" by Andrew S. Tanenbaum, ISBN 0-13-219908-4 (1995), this choice in turn involves key choices between deterministic versus heuristic algorithms, centralized versus distributed algorithms, optimal versus suboptimal algorithms, local versus global algorithms, and sender-initiated versus receiver-initiated algorithms.
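For illustration only, one point in this design space can be sketched in a few lines: a sender-initiated, distributed, heuristic allocation algorithm of the general kind surveyed by Tanenbaum, in which an overloaded node probes a few randomly chosen peers and transfers work to the first underloaded one. The threshold and probe limit below are assumed values, not parameters from any particular system.

```python
import random

# Hedged sketch of a sender-initiated, distributed, heuristic
# processor allocation algorithm: an overloaded node probes randomly
# chosen peers and hands a process to the first peer whose load is
# below a threshold. THRESHOLD and PROBE_LIMIT are assumptions.

THRESHOLD = 4    # assumed maximum acceptable queue length
PROBE_LIMIT = 3  # assumed number of peers probed before giving up

def select_target(local_load, peer_loads):
    """Return the index of a peer to receive a process, or None."""
    if local_load <= THRESHOLD:
        return None  # not overloaded; keep the process locally
    candidates = list(range(len(peer_loads)))
    random.shuffle(candidates)  # probe peers in random order
    for peer in candidates[:PROBE_LIMIT]:
        if peer_loads[peer] < THRESHOLD:
            return peer  # first underloaded peer accepts the work
    return None  # every probed peer was busy; run locally anyway
```

A receiver-initiated variant would invert the roles, with idle nodes probing peers for surplus work; the deterministic-versus-heuristic and centralized-versus-distributed choices noted above would likewise change the shape of this routine.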
When security, error control, communication, update propagation, and other broad design issues are considered in greater depth, they likewise give rise to a host of additional, more specific questions. Should memory be shared? If so, how should consistency between different copies of the "same" data be maintained? What sort of errors can be detected by a process, and how should each type of error be handled? What should a given process do itself, and what should it ask other processes to do? What formats should instructions and data be stored in? How closely should a given process be tied to the specific hardware and operating system of the computer it is running on? Should a program be loaded into memory for execution as one complete, self-contained block or should components be loaded only as they are needed? How should version control be accomplished?
The design task is made even more difficult by the fact that answering one design question in a particular way may change the importance of other questions or raise new ones. For instance, if a process is sufficiently independent of the hardware being used to avoid disk accesses after being launched, then the process can run both on diskless computers and on computers that have a local disk. Disk storage formats for use while the process runs then become irrelevant, and the options for recovering from serious errors differ from those that would be available if checkpoints could be logged to a local disk.
Known distributed system architectures answer design questions in very different ways, but some design questions tend to be answered in the same or similar ways in most systems. For instance, most systems contain a specialized process that is responsible for matching other processes with available processors. This "process manager" is also known by names such as the "scheduler," "load-balancer," "transfer manager," "usage table coordinator," "process queue manager," and "processor allocator." Decisions about how best to allocate processors are made by the process manager, sometimes with little or no input from the processes that will run on the assigned processors. Some systems include one process manager per processor or one per computer, rather than one for the entire distributed system, but all of the process managers in a given system typically use a single algorithm to match processes with processors.
Likewise, in most known distributed systems, the types of errors that can be detected are limited to (a) input errors, and (b) missing or unavailable resources. For instance, data input from a file, a socket, or a user can be checked for values outside a predetermined range, or it can be checked against another copy of the data. The contents of a network packet or a Java applet may be checked by calculating a checksum and comparing it with a checksum computed earlier. If the comparison detects an error, packet retransmission or applet reloading can be requested, or the user can be asked to supply different content.
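The checksum comparison described above can be sketched as follows. The packet layout assumed here (a four-byte CRC-32 header followed by the payload) is an illustrative assumption, not the format of any particular protocol.

```python
import zlib

# Minimal sketch of checksum-based input error detection: a CRC-32 is
# computed over the payload and compared against the value recorded
# when the packet was made. The 4-byte-header layout is an assumption.

def make_packet(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32 checksum."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def verify_packet(packet: bytes):
    """Return the payload if the checksum matches, else None
    (signalling that retransmission should be requested)."""
    stored = int.from_bytes(packet[:4], "big")
    payload = packet[4:]
    return payload if zlib.crc32(payload) == stored else None
```

A `None` result here corresponds to the retransmission or reloading request mentioned above.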
With regard to resource errors, a process may determine that files, such as dynamically linked library files or requested text files, cannot be found at the expected location and may then search other locations. A process may also determine that a telephone line, network socket, memory, disk space, or other requested resource is unavailable, and try several times to obtain the resource before warning the user or failing.
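The retry behavior just described can be sketched as a small loop. The `acquire` callable is a hypothetical stand-in for whatever operation obtains the telephone line, socket, or other resource; a real system would typically also back off between attempts.

```python
import time

# Hedged sketch of retrying a scarce resource several times before
# failing, as described above. `acquire` is a caller-supplied function
# (hypothetical); `delay` is a simple fixed pause between attempts.

def obtain_resource(acquire, attempts=3, delay=0.0):
    """Call `acquire` up to `attempts` times; return its result, or
    re-raise the last error if every attempt fails."""
    last_error = None
    for _ in range(attempts):
        try:
            return acquire()
        except OSError as error:
            last_error = error   # remember why this attempt failed
            time.sleep(delay)    # wait briefly before retrying
    raise last_error
```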
However, errors other than input or missing resource errors may go undetected until important information is corrupted or lost. In particular, processes do not typically detect corruption of their own internal structure while they are running, and instead of reacting gracefully to such errors, most processes fail catastrophically. Some processes do use exception handlers to limit the impact of serious errors after they occur, but still fail to detect corruption before the corrupted structures are relied upon.
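One way a process might check its own internal structure before relying on it, sketched purely for illustration and not as the architecture disclosed below, is to seal the structure with a checksum when it is built and re-verify the seal before each use:

```python
import pickle
import zlib

# Illustrative sketch of a process detecting corruption of its own
# internal structure before relying on it: a checksum recorded when
# the structure is built is re-verified on every lookup, so corruption
# is caught and reported gracefully rather than failing later.

class SelfCheckingTable:
    def __init__(self, entries):
        self._entries = dict(entries)
        self._seal = self._checksum()  # record the "known good" state

    def _checksum(self):
        data = pickle.dumps(sorted(self._entries.items()))
        return zlib.crc32(data)

    def lookup(self, key):
        if self._checksum() != self._seal:
            # corruption detected before the bad data is used
            raise RuntimeError("internal structure corrupted")
        return self._entries[key]
```

In this sketch a legitimate update would recompute the seal; only changes made behind the structure's back trigger the error.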
Most computer programs, whether capable of distributed processing or not, need to be updated from time to time. Depending on the program's architecture, updating can be time-consuming, error-prone, and/or inflexible. Many programs are provided to users as large self-contained pieces of code. Over time, these monolithic agglomerations of code have grown quite large for popular applications such as word processors and spreadsheets. Updating such a monolithic program often involves deleting all of the existing code, regardless of whether it is different in the new version, and installing the new version of the program. In some cases, object code "patches" are used instead, and the amount of code replaced is smaller, but patches are normally used only for localized changes to a program, not for fundamental or widespread changes.
Some programs are less monolithic, being split into a main routine such as an event handler loop and a collection of dynamically loaded components. An update may then involve merely replacing one or more relatively small components. Dividing functionality between components also has the advantage of allowing two or more different programs to use the same component. For instance, an email program and a word processor could use the same spell checking code. Once they are loaded, however, such components generally stay in memory until all programs using them have finished executing (and some persist even after that time). Thus, even when dynamically linked libraries are used, updates to a program's behavior can often be made only after the program finishes its current work and stops running.
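The component-replacement idea above can be illustrated with Python's `importlib` standing in for a dynamic-link mechanism; the module name used in this sketch is hypothetical. Note that, as the paragraph above observes, a reload of this kind only takes effect for code paths that go through the freshly loaded module object.

```python
import importlib

# Sketch of replacing a dynamically loaded component while the main
# routine keeps running, using importlib as a stand-in for a
# dynamic-link mechanism. Component names are hypothetical.

def load_component(name):
    """Load (or reload) the named component module and return it, so
    that an updated version on disk takes effect without restarting
    the enclosing program."""
    module = importlib.import_module(name)  # first load, or cached
    return importlib.reload(module)         # re-execute current source
```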
Some specialized programs reduce the need for updates by "learning" while running. For instance, neural net programs may alter the relative numeric weights assigned to connections between nodes in a neural net, thereby altering the program's response to inputs of a certain kind. Likewise, so-called "genetic" algorithms use permutations and optimality measurements to adjust successive generations of a program, eventually producing a program that is better than the initial program at optimizing some specified condition.
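The weight-adjustment "learning" described above can be illustrated with a single perceptron, the simplest case of such a network: the numeric weights on its input connections are nudged toward a target output, changing the program's response to inputs without any change to its code.

```python
# Illustrative sketch of neural-style "learning": a single perceptron
# adjusts its connection weights from (inputs, target) examples. The
# learning rate and epoch count are assumed values.

def train_perceptron(samples, weights, rate=0.1, epochs=20):
    """samples: list of (inputs, target) pairs with target 0 or 1."""
    for _ in range(epochs):
        for inputs, target in samples:
            total = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if total > 0 else 0
            error = target - output
            # shift each weight in proportion to its input
            weights = [w + rate * error * x
                       for w, x in zip(weights, inputs)]
    return weights

def predict(weights, inputs):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0
```

After training on examples of a linearly separable function (such as logical OR, with a constant bias input), the same code classifies those inputs correctly; only the weights have changed.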
But when the code for measuring genetic optimality or adjusting neuron weights needs to be changed, for instance, even programs that "learn" while they run are updated using conventional techniques. Thus, difficult programming may be needed to change the input types accepted by a neural network. Neural networks also do not readily perform computing system functions such as keeping track of time or other system resources.
Thus, it would be an advancement in the art to provide an improved system and method for distributed computing which allow processes to detect corruption of their own internal structures before relying on the corrupted structures during execution, and which allow processes to react gracefully to such errors.
It would also be an advancement in the art to provide a system and method for distributed computing which provide a processor allocation scheme that is more flexible and better tailored to the needs of individual processes than current schemes.
It would be an additional advancement in the art to provide a system and method for updating distributed computing processes in a way which is more powerful than using patches, more efficient than replacing entire monolithic programs, and more flexible than the limited behavior changes available through neural net and genetic programming "learning" methods.
It would be a further advancement if such a system and method could be implemented in a manner which is compatible with current networks and their protocols, and which takes advantage of suitable current programming language features and security methods.
An architecture for such a distributed computing system and method is disclosed and claimed below.