A distributed system is a collection of autonomous computing entities, hardware or software, connected by some communication medium. While often the computing entities are geographically dispersed, in some instances they might be separate processors in a multi-processor computer or even separate software routines executing in logically isolated memory space on the same computer. A computing entity need not be a traditional computer, but more generally can be any computing device, ranging from a large mainframe to a refrigerator or a cell phone. A distributed application is an application that executes on a distributed system and one in which parts of the application execute on distinct autonomous computing entities.
Whenever a distinct component of a distributed application requests something (e.g., a data value, a computation) of another component, the former is called a client and the latter is called a service. It is worth noting that the terms service and client are not exclusionary in that an item can be both a client and a service. For example, a routine that calculates the time between two events may be a client of a clock service; if the clock service then calls a routine that converts to Daylight Saving Time, the clock becomes a client and the Daylight Saving Time converter is its service.
FIG. 1 shows a typical distributed application of the existing art. There are two clients 2, 4 and four services 10, 12, 14, 16 that the clients 2, 4 might need. Each service has a service proxy 10a, 12a, 14a, 16a which is a module of mobile code that can be used by clients to invoke that service. A service proxy 10a, 12a, 14a, 16a contains the code needed by a client 2, 4 to interact with a service. For instance if a service is a digital camera on a robotic arm, the interfaces might include Initialize( ), Zoom( ), Rotate( ) and Get_Picture( ). The service proxy 10a, 12a, 14a, 16a may also provide the expected return values for the service, which might include error codes as well.
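The role of a service proxy can be illustrated with a minimal sketch of the camera example. Everything here is an assumption made for illustration: the class names, the numeric return codes, and the in-process callable that stands in for the communication medium are hypothetical and are not drawn from any particular infrastructure.

```python
# Hypothetical return codes a proxy might document for its clients.
OK, ERR_NOT_INITIALIZED, ERR_UNKNOWN_OP = 0, 1, 2


class Camera:
    """Stand-in for the service itself; in practice it would run elsewhere."""

    def __init__(self):
        self.ready = False
        self.zoom_level = 1.0
        self.angle = 0.0

    def handle(self, op, *args):
        """Process one request and return (code, payload)."""
        if op == "initialize":
            self.ready = True
            return OK, None
        if not self.ready:
            return ERR_NOT_INITIALIZED, None
        if op == "zoom":
            self.zoom_level = float(args[0])
            return OK, None
        if op == "rotate":
            self.angle = (self.angle + args[0]) % 360
            return OK, None
        if op == "get_picture":
            return OK, f"picture(zoom={self.zoom_level}, angle={self.angle})"
        return ERR_UNKNOWN_OP, None


class CameraProxy:
    """Code a client downloads; it hides how requests reach the service."""

    def __init__(self, transport):
        # transport: any callable that delivers (op, *args) to the service.
        self._send = transport

    def initialize(self):
        return self._send("initialize")[0]

    def zoom(self, factor):
        return self._send("zoom", factor)[0]

    def rotate(self, degrees):
        return self._send("rotate", degrees)[0]

    def get_picture(self):
        return self._send("get_picture")
```

A client holding a CameraProxy calls initialize(), zoom(), rotate() and get_picture() without knowing where the camera is or how requests are delivered to it; only the transport callable would change if the camera moved to another machine.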
Mobile code generally refers to a computer program that can be written on one platform and executed on numerous others, irrespective of differences in hardware, operating system, file system, and many other details of the execution environment. In addition to independence from the physical characteristics of the execution environment, a mobile program may move from one computer to another in the middle of its execution.
Mobile code may be pre-compiled, or compiled when it arrives at the execution platform. In the first case, numerous versions of the program must be written and compiled, then matched across run-time environments; this is mobile code in the letter, but not the spirit, of the definition. In addition, the same pre-compiled program cannot move from one platform to a different one during its execution. In the second, the program text may be distributed along with configuration scripts describing what to do in each execution environment. This distributes and delays the specificity of the pre-compiled option. The more interesting, and far more common, approach exploits a standard virtual machine, which finesses all the issues of platform heterogeneity. The virtual machine is a program that itself mitigates the machine dependencies and idiosyncrasies, taking the raw program text and compiling it into a binary executable.
In addition to clients 2, 4 and general services 10, 12, 14, 16, all distributed applications need some mechanism for clients 2, 4 to find services. Often such knowledge is assumed a priori, but many distributed applications use a look-up service 20. The look-up service 20 is a service with which the other services are registered or advertised as available for use by clients. In a simple system, where there is no attempt to coordinate replicas of services, each new service registers with the look-up service 20 (in the case of replicas, the onus falls on the client to resolve conflicts and ambiguity). When a service 10, 12, 14, 16 registers, it provides information telling clients 2, 4 how to find it. Commonly, this is a physical location such as an IP address and port number, but in the most modern systems this can be as powerful as giving the look-up service 20 a service proxy 10a, 12a, 14a, 16a, which is actual mobile code that clients 2, 4 can execute and use to invoke that service 10, 12, 14, 16. In this way, the service proxy 10a, 12a, 14a, 16a contains not just location information, but information for how to use the service 10, 12, 14, 16. While just as necessary for the client 2, 4 as location information, this has previously been assumed as a priori knowledge. When a client 2, 4 wishes to work with a service 10, 12, 14, 16 it finds it through the look-up service 20, downloads the service proxy 10a, 12a, 14a, 16a for that service 10, 12, 14, 16 from the look-up service 20, then uses the service proxy 10a, 12a, 14a, 16a to invoke the service 10, 12, 14, 16. The look-up service 20 may also have attributes of the services 10, 12, 14, 16, such as whether a service is a grouped service, what type of group it is, what its cost to use is, how accurate it is, how reliable it is, or how long it takes to execute. In such cases the clients 2, 4 can use the attributes to decide which of a number of services 10, 12, 14, 16 they wish to use.
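The registration and look-up steps described above can be sketched as follows. This is an illustrative simplification, not the interface of any real look-up service: the class LookupService, its methods, and the attribute names are hypothetical, and a proxy is modeled as a plain callable rather than downloaded mobile code.

```python
class LookupService:
    """Toy registry mapping a service type to proxies and their attributes."""

    def __init__(self):
        self._entries = []  # list of (service_type, proxy, attributes)

    def register(self, service_type, proxy, **attributes):
        """Called by a service when it becomes available."""
        self._entries.append((service_type, proxy, dict(attributes)))

    def lookup(self, service_type, **required):
        """Return (proxy, attributes) for every matching registration.

        A registration matches when its type is equal to service_type and
        each required attribute is present with the requested value.
        """
        return [
            (proxy, attrs)
            for stype, proxy, attrs in self._entries
            if stype == service_type
            and all(attrs.get(k) == v for k, v in required.items())
        ]


def cheapest(matches):
    """Example client-side policy: pick the match with the lowest cost."""
    return min(matches, key=lambda entry: entry[1]["cost"])
```

A client would call lookup("sensor") to obtain every registered sensor proxy, then apply a policy such as cheapest() to the returned attributes before invoking the chosen proxy. Because there is no coordination of replicas in this sketch, resolving multiple matches is left to the client, as in the simple system described above.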
Each of the foregoing has access to a communication network 22 so that it is capable of communicating with at least some of the other members in the distributed computing application. The communication network 22 may be wireless, a local area network, an internal computer bus, a wide area network such as the Internet, a corporate intranet or extranet, a virtual private network, any other communication medium or any combination of the foregoing.
In the prior art example shown in FIG. 1, one client 2 is a traffic monitoring program that notifies a user when and where traffic has occurred and the other client 4 is an automated toll collection program. The services are a clock 10, a road sensor 12 that monitors traffic flow on a highway, a toll booth sensor 14 that detects an ID device in each car that passes through the toll, and a credit card charge program 16. When each service 10, 12, 14, 16 becomes available to the application it registers with the look-up service 20 and provides the look-up service with its service proxy 10a, 12a, 14a, 16a. 
When the traffic monitoring client 2 begins, it queries the look-up service to see if a clock is available and what sensors are available. The look-up service 20 responds by providing the client 2 with the clock proxy 10a, the road sensor proxy 12a and the toll booth sensor proxy 14a. The traffic monitoring client 2 uses the service proxies 10a, 12a, 14a to invoke the clock 10 and the sensors 12, 14, and then to monitor traffic at various times of the day.
Similarly when the toll collector client 4 begins, it queries the look-up service 20 to see if a toll booth sensor 14 and a credit card charge service 16 are available. The look-up service 20 responds by providing the client 4 with the toll booth sensor proxy 14a and the credit card charge proxy 16a. The toll collector client 4 uses the service proxies 14a, 16a, to invoke the toll booth sensor 14 and the credit card charge program 16, and then to identify cars that pass through the toll booth and charge their credit cards for the toll.
A known feature of distributed applications is that services may be grouped. For instance there may be several services capable of performing the traffic sensor functionality. These can be grouped to form a logical notion of traffic sensor that is separate from the particular implementation of the sensors. This may be done for redundancy purposes in case one of the services fails, to provide parallel processing for computationally intensive tasks, to provide extra capacity for peak loads, as well as for many other reasons. Services in a group may communicate with each other to coordinate their activities and states. For instance in the example shown in FIG. 1 it may be advantageous to group the two sensors 12, 14.
There are two primary types of group structures: the coordinator cohort (CC) group and the peer group. In a CC group there is one distinguished member of the group, the coordinator, that processes requests from clients. The coordinator periodically updates the other services in the group, the cohorts, with information about its current state and completed requests, so that if the coordinator fails, the cohort selected to replace it will be as current as possible. The more frequent the updates, the more tightly coupled the states are between group members, and so the more likely the transition will occur without being visible to existing clients of the group. On the other hand, more frequent updates require additional computational capacity and communication bandwidth.
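The tradeoff between update frequency and state coupling in a CC group can be seen in a small simulation. The sketch below is only an assumption-laden illustration: the CounterService, the update_every parameter, and the rule that the first remaining member becomes coordinator are all hypothetical choices, and failure is simulated by simply removing the coordinator.

```python
class CounterService:
    """Trivial stateful service: keeps a running total of requested amounts."""

    def __init__(self):
        self.state = 0

    def handle(self, amount):
        self.state += amount
        return self.state


class CoordinatorCohortGroup:
    """members[0] is the coordinator; the remaining members are cohorts."""

    def __init__(self, members, update_every=1):
        self.members = list(members)
        self.update_every = update_every  # requests between cohort updates
        self._since_update = 0

    def request(self, amount):
        # Only the coordinator processes client requests.
        result = self.members[0].handle(amount)
        self._since_update += 1
        if self._since_update >= self.update_every:
            # Push the coordinator's current state to every cohort.
            for cohort in self.members[1:]:
                cohort.state = self.members[0].state
            self._since_update = 0
        return result

    def fail_coordinator(self):
        # Simulated failure: the next member in line takes over.
        self.members.pop(0)
```

With update_every=1 a replacement coordinator resumes from the exact state of the failed one, so the transition is invisible to clients. With a larger interval, any requests processed since the last update are lost on failure, which is the cost of saving the update traffic.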
In a peer group, all of the members of the group process requests from a client, which itself requires some logic to decide how to use the multiple results returned from the group members. For example, if three thermometers exist in a peer group, and a client requests the temperature, it will receive three answers. Many options exist for using the multiple results, such as taking the first to respond, taking the average value of all the responses, or taking the highest value. A peer group is more robust and fault-tolerant than a CC group because each of the group members should always be in the correct state, and because the likelihood of the representative member (which is all members in a peer group, but only the coordinator in a CC group) being unavailable is drastically lower. However, a peer group also requires more resources, both bandwidth and computational, than a CC group because all of the group members are working and responding to each client request.
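The thermometer example can be sketched as a client-side routine that collects one response per peer and then applies a combining policy. The function names are hypothetical, and the members are queried sequentially here, so "first to respond" is approximated by the first member in the list; a real system would query the peers concurrently.

```python
from statistics import mean


def ask_peer_group(members, combine):
    """Send the same request to every peer and combine the responses.

    members: callables standing in for the peers' service proxies.
    combine: policy that reduces the list of responses to one result.
    """
    responses = [member() for member in members]
    return combine(responses)


def first_response(responses):
    """Stand-in for 'take the first to respond' in this sequential sketch."""
    return responses[0]


# Three thermometers in a peer group, each reporting a slightly different value.
thermometers = [lambda: 20.0, lambda: 21.0, lambda: 22.0]

average_temperature = ask_peer_group(thermometers, mean)
highest_temperature = ask_peer_group(thermometers, max)
first_temperature = ask_peer_group(thermometers, first_response)
```

Note that every call to ask_peer_group() invokes all three thermometers, which reflects the additional computation and bandwidth a peer group consumes relative to a CC group.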
Another technique known in the existing art is leasing. A lease is an important concept throughout distributed computing, generally used between a client and service as a way for the service to indicate its availability to the client for a length of time. At the end of the lease, if the lease is not renewed, there is no guarantee of availability. In a simple example, a service may register with a look-up service and be granted a lease for five minutes. This means that the look-up service will make itself available to the service (i.e., list it) for five minutes. If a camera grants a lease to a client for two minutes, then that client will be able to position, zoom, and take pictures for two minutes. There are a wide variety of ways to handle lease negotiation, renewal and termination which are well known to those skilled in the art of distributed computing and all such methods are meant to be incorporated within the scope of the disclosed invention. A detailed explanation of leases can be found in Jim Waldo, The Jini Specification, 2nd Edition, chapter LE (2001), which is incorporated herein by reference.
One useful aspect of leases is that they can be used for simple failure detection. If the expectation is that a client will continue to request lease renewal from a service, but then does not renew its lease, the service may assume that the client has failed, or is otherwise unavailable. This allows the service to more efficiently manage its own resources, by releasing any that were dedicated to expired clients. Such a use of leasing is described in U.S. Pat. No. 5,832,529 to Wollrath et al.
This is especially important because components only rarely plan and announce their failure and are not able to predict network outages. It is far more common that failures and outages are unexpected, and that the consequence is an inability to announce anything. In these cases, a client will not renew its lease so that eventually, the granting service will reallocate its resources. The shorter the lease period, the sooner a failure can be detected. The tradeoff is that both client and service spend proportionately more time and resources dealing with leasing and that timing anomalies may have implications for correctness.
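Lease-based failure detection can be sketched with a simple table of expiry times kept by the granting service. The sketch is a hypothetical illustration rather than the mechanism of any cited reference: the LeaseTable class and its methods are invented for this example, and the current time is passed in explicitly (in seconds) so that the behavior is deterministic.

```python
class LeaseTable:
    """Tracks which clients currently hold a lease from a service."""

    def __init__(self, duration):
        self.duration = duration  # lease length in seconds
        self._expiry = {}         # client id -> expiry time

    def grant(self, client_id, now):
        """Grant a new lease and return its expiry time."""
        self._expiry[client_id] = now + self.duration
        return self._expiry[client_id]

    def renew(self, client_id, now):
        """Extend an unexpired lease; an expired or unknown lease is refused."""
        if self._expiry.get(client_id, now) <= now:
            raise KeyError(f"no live lease for {client_id!r}")
        return self.grant(client_id, now)

    def reap(self, now):
        """Drop expired leases and return the clients presumed to have failed."""
        expired = [c for c, t in self._expiry.items() if t <= now]
        for client_id in expired:
            del self._expiry[client_id]  # resources can be released here
        return expired

    def holders(self):
        return set(self._expiry)
```

A service would call reap() periodically. A client that stops renewing, whether through failure or a network outage, is detected no later than one lease duration plus one reap interval after its last renewal, which is why shorter leases detect failures sooner at the cost of more renewal traffic.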
Some benefits of distributed computing and mobile code can immediately be seen from this example. First, the clients 2, 4 in FIG. 1 do not need to know ahead of time which sensors 12, 14 are available, or even how many. They simply query the look-up service 20, which provides this information along with the necessary mobile code 12a, 14a to call the sensors. Similarly, the clients 2, 4 do not care which clock 10 is available, as long as any clock 10 is available. Again, this is because through the use of mobile code, a client 2, 4 is provided with the necessary service proxy 10a to invoke and work with the clock 10. Also, the failure or unavailability of a single sensor 12, 14 or other service is not likely to cause the entire application to stop running. Further, the processing load is distributed among a number of computing devices. Also, the various computing entities need not use the same operating system, so long as they conform to a common interface standard.
Jini is one example of a commercially available specification for a distributed object infrastructure (or middleware) for more easily writing, executing and managing object-oriented distributed applications. Jini was developed by Sun Microsystems and is based on the Java programming language; consequently, objects in a Jini system are mobile. Jini is described in Jim Waldo, The Jini Specification, 2nd Edition (2001). The Common Object Request Broker Architecture (CORBA), developed by the Object Management Group, and Distributed Component Object Model (DCOM), developed by Microsoft Corporation, are two other commercially available examples that are well known in the prior art. Jini, DCOM, CORBA and a number of other distributed computing specifications are described by Benchiao Jai et al., Effortless Software Interoperability with Jini Connection Technology, Bell Labs Technical Journal, April-June 2000, pp. 88-101, which is hereby incorporated by reference.
Distributed computing systems with groups can also be found in the prior art, particularly in the academic literature. For example, Ozalp Babaoglu et al., Partitionable Group Membership: Specification and Algorithms, University of Bologna, Department of Computer Science, Technical Report UBLCS-97-1, describes groups, but assumes the services in the group are group-aware. Similarly, static group proxies, or software wrappers, for clients have been described in Alberto Montresor et al., Enhancing Jini with Group Communication, University of Bologna, Department of Computer Science, Technical Report UBLCS-2000-16, but these group proxies cannot be modified during execution of the distributed application to accommodate changes in group make-up and structure.
A number of problems can be found in these and other implementations and putative descriptions of distributed applications. Chief among these is that, even if some notion of groups is available within the infrastructure, both services and clients need to be group-aware; that is, they need to contain logic to interact either within and as part of a group (in the case of grouped services), or with a group (in the case of clients of a group of services). This logic is very complex and the skill set required to write such software is very different from the skills required to write the underlying client or service. Further, many existing clients and services exist that do not have group logic, and even for clients and services that are being newly written it can be challenging to write this logic as part of the module. Even if group logic is coded into new clients or services, they become locked into a particular instance and type of group and in most cases will need to be rewritten if the group architecture or makeup changes. Therefore it is desirable to develop a methodology wherein the group-aware logic for clients and services is provided in separate code modules. Existing and previously described attempts at group services have always assumed that both the services to be grouped and the clients using group services are group-aware. The assumption of group-awareness prevents existing, or legacy, software from being able to take advantage of the benefits of groups (unless it is rewritten) and burdens new applications with providing the necessary group logic to operate with the particular implementation of the group service. If wrappers were considered for grouping legacy services, they were static and hard-coded, locking the service into a single framework.
Moreover, static wrappers introduce an additional, distinct point in the computation, with negative performance and, ironically, fault tolerance implications, since such solutions can never operate in the same process space. In all frameworks, group structures were static and therefore did not permit transitions between group structures.
All previous frameworks also ignored clients. Further, even if clients are written to be group-aware, they must be group-aware in the very particular way that the group of services are implemented. For example, if a client is capable of delaying its requests during membership changes to a group of services, until it receives a signal informing it that the membership change has completed, then it cannot interact with a system in which groups send no such signal, but instead expect the client to poll for this information. Therefore it would be preferable for this logic to be provided at run time when the groups are established.
A major problem with current distributed computing methodologies that support groups is that changes to the group's membership are invasive; that is, services within a group cannot be changed without temporarily halting the availability of the application. Further, in current systems, if a service, whether grouped or single, is unavailable, the client is burdened with handling this unavailability; if it does not, the client may wait indefinitely, take incorrect steps, or even crash. This is true, even in simple redundant backup systems, where the client must handle any delays caused by the switch from a primary to a backup service. Another limitation of current approaches is that group structure, CC, peer or otherwise, is not modifiable without also stopping and then restarting the application, again leaving existing clients in the lurch. Fluid group structure transitions could be used to increase or decrease quality of service properties such as load-balancing or fault tolerance, and to simplify peer group operation when the service code calls for external interaction.
Thus, it is desirable to have a distributed application in which new services can be added, or services in a group restructured, “on the fly”, that is without halting other members of the application.
It is therefore an object of this invention to provide a method for transparently managing and interacting with groups of services in a distributed application in which groups are dynamic in their membership, organizational structure, and their members' individual functionality.
It is a further object of this invention to provide a method of handling transitions in a group of services that does not burden the client.
It is a further object of this invention to provide a method for grouping services wherein a group of services can simultaneously be arranged in multiple group modes.
It is a further object of this invention to provide a method of grouping services in which the group-aware logic is provided in separate code modules from the core functional logic of the clients and services.
It is a further object of this invention to provide a method of grouping services in which the code modules that handle the group-aware logic are highly reusable from one application to the next.
It is a further object of the invention to provide for a method of grouping services where services can be added or removed, and groups restructured, during operation, yet without interrupting execution.