Many businesses are demanding faster, less expensive, and more reliable computing platforms. Brokerage houses, credit card processors, telecommunications firms, and banks are a few examples of organizations that require tremendous computing power to handle countless small, independent transactions. Currently, organizations that require these systems operate and maintain substantial servers. Further, the cost associated with these machines stems not only from the significant initial capital investment, but also from the continuing expense of a sizeable labor force dedicated to maintenance.
When it comes to mission-critical computing, businesses and other organizations face increasing pressure to do more with less. On one hand, they must manage larger transaction volumes, larger user populations, and larger data sets. They must do all of this in an environment that demands a renewed appreciation for the importance of reliability, fault tolerance, and disaster recovery. On the other hand, they must satisfy these growing requirements in a world of constrained resources. It is no longer an option to just throw large amounts of expensive hardware, and armies of expensive people, at problems. The challenge businesses face is that, when it comes to platforms for mission-critical computing, the world is fragmented. Different platforms are designed to satisfy different sets of requirements. As a result, businesses must choose between, and trade off, equally important factors.
Currently, when it comes to developing, deploying, and executing mission-critical applications, businesses and other organizations can choose between five alternative platforms. These are mainframes, high-availability computers, UNIX-based servers, distributed supercomputers, and PC's. Each of these approaches has strengths and weaknesses, advantages and disadvantages.
The first, and oldest, solution to the problem of mission-critical computing was the mainframe. Mainframes dominated the early days of computing because they delivered both availability and predictability. Mainframes deliver availability because they are located in extremely controlled physical environments and are supported by large cadres of dedicated, highly-trained people. This helps to ensure they do not fall victim to certain types of problems. However, because they are typically single-box machines, mainframes remain vulnerable to single-point failures. Mainframes deliver predictability because it is possible to monitor the execution and completion of processes and transactions and restart any that fail. However, the limitation of mainframes is that all monitoring code must be understood, written, and/or maintained by the application developer. The problem mainframes run into is that such systems fall short when it comes to three factors of high importance to businesses. First, mainframes tend not to offer high degrees of scalability. The only way to significantly increase the capability of such a system is to buy a new one. Second, because of their demanding nature, mainframes rely on armies of highly-trained support personnel and custom hardware. As a result, mainframes typically are neither affordable nor maintainable.
Developed to address the limitations and vulnerabilities of mainframes, high-availability computers are able to offer levels of availability and predictability that are equivalent to, and often superior to, those of mainframes. High-availability computers deliver availability because they use hardware or software-based approaches to ensure high levels of survivability. However, this availability is only relative because such systems are typically made up of a limited number of components. High-availability computers also deliver predictability because they offer transaction processing and monitoring capabilities. However, as with mainframes, that monitoring code must be understood, written, and/or maintained by the application developer. The problem with high-availability computers is that they have many of the same shortcomings as mainframes. That means that they fall short when it comes to delivering scalability, affordability, and maintainability. First, they are largely designed to function as single-box systems and thus offer only limited levels of scalability. Second, because they are built using custom components, high-availability computers tend not to be either affordable or maintainable.
UNIX-based servers are scalable, available, and predictable but are expensive both to acquire and to maintain. Distributed supercomputers, while delivering significant degrees of scalability and affordability, fall short when it comes to availability. PC's are both affordable and maintainable, but do not meet the needs of businesses and other organizations when it comes to scalability, availability, and predictability. The 1990s saw the rise of the UNIX-based server as an alternative to mainframes and high-availability computers. These systems have grown in popularity because, in addition to delivering availability and predictability, they also deliver significant levels of scalability. UNIX-based servers deliver degrees of scalability because it is possible to add new machines to a cluster and receive increases in processing power. They also deliver availability because they are typically implemented as clusters and thus can survive the failure of any individual node. Finally, UNIX-based servers deliver some degree of predictability. However, developing this functionality can require significant amounts of custom development work.
One problem that UNIX-based servers run into, and the thing that has limited their adoption, is that this functionality comes at a steep price. Because they must be developed and maintained by people with highly specialized skills, they fall short when it comes to affordability and maintainability. For one thing, while it is theoretically possible to build a UNIX-based server using inexpensive machines, most are still implemented using small numbers of very expensive boxes. This makes upgrading a UNIX-based server an expensive and time-consuming process that must be performed by highly-skilled (and scarce) experts. Another limitation of UNIX-based servers is that developing applications for them typically requires a significant amount of effort. This requires application developers to be experts in both the UNIX environment and the domain at hand. Needless to say, such people can be hard to find and are typically quite expensive. Finally, setting up, expanding, and maintaining a UNIX-based server requires a significant amount of effort on the part of a person intimately familiar with the workings of the operating system. This reflects the fact that most were developed in the world of academia (where graduate students are plentiful). However, this can create significant issues for organizations that do not have such plentiful supplies of cheap, highly-skilled labor.
A recent development in the world of mission-critical computing is the distributed supercomputer (also known as a Network of Workstations or “NOW”). A distributed supercomputer is a computer that works by breaking large problems up into a set of smaller ones that can be spread across many small computers, solved independently, and then brought back together. Distributed supercomputers were created by academic and research institutions to harness the power of idle PC and other computing resources. This model was then adapted to the business world, with the goal being to make use of underused desktop computing resources. The most famous distributed supercomputing application was created by the SETI@home project. Distributed supercomputers have grown in popularity because they offer both scalability and affordability. Distributed supercomputers deliver some degree of scalability because adding an additional resource to the pool usually yields a linear increase in processing power. However, that scalability is limited by the fact that communication with each node takes place over the common organizational network and can become bogged down. Distributed supercomputers are also relatively more affordable than other alternatives because they take advantage of existing processing resources, be they servers or desktop PC's.
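The split, solve, and recombine pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (split, solve_distributed) are hypothetical, and a local thread pool stands in for the pool of networked machines, so it shows the control flow rather than any real deployment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split(items: Sequence[T], parts: int) -> List[Sequence[T]]:
    """Break one large problem into `parts` smaller, independent pieces."""
    return [items[i::parts] for i in range(parts)]


def solve_distributed(
    items: Sequence[T],
    solve_chunk: Callable[[Sequence[T]], R],
    combine: Callable[[List[R]], R],
    workers: int = 4,
) -> R:
    """Spread the pieces across workers, then bring the results back together.

    Assumption: each worker thread stands in for one small computer.
    """
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(solve_chunk, chunks))
    return combine(partials)


def sum_of_squares(chunk: Sequence[int]) -> int:
    """Example per-node task: each node handles its own piece independently."""
    return sum(x * x for x in chunk)
```

A call such as solve_distributed(range(1000), sum_of_squares, sum) gives the same answer as computing the sum of squares on one machine. The sketch ignores the network contention that limits scalability in practice.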
One problem distributed supercomputers run into is that they fall short when it comes to availability, predictability, and maintainability. Distributed supercomputers have problems delivering availability and predictability because they are typically designed to take advantage of non-dedicated resources. The problem is that it is impossible to deliver availability and predictability when someone else has primary control of the resource and your application is simply completing its work when it gets the chance. This makes distributed supercomputers appropriate for some forms of off-peak processing but not for time-sensitive or mission-critical computing. Finally, setting up, expanding, and maintaining a distributed supercomputer also requires a significant amount of effort because they tend to offer more of a set of concepts than a set of tools. As a result, they require significant amounts of custom coding. Again, this reflects the fact that most were developed in the world of academia where highly trained labor is both cheap and plentiful.
PC's are another option for creating mission-critical applications. PC's have two clear advantages relative to other solutions. First, PC's are highly affordable. The relentless progress of Moore's law means that increasingly powerful PC's can be acquired for lower and lower prices. The other advantage of PC's is that prices have fallen to such a degree that many people have begun to regard PC's as disposable. Given how fast the technology is progressing, in many cases it makes more sense to replace a PC than to repair it. Of course, the problem with PC's is that they do not satisfy the needs of businesses and other organizations when it comes to scalability, availability, and predictability. First, because PC's were designed to operate as stand-alone machines, they are not inherently scalable. Instead, the only way to allow them to scale is to link them together into clusters. That can be a very time-consuming process. Second, PC's, because they were designed for use by individuals, were not designed to deliver high levels of availability. As a result, the only way to make a single PC highly available is through the use of expensive, custom components. Finally, PC's were not designed to handle transaction processing and thus do not have any provisions for delivering predictability. The only way to deliver this functionality is to implement it using the operating system or an application server. The result is that few organizations even consider using PC's for mission-critical computing.
In a dynamic environment, it is important to be able to find available services. Service Location Protocol, RFC 2165, June 1997, provides one such mechanism. The Service Location Protocol provides a scalable framework for the discovery and selection of network services. Using this protocol, computers using the Internet no longer need so much static configuration of network services for network based applications. This is especially important as computers become more portable, and users less tolerant or able to fulfill the demands of network system administration. The basic operation in Service Location is that a client attempts to discover the location of a Service. In smaller installations, each service will be configured to respond individually to each client. In larger installations, services will register their services with one or more Directory Agents, and clients will contact the Directory Agent to fulfill requests for Service Location information. Clients may discover the whereabouts of a Directory Agent by preconfiguration, DHCP, or by issuing queries to the Directory Agent Discovery multicast address.
The following describes the operations a User Agent would employ to find services on the site's network. The User Agent needs no configuration to begin network interaction. The User Agent can acquire information to construct predicates which describe the services that match the user's needs. The User Agent may build on the information received in earlier network requests to find the Service Agents advertising service information.
A User Agent will operate in two ways. First, if the User Agent has already obtained the location of a Directory Agent, the User Agent will unicast a request to it in order to resolve a particular request. The Directory Agent will unicast a reply to the User Agent. The User Agent will retry a request to a Directory Agent until it gets a reply, so if the Directory Agent cannot service the request (say it has no information) it must return a response with zero values, possibly with an error code set.
Second, if the User Agent does not have knowledge of a Directory Agent or if there are no Directory Agents available on the site network, a second mode of discovery may be used. The User Agent multicasts a request to the service-specific multicast address, to which the service it wishes to locate will respond. All the Service Agents which are listening to this multicast address will respond, provided they can satisfy the User Agent's request. A similar mechanism is used for Directory Agent discovery. Service Agents which have no information for the User Agent MUST NOT respond.
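The two modes of User Agent operation described above can be sketched as follows. This is a minimal illustration, not the RFC 2165 wire protocol: the Reply type, the locate_service function, and the callable transports are hypothetical stand-ins, and the bounded retry count and the fallback to multicast when a Directory Agent never answers are assumptions of the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Reply:
    """Hypothetical reply: zero or more service URLs plus an error code."""
    urls: List[str] = field(default_factory=list)
    error_code: int = 0


# Unicast transport: returns a Reply, or None if nothing arrived in time.
UnicastSend = Callable[[str], Optional[Reply]]
# Multicast transport: returns one Reply per Service Agent that answered.
MulticastSend = Callable[[str], List[Reply]]


def locate_service(
    request: str,
    unicast_to_da: Optional[UnicastSend],
    multicast_to_sas: MulticastSend,
    max_retries: int = 3,
) -> List[str]:
    """Resolve a request via a known Directory Agent, else via multicast."""
    if unicast_to_da is not None:
        # Mode 1: a Directory Agent is known; unicast and retry until it replies.
        for _ in range(max_retries):
            reply = unicast_to_da(request)
            if reply is not None:
                # A reply with zero values is still a reply and ends the retries.
                return list(reply.urls)
        # Assumption: a Directory Agent that never answers counts as unavailable.
    # Mode 2: multicast to the service-specific address. Service Agents with
    # nothing to offer stay silent, so only useful replies come back.
    urls: List[str] = []
    for reply in multicast_to_sas(request):
        urls.extend(reply.urls)
    return urls
```

Passing None for unicast_to_da models a User Agent that has no knowledge of a Directory Agent. The empty-reply case shows why the Directory Agent must answer even when it has no information: without that answer the User Agent would keep retrying.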
While the multicast/convergence model may be important for discovering services (such as Directory Agents) it is the exception rather than the rule. Once a User Agent knows of the location of a Directory Agent, it will use a unicast request/response transaction. The Service Agent SHOULD listen for multicast requests on the service-specific multicast address, and MUST register with an available Directory Agent. This Directory Agent will resolve requests from User Agents which are unicast using TCP or UDP. This means that a Directory Agent must first be discovered, using DHCP, the DA Discovery Multicast address, the multicast mechanism described above, or manual configuration. If the service is to become unavailable, it should be deregistered with the Directory Agent. The Directory Agent responds with an acknowledgment to either a registration or deregistration. Service Registrations include a lifetime, and will eventually expire. Service Registrations need to be refreshed by the Service Agent before their Lifetime runs out. If need be, Service Agents can advertise signed URLs to prove that they are authorized to provide the service.
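The registration lifecycle described above (register, acknowledge, refresh before the lifetime runs out, deregister) can be illustrated with a small in-memory registry. This is a sketch under stated assumptions: the class DirectoryAgentRegistry and its method names are hypothetical, the acknowledgment is reduced to a boolean, and authentication of signed URLs is omitted.

```python
import time
from typing import Callable, Dict, List, Tuple


class DirectoryAgentRegistry:
    """Hypothetical in-memory model of a Directory Agent's registrations."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Maps service URL -> (service type, absolute expiry time).
        self._entries: Dict[str, Tuple[str, float]] = {}

    def register(self, service_type: str, url: str, lifetime_s: float) -> bool:
        """Add or refresh a registration; the return value is the acknowledgment."""
        self._entries[url] = (service_type, self._clock() + lifetime_s)
        return True

    def deregister(self, url: str) -> bool:
        """Remove a registration; the return value is the acknowledgment."""
        self._entries.pop(url, None)
        return True

    def lookup(self, service_type: str) -> List[str]:
        """Return URLs of the given type whose lifetime has not run out."""
        now = self._clock()
        # Drop registrations that were not refreshed in time.
        self._entries = {
            url: (stype, expiry)
            for url, (stype, expiry) in self._entries.items()
            if expiry > now
        }
        return [url for url, (stype, _) in self._entries.items() if stype == service_type]
```

A Service Agent would call register again before its lifetime elapses, which simply overwrites the expiry time. The clock is injectable so that expiry can be exercised without waiting in real time.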
New mechanisms for computing are desired, especially those which may provide a reliable computing framework and platform, including, but not limited to, those which might produce improved levels of performance and reliability at a much lower cost than that of other solutions.
In addition to performing computations or transactions, many applications require reliable storage of information for some period of time beyond that required for the processing of the computation or transaction. Examples of such applications and/or their data are email, search engines, news feeds, databases, inventory, transaction records, file systems, and images. The storage requirements of these applications might be small to very large, and static or dynamic in size. Prior practical data storage systems use the Reed-Solomon code to protect stored information; however, the computational overhead of the Reed-Solomon code is large. Thus, practical storage systems seldom use a general (n, k) Maximum Distance Separable code, except for full replication or mirroring (which is an (n, 1) code), striping without redundancy (corresponding to (n, n)), or single parity (which is (n, n−1)). The advantages of using a general (n, k) code are hence very limited if not totally lost.
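To make the special cases above concrete, the following sketch implements the single parity case, an (n, n−1) code built from byte-wise XOR, and computes the storage overhead (n/k) and the number of lost blocks an (n, k) Maximum Distance Separable code can tolerate (n−k). The function names are hypothetical and chosen for illustration; this is not a Reed-Solomon implementation, and it assumes all blocks have equal length.

```python
from functools import reduce
from typing import List, Optional, Tuple


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_single_parity(data_blocks: List[bytes]) -> List[bytes]:
    """(n, n-1) code: the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover_single_parity(stored: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one lost block (marked None) and return the data blocks."""
    missing = [i for i, block in enumerate(stored) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost block")
    blocks = list(stored)
    if missing:
        # XOR of all surviving blocks equals the one that was lost.
        survivors = [block for block in stored if block is not None]
        blocks[missing[0]] = xor_blocks(survivors)
    return blocks[:-1]  # drop the parity block


def mds_properties(n: int, k: int) -> Tuple[float, int]:
    """Return (storage overhead n/k, tolerated lost blocks n-k) for an (n, k) MDS code."""
    return n / k, n - k
```

For example, mds_properties(3, 1) describes three-way mirroring (triple the storage, two losses tolerated), while mds_properties(5, 5) describes striping without redundancy (no extra storage, no losses tolerated). Single parity sits between the two at n/(n−1) overhead with one loss tolerated.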