As computer applications and systems have become more complicated, the number of individual components that make up an application or system has increased dramatically. As the number and complexity of individual components increase, there arises a need to run these components across multiple computers for various reasons, including distributing processing power, increasing reliability (by having multiple instances of a given component running in different physical locations), and allowing access into the system by various users and external systems on multiple computers and in geographically dispersed locations.
As the number of components in a system, and of computers running those components, increases, it becomes increasingly difficult to add new components, reliably improve existing components, and effectively manage an operational system. In many complex systems, various aspects of a system may have been written by numerous programmers, with a commensurate number of differing communication paradigms between various components of the system. Adding a new component into the system that must communicate with other existing components, or simply extending the capabilities of an existing component, often means a programmer must discover all the communication paradigms and data formats in use in the existing system, code against the multiple schemes in use, and rely on a generally poorly understood view of the system as background for how the new or improved component will operate in the existing system. The effects of the changes are often not well understood, and quite often their impact is not even recognized until the changes fail after being placed into a live system.
Presently, numerous techniques exist for communicating between components within a system, including the following subset of generally established methodologies: database storage, publish/subscribe messaging, and remote procedure call (or request/reply) frameworks. A high-performance, widely distributed system will likely have some form of all three of these communication methodologies in place, often in redundant forms (multiple publish/subscribe messaging layers, for example, may coexist in a single system).
Relational databases allow for the persisted storage of large volumes of regular data. Typically, data are stored in tables, with each table made up of a well-defined row/column format. Each row in the table is a group of related data (such as a mailing address), and each column in the table is a property of the group of data (such as the street name of the mailing address). Databases typically provide complex querying capabilities, allowing a user to write a set of filtering criteria and retrieve the data in the database that match those criteria. Multiple applications may connect to a single database simultaneously; most databases provide mechanisms to ensure that two applications writing the same rows of data do not conflict with each other and that both see consistent results. These mechanisms allow a database to provide for sharing of data between applications, as one application may write a database row that another application may later query. Using a database for sharing of data generally requires each application interested in receiving data from another application to continuously query the database to see whether new data have been inserted into a table. Because of this, a database is generally referred to as a “pull” based technology that requires a consumer of data to actively query the database.
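By way of illustration only, the pull-based sharing described above might be sketched as follows, using Python's standard sqlite3 module with an in-memory database. The table name addresses, its columns, and the helper functions make_db, write_address, and poll_new_rows are hypothetical names chosen for this sketch and do not come from any particular system.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE addresses ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, street TEXT, city TEXT)"
    )
    return conn


def write_address(conn, street, city):
    """Producer side: insert one row (a group of related data)."""
    conn.execute(
        "INSERT INTO addresses (street, city) VALUES (?, ?)", (street, city)
    )
    conn.commit()


def poll_new_rows(conn, last_seen_id):
    """Consumer side: actively query for rows newer than the last one seen.

    Returns the new rows and the updated high-water mark. The consumer must
    call this repeatedly; the database never notifies it of new data.
    """
    rows = conn.execute(
        "SELECT id, street, city FROM addresses WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_last_seen = rows[-1][0] if rows else last_seen_id
    return rows, new_last_seen
```

In this sketch a consumer that polls before any row has been written receives an empty result and must poll again later, which is the overhead that characterizes a pull-based technology.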
Publish/subscribe messaging allows for more efficient high-frequency communication between components within a system. Generally, publish/subscribe messaging works by producers of data (publishers) generating messages that are sent out over the messaging layer. These messages often consist of property/value pairs that contain the name of a property of the data and its corresponding value. Consumers of data, or subscribers, register a subscription indicating an interest in a given set of data, often by writing criteria that indicate fields that must carry certain values. If an incoming message from a producer has the fields the subscriber is interested in, with the correct values, the subscriber will receive the message and can act on it accordingly. Typically, publish/subscribe data is not persisted over time by a messaging layer in the way that data in a database is. If an application registers an interest in a set of published data, it will only receive new messages meeting those criteria. Unlike with a database, past data cannot be retrieved through a query mechanism. Publish/subscribe can send a message from a single producer to a single consumer or from a producer to a large number of consumers. Often, a publish/subscribe messaging system is optimized such that only one message is sent on a network from a publisher to each subscriber that has registered an interest in that message. Because publish/subscribe allows a publisher to send data directly to interested subscribers, it is often referred to as a “push” technology that allows consumers to passively listen to data they are interested in. Publish/subscribe technologies are used when a producer of data has a set of updating data that a large number of consumers might be interested in, homogeneously, each time the data changes.
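As a minimal sketch of the push-based model described above, the following in-process example matches messages made of property/value pairs against subscriber criteria. The Broker class and its subscribe and publish methods are hypothetical names used only for illustration; a real messaging layer would deliver messages over a network rather than by direct function call.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, object]          # property/value pairs
Handler = Callable[[Message], None]  # invoked when a message matches


class Broker:
    """Hypothetical in-process messaging layer; nothing is persisted."""

    def __init__(self) -> None:
        self._subscriptions: List[Tuple[Message, Handler]] = []

    def subscribe(self, criteria: Message, handler: Handler) -> None:
        """Register interest in messages whose fields carry these values."""
        self._subscriptions.append((criteria, handler))

    def publish(self, message: Message) -> int:
        """Push the message to every matching subscriber.

        Returns the number of subscribers the message was delivered to.
        """
        delivered = 0
        for criteria, handler in self._subscriptions:
            matches = all(
                field in message and message[field] == value
                for field, value in criteria.items()
            )
            if matches:
                handler(message)
                delivered += 1
        return delivered
```

Because this sketch keeps no history, a message published before a subscription is registered is never delivered to that subscriber, which mirrors the absence of a query mechanism for past data.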
Remote procedure call or request/reply infrastructures provide a direct application-to-application communication mechanism like publish/subscribe, but are pull-based technologies. A consumer interested in a set of data may send a message to a producer that can provide that data, which in turn will reply with a message containing the relevant data. These technologies are often implemented when computing a set of data can only happen based on parameterized inputs known only by the consumer of the data, when the data production is a very expensive operation, when a set of data will be infrequently asked for, or when the sheer number of consumers interested in varying sets of data makes calculating all of the data continuously prohibitively expensive. Instead of continuously generating output data, the producer must generate its output data only when requested by a consumer.
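The request/reply pattern described above might be sketched, under simplifying assumptions, as a producer that computes a result only when a consumer supplies the parameters. The PriceService class, its handle_request method, the request_price helper, and the request fields unit_price, quantity, and discount are all hypothetical; the call is made in-process here, whereas an actual framework would carry the request and reply messages between applications.

```python
from typing import Dict


class PriceService:
    """Hypothetical producer that computes output only on request."""

    def __init__(self) -> None:
        self.calculations = 0  # counts how often work was actually done

    def handle_request(self, request: Dict[str, float]) -> Dict[str, float]:
        """Compute a reply from parameters known only to the consumer."""
        self.calculations += 1
        total = (
            request["unit_price"]
            * request["quantity"]
            * (1.0 - request["discount"])
        )
        return {"total": round(total, 2)}


def request_price(service: PriceService, unit_price: float,
                  quantity: float, discount: float = 0.0) -> float:
    """Consumer side: send a request message and unpack the reply."""
    request = {
        "unit_price": unit_price,
        "quantity": quantity,
        "discount": discount,
    }
    reply = service.handle_request(request)
    return reply["total"]
```

In this sketch the calculations counter stays at zero until a consumer asks for a price, illustrating that no output is generated unless it is requested.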
In a typical, large-scale, distributed environment, all three technologies are required. A database is needed to store data that must be accessible across time and that can be backed up in case of failure. Databases can be too large to completely fit within a single application at one time. A publish/subscribe mechanism is needed to broadcast rapidly changing data that many applications within the system are interested in receiving (“hearing”), in real time, without requiring the overhead and delay of every application constantly querying an inherently slow database technology. A request/reply mechanism is needed to distribute calculations within an application across different machines or to provide shared and abstracted server resource capabilities to a variety of applications. For example, a single server may provide a price calculation service through a request/reply mechanism to user-facing graphical interface applications, to a web server, and to stand-alone processing servers.
Missing from the art is a unifying data system where data is persisted over time across a distributed computer system. The present invention can satisfy these and other needs.