In the field of computer science, a distributed system is a collection of autonomous processing nodes (e.g., physical or virtual computers) that act in concert to achieve a common computing goal or purpose. Each node of a distributed system executes a component of a software application, referred to as a distributed application, which exchanges messages with other components of the distributed application executing on other nodes. Through this message-passing process, the nodes can interact with each other and coordinate their actions. Examples of well-known distributed systems include hosted service platforms (e.g., software-as-a-service, infrastructure-as-a-service, etc.), distributed databases, peer-to-peer content delivery networks, and the like.
One aspect of managing a distributed system involves upgrading the system (in other words, updating the distributed application software running on each node of the system) on a periodic basis. In environments where the availability of the distributed system is an important consideration, upgrades are typically performed using a “rolling” approach in which the nodes of the system are brought offline and updated in phases (rather than all at once). This approach ensures that there are always some live (i.e., online) nodes in the distributed system for carrying out application processing.
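By way of illustration only, the phased nature of a rolling upgrade can be sketched as follows. The `Node` class, the `rolling_upgrade` function, and the `batch_size` parameter are hypothetical names introduced for this sketch and do not describe any particular upgrade tool. The sketch assumes that each phase takes a fixed-size batch of nodes offline, updates them, and brings them back online before the next phase begins.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of one node in a distributed system."""
    name: str
    version: int = 1
    online: bool = True


def rolling_upgrade(nodes, new_version, batch_size=1):
    """Upgrade nodes in phases and return the live-node count seen in each phase."""
    if batch_size < 1 or batch_size >= len(nodes):
        # Taking every node offline at once would not be a rolling upgrade.
        raise ValueError("batch_size must be at least 1 and smaller than the node count")

    live_counts = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]

        # Phase begins: only this batch goes offline.
        for node in batch:
            node.online = False
        live_counts.append(sum(n.online for n in nodes))

        # Update the batch, then return it to service. From this point until
        # the final phase, live nodes run a mix of old and new versions.
        for node in batch:
            node.version = new_version
            node.online = True

    return live_counts


nodes = [Node(f"node-{i}") for i in range(5)]
print(rolling_upgrade(nodes, new_version=2, batch_size=2))
```

In this sketch the live-node count never reaches zero, which corresponds to the availability property described above.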
Because a rolling upgrade is an incremental process, there will generally be a window of time during such an upgrade in which some live nodes are running the new (i.e., upgraded) version of the distributed application while other live nodes are concurrently running the old (i.e., non-upgraded) version of the distributed application. Although this version mismatch between nodes may not cause any complications if the upgrade involves minor changes to the application's internal logic, it can be problematic if the upgrade is a “data upgrade,” and thus includes changes to any of the message data formats used for inter-node communication. In the latter case, a first node executing the old version of the distributed application may not be able to decipher upgraded messages sent by a second node executing the new version of the distributed application, since the first node has not yet been updated with the appropriate application code for recognizing and parsing those upgraded messages.
To mitigate this issue, it is possible to mark certain data fields of a message as optional, implement code for automatically ignoring unknown data, and provide default values for data fields. These techniques handle some important use cases, such as adding a new data field (a node running the old version simply ignores the field it does not recognize) or deleting an existing data field (a node that still expects the field falls back to its default value).
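As a minimal sketch of these techniques, the following decoder ignores unknown fields and substitutes defaults for missing optional fields. It assumes JSON-encoded messages purely for illustration; the field names (`id`, `payload`, `priority`), the `REQUIRED` sentinel, and the `decode` function are hypothetical and are not drawn from any particular distributed application.

```python
import json

# Sentinel marking a field that has no default and must be present.
REQUIRED = object()

# Hypothetical schema known to this application version: field -> default.
KNOWN_FIELDS = {
    "id": REQUIRED,   # required field
    "payload": "",    # optional field with a default value
    "priority": 0,    # optional field with a default value
}


def decode(raw):
    """Decode a JSON message, ignoring unknown fields and applying defaults."""
    data = json.loads(raw)
    message = {}
    for field, default in KNOWN_FIELDS.items():
        if field in data:
            message[field] = data[field]
        elif default is REQUIRED:
            raise ValueError(f"missing required field: {field}")
        else:
            message[field] = default
    # Any field in `data` that is not listed in KNOWN_FIELDS is dropped.
    return message


# A newer sender added "ttl" and omitted "priority"; decoding still succeeds.
print(decode('{"id": 7, "payload": "x", "ttl": 30}'))
```

Note that this approach covers added and removed fields only. It does not help when the type or format of a field that both versions know about is changed.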
However, other types of data format changes are not as easily addressed by conventional techniques. For example, if the type or format of an existing data field in a message is modified (e.g., as an optimization to reduce overhead), the new version of the distributed application must generally write or transmit the modified message in both the old and new data formats so that it can be understood by older application versions. This, in turn, leads to higher overhead and worse performance until the next system upgrade (and thus defeats the purpose of the optimization, if that was the reason for the change). Further, while this type of temporary inefficiency may be tolerable for in-house distributed systems where the application developers can make assumptions about the speed of the upgrade cycle, it is less acceptable for distributed systems that are deployed and maintained externally at customer premises (i.e., “remote” distributed systems). With remote distributed systems, upgrades that introduce such inefficiencies are generally disallowed, since it is difficult to predict when the customer will next be willing to upgrade.
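The overhead of writing a message in both formats can be illustrated with a small sketch. It assumes, hypothetically, that an old version identifies a node by a 36-character UUID string in a `node` field, and that a new version replaces this with a compact integer in a `node_v2` field. The field names, the JSON encoding, and all function names below are assumptions made for illustration only.

```python
import json


def encode_old(node_uuid):
    """Old format: node identified by a UUID string."""
    return json.dumps({"node": node_uuid}).encode()


def encode_new(node_index):
    """New format: node identified by a compact integer."""
    return json.dumps({"node_v2": node_index}).encode()


def encode_dual(node_uuid, node_index):
    """Transitional format: both representations are written in one message."""
    return json.dumps({"node": node_uuid, "node_v2": node_index}).encode()


def decode_old(raw):
    """Old-version reader: only understands the 'node' field."""
    return json.loads(raw)["node"]


uuid_str = "123e4567-e89b-12d3-a456-426614174000"

# The old reader can parse the dual-format message but not the new-only one.
assert decode_old(encode_dual(uuid_str, 5)) == uuid_str

# The dual-format message is larger than either single-format message, so the
# size reduction intended by the new format is lost while both are written.
print(len(encode_new(5)), len(encode_old(uuid_str)), len(encode_dual(uuid_str, 5)))
```

Under these assumptions, the dual-format message is the largest of the three, which corresponds to the temporary inefficiency described above.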