The approaches described in this section are approaches that are known to the inventors and could be pursued. They are not necessarily approaches that have been pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section, or that those approaches are known to a person of ordinary skill in the art.
In computer software development, source code management systems (also called revision control or version control systems) are used to track and manage computer program source code as the code is written and revised. For readability, the acronym “SCM” will be used in place of “source code management”. Although SCM systems are predominantly used to track source code, they can be used to track other data.
Examples of SCM systems include MERCURIAL, GIT, and BAZAAR. Generally speaking, SCM systems store data (typically source code) in repositories and facilitate access to that data from multiple different client systems. In order to work on a project, a user (using a client system) creates a local copy of the relevant data (e.g. program source code) from a repository and works on that local copy. If the user makes changes that are to be incorporated into the remote version of the data, the user's local copy of the data, or at least those portions that have been changed, is written back to the repository using the SCM system. The SCM system controls access to the repository data and also manages version control for the data.
A clustered SCM system is a SCM system that appears to users as a single SCM system, storing data in repositories and allowing data to be copied locally and changes to be written back, but is actually implemented on multiple physical compute nodes connected by a fast network. By dividing workload between the compute nodes in the cluster, a clustered SCM system can serve more users simultaneously than an unclustered SCM system implemented on a single physical compute node before encountering resource constraints and performance degradation.
Various approaches to making repository data available to clients in clustered SCM systems are used.
For example, in one approach, data repositories managed by a SCM system are sharded over multiple file server nodes. SCM processes to read/write/manage data in a given repository are executed remotely, via a Secure Shell (SSH) connection, on the file server node that stores the relevant repository data. In this approach each server node must specialize as a “front end” or “file server” node for the data it stores. This limits the ability to balance workloads, as every SCM process must run on the file server node that stores the relevant data. If certain data is in high demand, the file server(s) serving that data can come under much greater load than other file servers. Furthermore, sharding SCM repositories over multiple file server nodes can complicate backup/maintenance procedures compared, for example, to an arrangement in which a single file server is used. In addition, where repositories are sharded over multiple file servers, any operation/request that involves accessing more than one repository becomes difficult if not impossible.
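By way of illustration only, the workload imbalance described above can be sketched as follows. The sketch assumes a simple hash-based assignment of repositories to file server nodes; the node names, repository names, and functions `shard_for` and `load_per_server` are hypothetical and are not taken from any particular SCM system.

```python
import hashlib
from collections import Counter

# Hypothetical file server node names, used for illustration only.
FILE_SERVERS = ["fs-0", "fs-1", "fs-2"]


def shard_for(repo_name, servers=FILE_SERVERS):
    """Return the single file server node assumed to store the repository.

    Assumption: repositories are assigned to nodes by hashing the
    repository name. A real sharded deployment may use another scheme.
    """
    digest = hashlib.sha256(repo_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def load_per_server(requests, servers=FILE_SERVERS):
    """Count how many requests each file server node must execute.

    Every request for a repository is counted against the node that
    stores it, because the SCM process must run on that node.
    """
    load = Counter({server: 0 for server in servers})
    for repo_name in requests:
        load[shard_for(repo_name, servers)] += 1
    return load


# One repository in high demand, two repositories in low demand.
requests = ["popular-repo"] * 90 + ["repo-a"] * 5 + ["repo-b"] * 5
load = load_per_server(requests)
hot_server = shard_for("popular-repo")
```

In this sketch the node returned for `popular-repo` handles at least 90 of the 100 requests regardless of how idle the other nodes are, because requests cannot be redirected to a node that does not store the data.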
An alternative approach to sharding repositories over multiple file servers is to replicate the entire contents of all SCM repositories managed by the SCM system on all nodes. This approach, however, introduces complexities (and data processing/communications overheads) in synchronizing repository data between nodes. When a client connected to a particular node makes changes to data those changes must be propagated through to all other nodes. This becomes even more complicated where different clients connected to different nodes make different/divergent changes to different copies of the same data, and those changes need to be reconciled. In addition, in order to replicate all SCM repositories on all nodes each node must have sufficient storage capacity to do so, increasing the storage requirements.
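By way of illustration only, the reconciliation problem described above can be sketched with a simplified model in which each node holds a full copy of a repository as an ordered list of commit identifiers. The `Replica` class and `propagate` function are hypothetical, and the model ignores branching and merging as performed by actual SCM systems.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A node's full copy of one repository (simplified, hypothetical model)."""
    name: str
    history: list = field(default_factory=list)  # ordered commit identifiers

    def commit(self, commit_id):
        """Record a change made by a client connected to this node."""
        self.history.append(commit_id)


def propagate(source, target):
    """Try to bring target up to date with source.

    Succeeds only when target's history is a prefix of source's history,
    so that source merely adds changes on top of what target already has.
    Returns False, leaving target unchanged, when target holds changes
    that source lacks and the copies therefore need to be reconciled.
    """
    shared = len(target.history)
    if source.history[:shared] == target.history:
        target.history = list(source.history)
        return True
    return False


node_a = Replica("node-a")
node_b = Replica("node-b")

# A change made on one node propagates cleanly to the other.
node_a.commit("c1")
first_sync_ok = propagate(node_a, node_b)

# Clients on different nodes then make different changes to the same data.
node_a.commit("c2-from-a")
node_b.commit("c2-from-b")
second_sync_ok = propagate(node_a, node_b)
```

Here the first propagation succeeds, while the second fails because the two copies have diverged. With more than two nodes, every change must be propagated to every other node, and such divergences must be detected and resolved for each pair of copies.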
Yet another approach is to store data repositories managed by the SCM system on a single file server. Storing repositories on a single file server simplifies backup operations and removes the requirement to reconcile data across different copies of the repositories. In a single file server approach, however, every node needing to access a repository does so by connecting to the same single file server, so all repository traffic is concentrated on that server. This has performance implications.