A distributed database is a database in which storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers.
Collections of data (e.g. in a database) can be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites.
To keep distributed databases up to date and current, two processes are used: replication and duplication. Replication involves using specialized software that looks for changes in the distributed database; once the changes have been identified, the replication process makes all the databases look the same. Replication can be very complex and time consuming, depending on the size and number of the distributed databases, and can require considerable time and computing resources. Duplication, on the other hand, is less complicated. It basically identifies one database as a master and then duplicates that database. Duplication is normally done at a set time after hours, to ensure that each distributed location has the same data. In the duplication process, changes are allowed only to the master database, ensuring that local data will not be overwritten. Either process can keep the data current in all distributed locations.
Besides distributed database replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. The implementation of these technologies can and does depend on the needs of the business and the sensitivity/confidentiality of the data to be stored in the database, and hence on the price the business is willing to pay to ensure data security, consistency and integrity.
Multi-version concurrency control (MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid overhead of filling in holes in memory or disk structures but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk—when updated, the entire document can be re-written rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the database to read, and read those versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes create future versions, but at the transaction ID at which the read is working everything is guaranteed to be consistent, because the writes occur at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
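The multi-version update behaviour described above can be sketched as follows. This is a minimal illustrative sketch, not the implementation of any particular database: the class and method names (`MVCCStore`, `write`, `read`) are assumptions chosen for the example. Updates append a new version rather than overwriting, and reads see only the versions visible at their snapshot timestamp.

```python
import itertools

class MVCCStore:
    """Illustrative multi-version store: updates append a newer version
    instead of overwriting; reads observe a snapshot of the data."""

    def __init__(self):
        self.versions = {}           # key -> list of (timestamp, value)
        self.clock = itertools.count(1)

    def write(self, key, value):
        # The old version is not deleted; a newer version is added.
        ts = next(self.clock)
        self.versions.setdefault(key, []).append((ts, value))
        return ts

    def read(self, key, snapshot_ts):
        # Return the latest version visible at snapshot_ts.
        visible = [v for (ts, v) in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

store = MVCCStore()
t1 = store.write("A", "a1")   # first version of A
t2 = store.write("A", "a2")   # newer version of A; the old one remains
# A reader whose snapshot was taken at t1 still sees the old value:
assert store.read("A", t1) == "a1"
assert store.read("A", t2) == "a2"
```

A real system would also periodically sweep through and delete obsolete versions, as noted above; that garbage-collection step is omitted here for brevity.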
FIG. 1 shows a known system in which transactions or changes to entries in the Resources resulting from operations of the Application, are controlled by a Transaction Manager (TM) or Transaction Co-ordinator and multiple Resource Managers (RMs). The TM co-ordinates multiple RMs into a single “global” transaction. The Application communicates with each Resource and also with the TM. This may be by way of APIs, and/or a query language, and/or a protocol. For present purposes, we shall refer to the interface between the Application and the Resources (and the TM) simply as an interface, although it will be understood that the term interface encompasses one or more of APIs, and/or a query language, and/or a protocol. The Application may be considered as a program that uses and manipulates the entries in the Resources.
Another way of looking at a known database system is shown in FIG. 2. A Resource in this context is a system or component that participates in a transaction processing system. A database may be considered as a typical Resource. The Resource in FIG. 2 may, at its simplest, be considered as a storage area that is managed by an RM. The RM is simply an interface exposed by a transacted Resource to the TM. The RM allows the TM to coordinate transaction boundaries across multiple Resources. Multiple RMs may be present in a single Resource. The RM may perform operations such as prepare(), commit() and rollback(). These operations are invoked by the TM. The TM invokes the RMs, and is only concerned with managing the prepare/commit/rollback lifecycles for the various RMs used in its transactions. An Application can access the Resource by way of APIs, and/or a query language, and/or a protocol that interfaces between the Application and the Resource, and which allows the Application to ask the Resource to do something. Operations are resource-specific; examples include Structured Query Language (SQL) queries, or key/value API operations such as findByPrimaryKey(key), update(key,value), remove(key), create(key,value) and so forth.
The industry has implemented a two-phase commit (2PC) protocol, and various standards (e.g. CORBA Object Transaction Service (OTS), Java® Transaction API etc.) have been put in place in relation to the 2PC protocol. In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction (it is a specialized type of consensus protocol). The protocol achieves its goal even in many cases of temporary system failure (involving process, network node, communication or other failures), and is thus widely utilized. However, it is not resilient to all possible failure configurations, and in rare cases user intervention (e.g. by a system administrator) is needed to remedy an outcome. To accommodate recovery from failure (automatic in most cases) the protocol's participants use logging of the protocol's states. Log records, which are typically slow to generate but survive failures, are used by the protocol's recovery procedures. Many protocol variants exist that differ primarily in logging strategies and recovery mechanisms. Though usually intended to be used infrequently, recovery procedures comprise a substantial portion of the protocol, due to the many possible failure scenarios to be considered and supported by the protocol.
In a “normal execution” of any single distributed transaction, i.e., when no failure occurs, which is typically the most frequent situation, the protocol comprises two phases:
i) The commit-request phase (or voting phase), in which a coordinator process attempts to prepare all the transaction's participating processes (named participants, cohorts, or workers) to take the necessary steps for either committing or aborting the transaction and to vote, either “Yes”: commit (if the transaction participant's local portion execution has ended properly), or “No”: abort (if a problem has been detected with the local portion), and
ii) The commit phase, in which, based on voting of the cohorts, the coordinator decides whether to commit (only if all have voted “Yes”) or abort the transaction (otherwise), and notifies the result to all the cohorts. The cohorts then follow with the needed actions (commit or abort) with their local transactional resources (also called recoverable resources; e.g., database data) and their respective portions in the transaction's other output (if applicable).
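The two phases above can be sketched as follows. This is an illustrative simplification under stated assumptions: there is no failure handling or persistent logging, and the cohort objects are hypothetical stand-ins exposing the prepare()/commit()/rollback() operations mentioned earlier in connection with the RMs.

```python
def two_phase_commit(cohorts):
    """Sketch of the normal-execution 2PC flow: a voting phase
    followed by a commit phase (no failure recovery modelled)."""
    # Phase 1: commit-request (voting) phase - each cohort votes Yes/No.
    votes = [c.prepare() for c in cohorts]

    # Phase 2: commit phase - commit only if all voted "Yes",
    # abort otherwise, and notify every cohort of the result.
    if all(votes):
        for c in cohorts:
            c.commit()
        return "committed"
    for c in cohorts:
        c.rollback()
    return "aborted"

class Cohort:
    """Hypothetical participant whose local portion may succeed or fail."""
    def __init__(self, ok=True):
        self.ok, self.state = ok, "active"
    def prepare(self):  return self.ok      # vote Yes/No
    def commit(self):   self.state = "committed"
    def rollback(self): self.state = "aborted"

a, b = Cohort(), Cohort(ok=False)
# One "No" vote forces every cohort to abort:
assert two_phase_commit([a, b]) == "aborted"
assert a.state == b.state == "aborted"
```

Note that a single "No" vote aborts the whole global transaction, even for cohorts whose local portions executed properly.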
Referring now to FIG. 3, in a known MVCC environment, an algorithm is used for implementing multiple isolation levels, including the strictest isolation. This means that the system is serializable.
Multiple versions of each database entry are stored, each with an associated version number or timestamp. The version number or timestamp is allocated by the database or the Resource handling the transaction. In the example shown in FIG. 3, the version number is shown as “DBTimeStamp=8”. Version numbers or timestamps are typically a monotonically increasing sequence.
In FIG. 3, which shows a simple database with three entries A, B and C, it can be seen that entry A has versions 1, 2 and 3; entry B has versions 4, 5 and 8; and entry C has versions 6 and 7. Transactions observe a consistent snapshot of the contents of the database by storing the DBTimeStamp at the point of first access to the database or RM. For example, a transaction that started when DBTimeStamp had a value of 3 would only be able to see entry A version 3; a transaction that started when DBTimeStamp had a value of 5 would only be able to see entry A version 3 and entry B version 5; and a transaction that started when DBTimeStamp had a value of 7 would only be able to see entry A version 3, entry B version 5 and entry C version 7.
In other words, each entry in the database, and each change to an entry (e.g. creation, updates, removal etc.) is stored in the database along with a version number or timestamp. A transaction can only “see” the appropriate values in the database that are valid for that particular transaction. A transaction therefore has associated meta-data (a first-read timestamp) that is initialised when the transaction first reads/writes from/to the RM. The database only lets a transaction observe entries that have timestamps less than or equal to the timestamp of the transaction. This effectively confines the transaction to entries that have been made or changed prior to the timestamp of the transaction. Each time a given transaction commits, the current “global” timestamp of the RMs is moved forward, so that new transactions (started later) will see the modifications that the given transaction has made. In order to move the global timestamp forward, concurrent transactions have to use mutual exclusion to update the global timestamp.
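The visibility rule described above can be sketched directly from the FIG. 3 example. The function name below is an assumption for illustration; the entry and version numbers are those given in the text.

```python
# Entries from FIG. 3: key -> list of version timestamps.
entries = {"A": [1, 2, 3], "B": [4, 5, 8], "C": [6, 7]}

def visible_versions(start_ts):
    """Return, for each entry, the newest version that a transaction
    which started at DBTimeStamp == start_ts is allowed to observe
    (i.e. versions with timestamps <= the transaction's timestamp)."""
    result = {}
    for key, versions in entries.items():
        seen = [v for v in versions if v <= start_ts]
        if seen:
            result[key] = max(seen)
    return result

# The three cases given in the text:
assert visible_versions(3) == {"A": 3}
assert visible_versions(5) == {"A": 3, "B": 5}
assert visible_versions(7) == {"A": 3, "B": 5, "C": 7}
```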
Referring now to FIG. 4, a new transaction begins and is assigned a visibility of DBTimeStamp=8. In this example, the new transaction will update entry C, and commit. The transaction commits as follows, with operations 1), 2) and 3) being performed as a single action (i.e. the operations are not interleaved across concurrent transactions):
1) read DBTimeStamp with value 8
2) insert the new update creating C with version 8+1==9
3) update DBTimeStamp to value 9
In a local-only system it is possible to use locking to ensure correct ordering and non-interleaving. In a distributed system such as a multi-server cluster, in order to make database changes available across servers, it is usual to use an ordering to apply the changes to other nodes. In other words, all nodes apply the same changes, in the same order, using a deterministic algorithm. This means that, given the same starting state and the same changes being applied (in the same order), each node will reach the same state. A total ordering of commit messages is used to ensure that the operations are not interleaved. Total ordering means that every server/node processes the same operations in the same order. The operations must be non-interleaved, otherwise the RM/DB or Client will end up with incorrect data or states. This ordering enforces a one-after-the-other application of changes, which means that it is not possible to make use of parallelism, as it would not have any benefit.
Distribution is a key requirement for systems with availability and performance that is greater than the availability and performance of a single server. In addition, it is desirable to have a system where there are multiple active nodes so as to ensure high availability as well as scalability greater than the capacity of any single server. It is also desirable to reduce the general network chatter between nodes, and to prevent chatter from taking place mid-transaction. Accordingly, network messages are passed at the end of a transaction as part of the commit protocol.
Moreover, all nodes apply the same whole-commit in the same order (i.e. Total Ordering of Commits—TOC). This ensures that the commit operations are non-interleaved across different transactions. The application of a commit uses a deterministic algorithm, hence all nodes reach the same state.
The benefits of this known implementation are that it provides a scalable and highly available system architecture, it is active/active over N nodes, and communication takes place only at the end of a transaction, thereby reducing latency. There is, however, a significant drawback, in that a scalability bottleneck is created around commit ordering (since all commits must be executed serially).
If a certain isolation level (e.g. read-committed, or repeatable-read, or serializable, and so forth) is needed, then it is necessary to have a Resource that provides such isolation. Moreover, if multiple Resources are involved in a transaction, then there is a problem that the overall isolation will be less than the isolation of a single Resource. This is because different Resources can release internal resources (such as locks) when they commit, and the Resources commit at different times. As a result, Applications can observe changes in one Resource before they can observe changes in another Resource, leading to data corruption. This is not a particularly problematic issue if Applications or Resources always use a form of pessimistic concurrency control, but if some parts of the system use optimistic concurrency control and other parts use pessimistic concurrency control, then guarantees are lost if the entries are spread across multiple Resources.
Referring now to FIG. 5, this shows a simple system with two nodes (Server 1 and Server 2), each in the same state. Each node applies the same transactions in the same order, resulting in the same state after the transactions. In the illustrated example, the transaction is an update to entry B. In simple terms, in a distributed system where any member of a group can multicast a transaction message to any other member, certain problems can arise. One of the most significant is that messages can be interleaved. For example, if process X sends message 1 and process Y sends message 2, it is possible that some group members receive message 1 first, and others receive message 2 first. If both messages update the value of some shared data structure, then it is possible that different members will have different values for the data structure after the transaction. TOC helps to prevent this situation by forcing all messages or transactions to be accepted and processed in some fixed order. Timestamps are one way of doing this, and they allow a receiver that gets an out-of-sequence message to recognise it as such and to hold the message until the preceding messages have been received.
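The hold-back behaviour just described can be sketched as follows. This is an illustrative model only; the class name and the use of simple integer sequence numbers as timestamps are assumptions made for the example.

```python
import heapq

class TotalOrderReceiver:
    """Illustrative hold-back queue: each message carries a sequence
    number (timestamp), and an out-of-sequence message is held until
    all preceding messages have been received and delivered."""

    def __init__(self):
        self.expected = 1         # next sequence number to deliver
        self.held = []            # min-heap of (seq, message)
        self.delivered = []       # messages delivered in total order

    def receive(self, seq, msg):
        heapq.heappush(self.held, (seq, msg))
        # Deliver every message that is now in sequence.
        while self.held and self.held[0][0] == self.expected:
            _, m = heapq.heappop(self.held)
            self.delivered.append(m)
            self.expected += 1

r = TotalOrderReceiver()
r.receive(2, "update B")          # out of sequence: held back
assert r.delivered == []
r.receive(1, "update A")          # now both are delivered, in order
assert r.delivered == ["update A", "update B"]
```

Because every receiver applies the same rule, all group members deliver the messages in the same order regardless of the order in which they arrive.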
There is a useful and detailed discussion of distributed multi-version commitment ordering protocols for guaranteeing serializability during transaction processing in U.S. Pat. No. 5,701,480, the full contents of which are hereby incorporated into the present application by way of reference.
U.S. Pat. No. 5,701,480 explains in detail how it is possible to define a single global serializability across multiple Resources in multi-value databases.
It is well known that global serializability is not guaranteed merely by ensuring that each processor or process achieves local serializability, because local transactions may introduce indirect conflicts between distributed global transactions. It is impractical to permit a processor or process to view a global picture of all the conflicts in all of the other processors or processes. Without a global picture, however, it is difficult for a processor or process to ensure that there is a correlation between its serializability order and the serializability orders of the other processors or processes. Time-stamping of transaction requests and data updates is one method that has been used to address this problem of concurrency control. In general, concurrency control in a distributed computing system has been achieved at the expense of restricted autonomy of the local processors or processes, or by locking.
Global serializability can be guaranteed in a distributed transaction processing system by enforcing a “commitment ordering” for all transactions. U.S. Pat. No. 5,504,900 shows that if global atomicity of transactions is achieved via an atomic commitment protocol, then a “commitment ordering” property of transaction histories is a sufficient condition for global serializability. The “commitment ordering” property occurs when the order of commitment is the same as the order of performance of conflicting component operations of transactions. Moreover, it is shown that if all of the local processes are “autonomous,” i.e. they do not share any concurrency control information beyond atomic commitment messages, then “commitment ordering” is also a necessary condition for global serializability.
However, neither U.S. Pat. No. 5,701,480 nor U.S. Pat. No. 5,504,900 addresses the issue of scalability. Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth. For example, it can refer to the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added.
Scalability is a highly significant issue in databases and networking. A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.
An algorithm, design, networking protocol, program, or other system is said to scale if it is suitably efficient and practical when applied to large situations (e.g. a large input data set, a large number of outputs or users, or a large number of participating nodes in the case of a distributed system). If the design or system fails when a quantity increases, it does not scale.