1. Field of the Invention
This invention generally relates to database management systems. More specifically, this invention relates to a method and apparatus for implementing a multi-user, elastic, on-demand, distributed relational database management system characterized by atomicity, performance and scalability.
2. Description of Related Art
Over the past years the use of databases for storing and retrieving messages has emerged as an important tool in a wide variety of commercial applications. Initially, many database systems operated on a single server installation with multiple users. However, various factors have developed over the past years that have required the basic nature of database architecture to change. As a first factor, database storage requirements have become extremely large. Second, the number of users trying to access such databases has also become large. Third, the use of databases to retrieve relatively stable data with minimal updates has been replaced by transactional processing.
A transaction is a unit of work that must be completed in its entirety. A single transaction may include multiple data manipulations. As an example, a single transaction may include a read operation followed by a write operation. In recent years significant effort has been directed to enabling relational databases to support ever increasing rates of transaction processing.
Databases are now judged by a standard that defines ACID properties, namely: atomicity, consistency, isolation and durability. Atomicity guarantees that all transaction tasks will be completed in their entireties. Consistency assures that only valid data is written to the database. Isolation assures that other operations cannot access or “see” data in an intermediate state during a transaction. Durability assures that once a transaction has been processed successfully, it cannot be undone.
Consistency is particularly important in multi-user systems where it is possible for two or more users to seek concurrent access to shared volatile data. Early multi-user systems used locking operations to assure consistency. Locks could be exclusive, or write, locks, or non-exclusive, or read, locks and could be applied to individual records or to pages. However, as databases have grown in size and as transaction rates have increased, the overhead for managing locks has become significant and, in some cases, prohibitive.
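The distinction between exclusive (write) locks and non-exclusive (read) locks on individual records can be illustrated with a minimal sketch. The class name `LockTable`, its methods and the transaction labels are hypothetical and are not taken from any particular prior art system; the sketch only shows which lock requests are compatible.

```python
from typing import Dict, Set, Tuple


class LockTable:
    """Hypothetical record-level lock table (illustrative only)."""

    def __init__(self) -> None:
        # record_id -> (mode, set of transactions holding the lock)
        self._locks: Dict[str, Tuple[str, Set[str]]] = {}

    def acquire(self, record_id: str, txn: str, mode: str) -> bool:
        """Try to lock a record; return False if the request conflicts."""
        if mode not in ("read", "write"):
            raise ValueError("mode must be 'read' or 'write'")
        held = self._locks.get(record_id)
        if held is None:
            self._locks[record_id] = (mode, {txn})
            return True
        held_mode, holders = held
        # Read locks are non-exclusive: any number of readers may share.
        if mode == "read" and held_mode == "read":
            holders.add(txn)
            return True
        # A sole holder may re-acquire or upgrade its own lock.
        if holders == {txn}:
            new_mode = "write" if "write" in (mode, held_mode) else "read"
            self._locks[record_id] = (new_mode, holders)
            return True
        # Any other combination involves a write lock and conflicts.
        return False

    def release(self, record_id: str, txn: str) -> None:
        """Release a transaction's lock on a record, if it holds one."""
        held = self._locks.get(record_id)
        if held is None:
            return
        held[1].discard(txn)
        if not held[1]:
            del self._locks[record_id]
```

Every read and write passes through such a table, which suggests why the bookkeeping overhead grows with database size and transaction rate.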
Multi-version concurrency control (MVCC) is an alternative process for assuring concurrency. MVCC can be more effective than locks with complex databases. MVCC uses timestamps or increasing transaction identifications (IDs) to serialize different versions of a record. Each transaction reads the most recent version of an object that precedes the transaction's timestamp or ID. With this control method, any change to a record, for example, will not be seen by other users until the change is committed. MVCC also eliminates locks and their attendant overhead and establishes a system in which read operations cannot block write operations.
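The version selection rule described above can be sketched as follows. This is a simplified illustration that assumes a single increasing counter for both snapshot and commit IDs; the names `MVCCStore`, `begin`, `commit_write` and `read` are hypothetical and do not describe any specific product.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Version:
    value: Any
    commit_id: int  # ID at which this version became visible


@dataclass
class MVCCStore:
    """Hypothetical in-memory multi-version store (illustrative only)."""

    next_id: int = 1
    versions: Dict[str, List[Version]] = field(default_factory=dict)

    def begin(self) -> int:
        """Start a reader; it sees only versions committed before this ID."""
        snapshot = self.next_id
        self.next_id += 1
        return snapshot

    def commit_write(self, key: str, value: Any) -> int:
        """Append a new committed version instead of overwriting in place."""
        commit_id = self.next_id
        self.next_id += 1
        self.versions.setdefault(key, []).append(Version(value, commit_id))
        return commit_id

    def read(self, key: str, snapshot: int) -> Optional[Any]:
        """Return the most recent version that precedes the snapshot ID."""
        visible = [v for v in self.versions.get(key, [])
                   if v.commit_id < snapshot]
        return visible[-1].value if visible else None
```

In this sketch a reader that began before a later write continues to see the older version, and the writer never waits for the reader.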
In addition to meeting the ACID tests, there now is a requirement for continuous availability to users. Some database systems dedicate one computer system to transaction processing and another to decision support and other reporting processes. They are interconnected so that other functions can be supported simultaneously. As databases grow in size and complexity, existing data processing systems are replaced by more powerful data processing systems. Another approach for accommodating growth involves replicated systems where one machine is designated as a “head” machine that keeps all the replicated machines in synchronism. If a head machine were to fail, a process would assign that function to another replicated machine. Different replicated machines are available to certain users. This approach is not scalable because all the machines have to have the same capability.
As another approach, multiple autonomous database systems can be integrated into a single “federated” database with a computer network interconnecting the various individual databases. Federated databases require “middleware” to maintain the constituent databases in synchronism. This “middleware” can become very complex. As the database size increases, the resources required for operating the middleware may impose so great an overhead that overall system performance degrades.
“Partitioning” is another approach for implementing databases in which a logical database or its constituent elements are divided into distinct independent parts. In a distributed database management system, each partition may be spread over multiple nodes. Users at a given node can perform local transactions on the partition. Partitioning also can be implemented by forming smaller databases or by splitting selected elements of just one table.
There are two general approaches to partitioning. In horizontal partitioning, also called “sharding”, different rows are placed in different tables and on different servers. Generally the rows in a shard have a certain commonality, such as a range of zip codes or last names, and are divided into different tables by ranges. For example, a first database might contain all the records for last names in the range A through M; a second database, in the range N through Z. Sharding does reduce the number of rows in each table and increases search performance. However, sharding uses a hash code at an application level that makes it much more difficult to implement. It also incorporates a two-phase commit. Because of these complexities, sharding is suitable mainly for particular applications in which the basis for defining the shards is well defined.
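The last-name example above can be sketched as an application-level routing function. The database names `db_a_to_m` and `db_n_to_z` and the function `shard_for` are hypothetical, and the sketch routes by range only; it does not show the hash code or the two-phase commit mentioned above.

```python
from bisect import bisect_right

# Hypothetical shard layout: each shard starts at the given first letter.
SHARD_STARTS = ["A", "N"]
SHARD_NAMES = ["db_a_to_m", "db_n_to_z"]


def shard_for(last_name: str) -> str:
    """Return the name of the database holding records for this last name."""
    first = last_name[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError(f"cannot route last name: {last_name!r}")
    # Find the last shard whose starting letter is <= the first letter.
    index = bisect_right(SHARD_STARTS, first) - 1
    return SHARD_NAMES[index]
```

A query for a single last name touches one database, whereas a query spanning both ranges would have to be sent to both and the results combined by the application.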
Vertical partitioning involves the creation of tables with fewer columns and splitting columns across tables. Like a federated database, vertical partitioning requires middleware to determine how to route any request for a particular field to an appropriate partition. In addition these systems operate using a two-phase commit sequence which is complicated to implement.
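The two-phase commit sequence referred to above can be sketched in simplified form. The `Participant` class and `two_phase_commit` function are hypothetical; a practical implementation would also have to handle timeouts, coordinator failure and durable logging, which is a large part of why the sequence is complicated to implement.

```python
from typing import List


class Participant:
    """Hypothetical partition taking part in a distributed transaction."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        """Phase 1: vote on whether the local work can be committed."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        """Phase 2 (success): make the local work permanent."""
        self.state = "committed"

    def rollback(self) -> None:
        """Phase 2 (failure): discard the local work."""
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Commit on all participants only if every one of them votes yes."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit everywhere or roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

A single participant that cannot commit forces every other participant to roll back, which preserves atomicity across partitions at the cost of an extra round of messages.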
In still another approach, known as a “shared-nothing” architecture, each node is independent and self-sufficient. Shared-nothing architecture is popular for web development because it can be scaled upward simply by adding nodes in the form of inexpensive computers. This approach is popular in data warehousing applications where updates tend to be less frequent than would occur with transaction processing. However, the processing of joins is very complex over large data sets from different partitions or machines.
Some database systems are referred to as “distributed” systems. One implementation of a distributed system incorporates “clusters” and two communications paths. A high-speed Internet path carries data among the clusters. High-speed dedicated communications paths are required for various control functions, such as lock management. While this approach solves the redundancy and availability issues for databases, lock management, as previously discussed, can limit system performance.
In a “shared everything” system, super high-speed communications keep the system in synchronism. However, lock management can require significant bandwidth resources. To avoid this, such systems incorporate point-to-point communications channels and a very sophisticated disk controller.
Collectively, those prior art systems satisfy some but not all of the known requirements for a database system. What is needed is a database architecture that is scalable and that meets the ACID properties of atomicity, consistency, isolation and durability. What is also needed is a database system that operates over the Internet without the need for dedicated high-speed communications paths, that provides transaction processing and that is operable over a wide geographic area.