1. Field of the Invention
The present invention relates to large-scale databases, and more particularly, to a database architecture having incremental scalability and that is adapted for use with Internet database systems.
2. Background Information
The amount of data generated by a typical Internet website is tremendous. There is a need for Internet applications that can store, manipulate, and retrieve large amounts of data. For example, a typical e-commerce website maintains information for each user, such as shipping and billing data, previous shopping experiences, and category preferences. Popular websites may have millions of these data records. The explosive growth of Internet data is due to two primary factors. First, as the Internet expands, its reach becomes more pervasive, with more and more users going online. Second, as Internet applications become more dynamic and personalized, more data are stored about each user. Data storage solutions have therefore become a critical piece of the Internet infrastructure.
The term “netstore” as used herein is defined to be an Internet-scale data store that can handle both the traffic and capacity required by an Internet application. The netstore must have several capabilities. First, the typical number of total users that can access the netstore is extremely large (e.g., greater than 100 million users). Additionally, the typical number of concurrent users is large (e.g., 1 million users). Read operations to the netstore are more prevalent than write operations (e.g., a 10-to-1 read-to-write ratio for some Internet applications, or even 100-to-1 for others). The netstore must be able to store a large amount of data and should be simple and flexible. Additionally, the data stored therein can be treated as a collection of bits that only has meaning to the particular Internet application.
Traditionally, data storage architectures for Internet applications, such as those that implement netstores, have been built upon relational and object-oriented database management systems (DBMS). These products have been developed primarily for the enterprise domain. However, it has been found that the data handling requirements of the Internet domain are significantly different from the requirements of a typical enterprise domain. Not only does the Internet domain place new demands on a netstore in terms of scalability, reliability, and flexibility, the data model itself has changed. Most of these Internet applications require a very simple data model in which the need to manage complex interrelationships in the data is deemphasized. Emphasis is instead placed on simplicity and flexibility of the data model. For instance, many Internet applications require the ability to read, write, or modify a single small data record individually.
Current DBMS products are not well suited for Internet applications because they have not been designed to address the distinct problem space presented by Internet applications. Consequently, solutions built using the enterprise DBMS products to address these Internet problems are costly to design, deploy and maintain.
Most of today's Internet sites that have read/write/modify storage requirements use relational database management systems (RDBMS). The reason why these sites choose RDBMS software is primarily one of convenience. There is an abundance of software tools that provide access to RDBMS products from web and application servers, thereby enabling sites to implement their netstores using off-the-shelf software.
In order to create a netstore with an RDBMS, the site must perform the following tasks:
(a) Design the database (i.e., tables, schema, relations, keys, stored procedures, etc.)
(b) Install, tune and maintain the database servers.
(c) Architect a scalable database system that is reliable, fault-tolerant and can handle the load and data required.
(d) Database-enable the web pages through a dynamic web server model. Typical options on Windows NT include: ASP/ADO (scripting) or ISAPI/ODBC (C/C++ code). Typical options on Unix include: CGI-BIN/ODBC or NSAPI/ODBC/JDBC.
(e) Database-enable the application servers through custom code such as ODBC or JDBC.
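As a concrete illustration of the custom glue code required by steps (d) and (e), the following sketch uses Python's DB-API with an in-memory SQLite database as a stand-in for an ODBC/JDBC connection to an RDBMS; the table layout, column names, and helper functions are illustrative assumptions only:

```python
import sqlite3

def open_netstore():
    # in-memory SQLite stands in for an ODBC/JDBC RDBMS connection
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (user_id TEXT PRIMARY KEY,"
        " shipping TEXT, billing TEXT, preferences TEXT)"
    )
    return conn

def write_user(conn, user_id, shipping, billing, preferences):
    conn.execute(
        "INSERT OR REPLACE INTO users VALUES (?, ?, ?, ?)",
        (user_id, shipping, billing, preferences),
    )
    conn.commit()

def read_user(conn, user_id):
    return conn.execute(
        "SELECT shipping, billing, preferences FROM users WHERE user_id = ?",
        (user_id,),
    ).fetchone()

conn = open_netstore()
write_user(conn, "u1", "123 Main St", "visa", "books")
```

Even this minimal per-record access pattern requires schema design, connection management, and custom query code, which is the burden the tasks above describe.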
Given the problem domain of enterprise-level database systems, limitations generally arise when they are used in a netstore implementation, since they are designed to be efficient at handling related data, and are not easily scalable. Key limitations with relational database systems used in a netstore environment include high maintenance costs, insufficient performance, poor scalability, and high implementation complexity.
It is therefore desired to provide a scheme that addresses the Internet application space directly through use of a specialized solution that provides better performance than conventional approaches such as an RDBMS. Preferably, the solution should be highly reliable, highly scalable, and provide easy migration from existing products.
The present invention addresses the foregoing desires by providing an incrementally-scalable database system and method. The system architecture implements a netstore as a set of cooperating server machines. This set is divided into clusters, each of which consists of one or more server machines. All machines within a cluster are replicas of one another and store the same data records. The data is partitioned among the clusters, so that each data record in the netstore is stored in exactly one cluster.
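The partitioning and replication invariant described above can be sketched as follows, with Python dictionaries standing in for server machines; the class and variable names, and the hash-based partitioning shown, are illustrative assumptions:

```python
class Cluster:
    """A set of replica machines that all store identical data records."""
    def __init__(self, replica_names):
        # one dict per replica machine, standing in for a database server
        self.replicas = {name: {} for name in replica_names}

    def write(self, key, value):
        # every replica in the cluster applies the same change
        for store in self.replicas.values():
            store[key] = value

def cluster_for(key, clusters):
    # each data record is partitioned to exactly one cluster
    return clusters[hash(key) % len(clusters)]

clusters = [Cluster(["a1", "a2"]), Cluster(["b1", "b2"])]
cluster_for("user:1", clusters).write("user:1", {"name": "Alice"})
```

After the write, every machine in exactly one cluster holds the record, and no machine in any other cluster does.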
This architecture allows for incremental scalability, load balancing, and reliability despite hardware or software failures. The system architecture enables database capacity to be scaled by adding resources, such as additional servers, without requiring that the system be taken offline. Such scaling includes both adding one or more computer servers to a given server cluster, which enables an increase in database read transaction throughput, and adding one or more server clusters to the system configuration, which provides for increased read and write transaction throughput.
The system also provides for load balancing read transactions across each server cluster, and load balancing write transactions across a plurality of server clusters. Read transactions can be served by different replicas at the same time, spreading out the load. For example, if there are 3 servers in a server cluster, approximately ⅓ of the requests will be routed to each machine, allowing for nearly 3 times the potential read transaction throughput of a single server. Since write requests are routed to a single cluster, adding clusters spreads out the write transaction load, with a similar effect on write throughput.
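A minimal sketch of such round-robin read balancing follows; the rotating-counter policy is one of several possible policies and is an assumption for illustration:

```python
import itertools

class ReadBalancer:
    """Routes each read request to the next replica in rotation."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self):
        return next(self._cycle)

balancer = ReadBalancer(["replica-1", "replica-2", "replica-3"])
# nine requests are spread evenly: three to each of the three replicas
routed = [balancer.route() for _ in range(9)]
```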
The system also provides for very high availability (HA) through its use of clustering. Because each machine in a server cluster is an identical replica of every other machine in the cluster, if a server fails, the problem is masked from the applications. The failed machine is removed from the system, and the other replica servers in the cluster remain available to satisfy requests directed to the failed server, without any impact on the application.
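This failure-masking behavior can be illustrated with a simple routing sketch; the server names and the modulo routing policy are assumptions for illustration:

```python
class ClusterRouter:
    """Routes requests only to replicas currently known to be live."""
    def __init__(self, replicas):
        self.live = list(replicas)

    def mark_failed(self, name):
        # remove the failed machine from the system
        self.live.remove(name)

    def route(self, request_id):
        # requests keep succeeding as long as one replica survives
        return self.live[request_id % len(self.live)]

router = ClusterRouter(["s1", "s2", "s3"])
router.mark_failed("s2")
targets = {router.route(i) for i in range(10)}
```

The application keeps calling `route` unchanged; only the surviving replicas receive traffic.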
A system implementing the invention includes an application server layer, comprising one or more computers, serving as clients of a data storage layer comprising one or more server computers. The application server layer comprises compute servers that host an application program, such as a web server. Also included is a scalable database server layer comprising one or more server clusters, wherein each server cluster includes one or more database servers. Data is stored on the computer servers in the server clusters, wherein the data on each computer server in a given cluster is replicated. Under a typical configuration, the database(s) will comprise an RDBMS database, such as a SQL-based database, that comprises a plurality of record objects stored in tables defined by the database schema. The table data are partitioned into fragments and distributed across the server clusters such that each server cluster stores approximately an equal number of record objects. The database server layer also includes a configuration management component that provides other components in the system with up-to-date information about the present configuration of the database server layer. This configuration information includes mapping information (known as the fragment map) that identifies on which server clusters various record objects are stored. The architecture also includes an intermediate “virtual transaction” layer, disposed between and in communication with the application server layer and the database server layer, that comprises one or more computers. A database update/distributor transaction module running on each of the computers in the virtual transaction layer coordinates write transactions in a strongly consistent fashion, such that all replicas appear to process a single change simultaneously and instantaneously.
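By way of illustration, the fragment map described above might be modeled as follows, using a stable CRC-32 hash so that every layer computes the same fragment for a given key; the fragment count, hash choice, and names are all assumptions:

```python
import zlib

NUM_FRAGMENTS = 8  # assumed small for illustration

def fragment_of(key):
    # a stable hash, so every component maps a key to the same fragment
    return zlib.crc32(key.encode()) % NUM_FRAGMENTS

# fragment id -> cluster id; maintained by the configuration management
# component and pushed to other components as the configuration changes
fragment_map = {frag: frag % 2 for frag in range(NUM_FRAGMENTS)}

def cluster_for(key):
    return fragment_map[fragment_of(key)]
```

A record's key determines its fragment, and the fragment map determines which cluster serves it.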
This virtual transaction layer also enables load balancing of database write transactions such that write transactions are evenly distributed across the various server clusters in the system.
According to other aspects of the architecture, an application program interface (API) is provided that enables application programs to perform transactions on record objects in the database and other database interactions, such as creating/deleting tables, etc., whereby the application program does not need to know where (i.e., on which server cluster) the record objects are stored or need to implement the native interface language of the database. For example, many RDBMS databases implement variations of the SQL language for manipulation of record objects. The API also includes configuration information that is dynamically updated in accord with changes to the database server layer (e.g., the addition of new computer servers or a new server cluster), which enables application programs to perform read transactions on record objects in the database(s) in a manner that provides load balancing of such transactions.
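The following is a sketch of the kind of API layer described here, in which the application reads and writes record objects by key and routing to the correct cluster and replica happens inside the library; the interface shown is hypothetical, not the system's actual API:

```python
class NetstoreClient:
    """Hypothetical client API: callers never see which cluster holds a key."""
    def __init__(self, clusters):
        # each cluster is a dict of replica-name -> record store
        self.clusters = clusters

    def _cluster(self, key):
        return self.clusters[hash(key) % len(self.clusters)]

    def put(self, key, record):
        # writes go to every replica of the owning cluster
        for store in self._cluster(key).values():
            store[key] = record

    def get(self, key):
        # any replica can serve a read; pick one by key hash here
        stores = list(self._cluster(key).values())
        return stores[hash(key) % len(stores)].get(key)

client = NetstoreClient([{"a1": {}, "a2": {}}, {"b1": {}, "b2": {}}])
client.put("user:7", {"plan": "basic"})
```

The application calls `put` and `get` without knowing the cluster layout or any native database language.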
The architecture provides for incremental scaling of a database server system, whereby read transaction throughput can be increased by adding servers to one or more server clusters, and write and read transaction throughput can be increased by adding one or more server clusters. Each server cluster stores a percentage of all data being stored (approximately 1/(number of server clusters)), wherein a duplicate copy of that partition of the data is stored on each of the computer servers in the cluster. The partitions of data include both record objects and database schema data, including database tables and associated indices and stored procedures. Record objects are distributed across the server clusters based on the fragments to which they are assigned. Preferably, the record objects are assigned to fragments based on a hashing function. As discussed above, data corresponding to the configuration of the database server layer is maintained such that the system knows where the data is stored, and read and write transaction load balancing is provided.
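The property that hash-based fragment assignment spreads record objects roughly evenly across clusters can be illustrated numerically; the fragment and cluster counts are illustrative, and real deployments would use far more fragments:

```python
import zlib

NUM_FRAGMENTS = 64  # illustrative
NUM_CLUSTERS = 4

def fragment_of(key):
    # hash each record key to a fragment
    return zlib.crc32(key.encode()) % NUM_FRAGMENTS

# count how many of 10,000 record keys land on each cluster
counts = [0] * NUM_CLUSTERS
for i in range(10_000):
    counts[fragment_of(f"user:{i}") % NUM_CLUSTERS] += 1
```

With a well-mixing hash, each cluster receives close to 1/4 of the records.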
According to further aspects of the method, the system can be incrementally scaled to improve write and read transaction throughput by adding another server cluster to the system. This comprises adding one or more new servers, creating the applicable database resources on the new servers (i.e., database tables, associated indices, stored procedures, etc.), and migrating a portion of the data stored on one or more of the other server clusters to the new server cluster. During data migration, record objects are shipped to the new cluster on either an individual-fragment or a fragment-range basis, such that database transactions can continue to occur while the migration is taking place.
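Fragment-at-a-time migration can be sketched as follows, with dictionaries standing in for clusters and a string-valued fragment map; all names are illustrative assumptions:

```python
def migrate_fragment(frag_id, src_cluster, dst_cluster, fragment_map):
    # copy the fragment's record objects to the new cluster...
    dst_cluster[frag_id] = src_cluster.pop(frag_id)
    # ...then repoint the fragment map so subsequent requests go there;
    # other fragments remain served by the old cluster throughout
    fragment_map[frag_id] = "new-cluster"

old_cluster = {0: {"u1": "rec1"}, 1: {"u2": "rec2"}}
new_cluster = {}
fragment_map = {0: "old-cluster", 1: "old-cluster"}
migrate_fragment(1, old_cluster, new_cluster, fragment_map)
```

Because only the fragment being moved is in transition at any moment, transactions against the remaining fragments proceed normally.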
According to yet another aspect of the method, the system can be incrementally scaled to improve read transaction throughput by adding one or more computer servers to a given cluster. As discussed above, the system provides load balancing across each cluster such that read transactions are evenly distributed across all of the computer servers in a given cluster. Since each computer server maintains identical data, adding another computer server to a cluster provides a new resource for facilitating read transactions. Accordingly, this aspect of the method comprises adding a new computer server to a server cluster, creating relevant database objects (tables, stored procedures, etc.) on the new computer server, and copying record objects from one or more other computer servers in the cluster to the new computer server.
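In sketch form, the replica-addition procedure described above reduces to copying the record objects from an existing peer; dictionaries again stand in for database servers, and schema creation is reduced to a comment:

```python
def add_replica(cluster, new_server):
    # in a real system: first create the relevant database objects
    # (tables, stored procedures, etc.) on the new server, then copy
    # record objects from any existing replica, since all are identical
    peer = next(iter(cluster.values()))
    cluster[new_server] = dict(peer)

cluster = {"s1": {"user:1": "rec"}, "s2": {"user:1": "rec"}}
add_replica(cluster, "s3")
```

Once the copy completes, the new server is an identical replica and can begin absorbing read traffic.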
As a result of the foregoing schemes, the database server system can be incrementally scaled without having to take the system down, and without having to re-architect the system. Notably, such configuration changes are handled internally by the system such that there are no changes required to application programs that use the system to access data stored in the database server layer.