1. Field of the Invention
The present invention relates to optimizing storage of data in distributed databases. In particular, the present invention relates to moving data in distributed databases to the storage locations that are closest to the client nodes that will access the related data most frequently.
2. Description of the Related Art
Distributed computing has evolved into technologies that enable multiple devices to share in providing a given service. For example, grid computing has been proposed as a model for solving substantially large computational problems using large numbers of computers arranged as clusters embedded in a distributed infrastructure. Grid computing has been described as differing from distributed computing through an increased focus on resource sharing, coordination, manageability, and high performance. The focus on resource sharing has been referred to as “the grid problem”, which has been defined as the set of problems associated with resource sharing among a set of individuals or groups.
One form of distributed computing involves distributed databases. Distributed databases have been attempted in a variety of implementations, where the central goal is to distribute data among multiple locations in order to provide load-balancing. Databases are composed of tables of records, where each record (e.g., row) includes a number of elements. An element, referred to as a field or column (i.e., each column of a row holds one element), is the quantum of data that is stored (as an alphanumeric value) with other elements of the record (row) based on an established relationship; hence, a record can be illustrated as a row, and each column has a corresponding field of the record.
Conventional attempts at implementing distributed databases have involved splitting database tables, where table records (i.e., rows) are arbitrarily split across storage device boundaries. For example, a distributed database may store records 1-50 on a first storage device and records 51-100 on a second storage device, as implemented as early as the 1970s using the Honeycomb Drive technology developed by IBM. Hence, the database records are split according to a physical hierarchy of the storage devices: the database server is aware of the multiple storage devices storing the respective groups of database records, and accesses the appropriate storage device to obtain the database record as required. Hence, databases were designed (i.e., structured) such that the contents of a database were split among a number of storage devices, such as the cylinder drives used by the Honeycomb Drive technology.
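The record split described above can be illustrated with a minimal Python sketch. The device names, the range boundaries, and the function `device_for_record` are hypothetical and chosen only to mirror the example of records 1-50 and 51-100; they are not taken from any particular product.

```python
from bisect import bisect_left

# Hypothetical layout: each device holds one contiguous range of record IDs.
# Upper bounds are inclusive and sorted in ascending order.
RANGE_UPPER_BOUNDS = [50, 100]
DEVICES = ["device_1", "device_2"]


def device_for_record(record_id: int) -> str:
    """Return the name of the storage device that holds record_id."""
    if record_id < 1 or record_id > RANGE_UPPER_BOUNDS[-1]:
        raise KeyError(f"record {record_id} is outside the stored ranges")
    # bisect_left finds the first range whose upper bound is >= record_id.
    return DEVICES[bisect_left(RANGE_UPPER_BOUNDS, record_id)]
```

In this sketch the mapping depends only on the record number and the physical layout, not on which client requests the record or how often.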
Another technique for splitting the contents of databases, particularly relevant to relational databases, involves dividing tables and then distributing the divided tables across different machines based on content. For example, if a database has a number of records that exceeds the capacity of a single storage device, or if the received database queries exceed the capacity of a single database server machine, the database tables may be split among multiple database server machines, where the first machine has tables associated with a first attribute (e.g., employee identifier attributes), and the second machine has tables associated with a second attribute (e.g., salary information). One of the columns will specify which storage device stores the associated data from that table, enabling a machine executing a database server to locate the data associated with the record. Hence, a SQL query can be sent to multiple machines in order to reconstruct the requested data according to the SQL query. In addition, two tables can be created separately, and then joined at runtime in response to a SQL query.
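As a rough illustration of tables divided by content and joined at runtime, the following Python sketch uses two in-memory SQLite databases to stand in for two database server machines. The table names, column names, sample rows, and helper functions are all assumptions made for the example.

```python
import sqlite3


def make_machine(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one server machine."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical machine A holds the employee identifier attributes.
machine_a = make_machine(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO employee VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
# Hypothetical machine B holds the salary information.
machine_b = make_machine(
    "CREATE TABLE salary (emp_id INTEGER PRIMARY KEY, amount INTEGER)",
    "INSERT INTO salary VALUES (?, ?)",
    [(1, 120), (2, 135)],
)


def employees_with_salary():
    """Query both machines, then join the partial results on emp_id."""
    names = dict(machine_a.execute("SELECT emp_id, name FROM employee"))
    salaries = dict(machine_b.execute("SELECT emp_id, amount FROM salary"))
    shared_ids = sorted(names.keys() & salaries.keys())
    return [(emp_id, names[emp_id], salaries[emp_id]) for emp_id in shared_ids]
```

Each call to `employees_with_salary` transfers data from both machines before the join can be performed, which is the cost the remainder of this description is concerned with.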
In addition, distributed database technology includes optimizing queries (e.g., relational queries) that require joining tables from different storage devices, where the processing of the query is optimized by having different portions of the query processed on different machines, and those machines transfer the results of their respective portions of the query to another machine that compiles the respective results.
However, the need to perform queries by accessing tables on multiple remote storage devices and forwarding the data to different machines can be costly in terms of network bandwidth and latency, especially for high-frequency queries that repeatedly access remote data. Moreover, as data records become distributed across a larger number of storage locations, global queries suffer a decrease in performance because the database server must access each and every storage location in order to complete a global query, even if the global query targets only a substantially small subset of the data records, for example selecting all records where a specified element has a specific value (or range of values) specified in the global query.
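The cost of such a global query can be sketched in Python as follows. The class `StorageLocation`, the function `global_query`, and the sample records are hypothetical; the sketch only counts how many storage locations are contacted, and does not model bandwidth or latency.

```python
from dataclasses import dataclass


@dataclass
class StorageLocation:
    """Hypothetical storage location that counts how often it is accessed."""
    name: str
    records: list
    reads: int = 0

    def scan(self, predicate):
        # Every scan counts as one remote access, even if nothing matches.
        self.reads += 1
        return [record for record in self.records if predicate(record)]


def global_query(locations, predicate):
    """Select matching records by contacting each and every location."""
    results = []
    for location in locations:
        results.extend(location.scan(predicate))
    return results


# Sample data: only one record across four locations matches the query.
locations = [
    StorageLocation("site_1", [{"id": 1, "dept": "sales"}]),
    StorageLocation("site_2", [{"id": 2, "dept": "legal"}]),
    StorageLocation("site_3", [{"id": 3, "dept": "sales"}]),
    StorageLocation("site_4", [{"id": 4, "dept": "audit"}]),
]
matches = global_query(locations, lambda record: record["dept"] == "legal")
```

Although `matches` contains a single record, all four locations are accessed once, so the number of remote accesses grows with the number of storage locations rather than with the size of the result.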