The present invention relates to distributed databases and, in particular, to determining common table definitions in distributed databases.
A distributed database is a database in which storage devices are not all attached to a common processing unit. The distributed database may be stored in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks.
A distributed federated database system is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are distributed and interconnected via a computer network and may be geographically decentralized.
Through data abstraction, federated database systems can provide a uniform user interface, enabling users and clients to store and retrieve data in multiple non-contiguous databases with a single query, even if the constituent databases are heterogeneous.
Ad-hoc dynamic distributed federated databases (DDFDs) use a “Store Locally Query Anywhere” mechanism (SLQA) which provides for global access to data from any vertex in the database network. Data is stored in local database tables at any vertex in the network and is accessible from any other vertex using structured query language (SQL) like queries and distributed stored procedure-like processing.
The database comprises a set of interconnected vertices each of which is a federated relational database management system (RDBMS) engine. By federated it is meant that the database engine is able to access internal and external sources as if it was one logical database. External sources may include other RDBMS or any other data source as flat files of data records.
Data can be stored at any vertex in the network of database vertices, with the table in which it is stored being available to other vertices through a logical table mapping which is implemented as a virtual table interface. Queries can be performed at any vertex requesting information from any other vertex in the network. The query propagates through the network and result sets are returned to the querying vertex. The vertices manage the forward routing of queries so as to minimize the number of copies of the same query. Each vertex that forwards the query is responsible for processing the results obtained from vertices to which the query was forwarded, leading to a distributed aggregation of results.
In these federated database networks, a “logical table schema definition” is a unified set of column names and associated types which various physical data source structures may be mapped to (for example, a back end RDBMS, or a spread sheet, or a flat CSV file).
As each database node in a DDFD is independent, the same logical table may be defined differently on various nodes in the network. A challenge arises in determining what definition to use for any given logical table name targeted by a DDFD query, where the query may be issued against any node in the network. The definition determines the query response structure.
For query response consistency, the elected definition would ideally be the same irrespective of the originating node. Further, as different definitions may conflict with each other (e.g., if they have ‘non-matching’ column types for the same column names), an ideal system would resolve ambiguities and apply type promotions as appropriate.
There are three reasons why logical table definitions may differ between nodes: 1) legitimately—if the owner of a certain node adds, removes or modifies certain columns, then other definitions elsewhere may be out of date or just have slightly different data to expose; 2) erroneously—a definition was incorrectly set, which may block access to local data and even disrupt query processing; and 3) maliciously—a definition was purposefully set incorrectly in an attempt to disrupt query processing.
Table definitions are considered to conflict with each other if they have “non-matching” column types for the same column names. Column types may be said to be “non-matching” if they differ in any way at all; or under a more elaborate scheme, they may be said to be “non-matching” if the local column type cannot be promoted to the type expected in the query response (for example, as indicated in the propagated definition).
A table definition matching method may be used where columns are compared by position (instead of by name). Matching by position allows for column re-naming whereas matching by name allows for column-repositioning. It is generally accepted that the latter is preferable in distributed federated databases, because it allows the same queries to be run consistently from anywhere (without having to rename columns) and because it allows nodes to omit columns that exist elsewhere in definitions for the same logical table.