A distributed database is a database in which partitions of data are distributed across discrete computational resources. The distributed database may be implemented in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other networks.
Databases require users to model and load data before the data can be queried. This changed somewhat with PostgreSQL, which recently introduced the notion of foreign tables. With this change, users can now define an extension to read data from a semi-structured foreign data source. The on-line encyclopedia Wikipedia characterizes semi-structured data as a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. (See, http://en.wikipedia.org/wiki/Semi-structured_data).
Foreign table extensions transform semi-structured data into a tabular format, which can then be queried using a standard query language, such as Structured Query Language (SQL). The following general code format may be used to create a foreign table:
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [{column_name data_type [ NULL | NOT NULL ] }[, ... ][ )SERVER server_name[OPTIONS (option ‘value’ [, ... ] ) ]
This code uses the following parameters. “IF NOT EXISTS” specifies not to throw an error if a relation with the same name already exists. “table_name” is the name of the table created. “column_name” is the name of the column to be created in the new table. “data_type” is the data type of the column. “NOT NULL” indicates that the column is not allowed to contain null values. “NULL” indicates that the column is allowed to contain null values. “server_name” is the name of an existing server for the foreign table. “OPTIONS (option ‘value’ [, . . . ])” specifies options to be associated with the new foreign table. The allowed option names and values are specific to the foreign data wrapper that is associated with the foreign table. These options are validated using the foreign data wrapper's validator function.
The following foreign table declaration instantiates a table called “music”:
CREATE FOREIGN TABLE music [codechar (5) NOT NULL,titlevarchar (40) NOT NULL,didinteger NOT NULL,date_proddate,kindvarchar (10),leninterval hour to minute)SERVER music_serverOPTIONS (format ‘text’, filename ‘/home/user/music.txt’, delimiter ‘:’,null ‘’);
This table creation command simply stores a foreign table's metadata in the database, where these metadata include the column names, column types, and column options. This particular command also associates a text file on disk with the newly created foreign table, and specifies the options needed to parse the data in this text file.
Once the music table is instantiated, the text file's contents can be queried without actually loading data into the database. Rather, the data persistently remains in its semi-structured format. The foreign table is merely instantiated to support SQL queries.
Thus, one can define an extension for Apache log files, write an extension to parse the log files at query-time, and can run queries on the data. Users can also write extensions to access data on the Internet or other databases.
The problem is that users can only associate one foreign data source with one table. So if the user receives 24 (hourly) Apache log files per day, then the user needs to create 24 foreign tables. Thus, query processing becomes difficult. Further, dozens of web servers may be generating Apache log files in parallel; and processing these log files typically requires resources that extend beyond the capabilities of a single database server.
In view of the foregoing, it would be desirable to provide efficient techniques for operating a distributed database with foreign tables. More particularly, it would be desirable to provide techniques for query language queries of semi-structured data in a distributed database.