The present invention relates to the field of computers, in general, and to data storage and retrieval systems, in particular.
Database manager programs are known which facilitate the storage and retrieval of data stored on a record medium, such as a Winchester hard disk or a 51/4 inch diskette, in a digital computer system. Particularly as the size of computers and their associated peripheral storage devices is reduced, it is desirable to make efficient use of the available storage area.
The data format in a database normally includes a plurality of data fields of various sizes. The size of each field is set equal to the number of bytes in an item of data for a fixed length item, or equal to the maximum length of an item of data for a variable length item. In cases of variable length data items, or where an item is not always present for a particular field, the data can be compressed by reducing the size of the fields to accommodate only the items of data actually present.
Thus, for example, in an employee record which comprises successive fields for the employee's name, social security number, birthdate, etc., the name field will be of variable length whereas the social security number and birthdate fields will be of fixed length. Of the latter two fields, it may always be necessary to enter a social security number in its respective field; however, it may not be necessary to include the birthdate. Consequently, the data in these respective fields may take the following form:
JOHN E DOE/987654321/012345 PA1 MARY J POPPINS/123456789/000000 PA1 (1) a plurality of data ("D") records, each D record being of variable length and including items for a particular row arranged serially in succession, thereby forming a plurality of data rows and columns with each item of data in each D record arranged in a separate column, there being at least one D record per row of data with a constant number of D records in each row of the table; and PA1 (2) a plurality of column descriptor ("C") records, each C record being associated with one table column and specifying: PA1 (1) Column data within data portions of data records; and PA1 (2) Data inside additional application-specific records.
The first or name field must be long enough to accommodate a name of any length; the second field must be exactly nine characters long and the third field six characters long. In some cases, no data is entered into the birthdate field, as indicated in the example above by the 0's.
When these items of data arranged in these fields are stored, it is desirable to store only the actual useful data. To accomplish this, the database manager must vary the size of the fields in storage so that they are no larger than the actual data requires. Depending upon the data format in storage, this may be a simple, or a very difficult task.
One data format which does not lend itself to straightforward data compression is a so-called "relational database table". A "relational table" is a two dimensional array of data items; that is, an array of data items which form rows and columns. Each row in the table comprises one or more records of information. Each data record comprises one or more fields in which are located the items of data. The fields may have different attributes, such as a social security number, birthdate, etc., or a group of fields may have the same attribute, such as a group of statistical samples for a given parameter. A column in the table is the same field in all of the records. All data items for a column must be of the identical data type (integer, floating point number, fixed length character, variable length graphic, etc.).
Consequently, any specific data item in the table belongs to exactly one row and one column, and can be located as such.
A typical use of a relational table is to describe a collection of homogeneous data objects. A row represents a single instance of an object and the columns are descriptive data attributes about the object. For example, a table of data describing employees could have a row defined for each employee with column values for name, social security number, birthdate, etc.
A "column" in a relational table may therefore be defined as one unit of the vertical dimension of the table. A column normally has a specific data attribute which applies to all data items having the column location. A column, which is analogous to a data "field", is applicable to all rows in the table.
A "row" in a relational table may be defined as one unit of the horizontal dimension of the table. A row contains data items from each column in the table although data items may be missing from some of the row and column locations. All columns are considered to be in the same order for all rows. A row is thus analogous to a "record" in a non-relational database.
There is no presumption of the order of rows within a table. An index can be created by the user to logically order the rows of a table in different ways, each index thus defining a separate ordering. The term "index" may thus be defined as a collection of the columns in a table with ascending or descending order directives. Once defined, an index is internally maintained by the database manager. It is never referenced in user requests for manipulating or retrieving data, but is used internally whenever an access is determined to be faster using an existing index than through a sequential file search.
An index can be created at any time and dropped at any time. There is no requirement that any index exists for a table.
A "base table" is a relational table that is physically stored on a record medium as real data. A "view" is a relational table that does not exist in physical storage but is derived from one or more base tables.
Given this type of data, a "database" may be defined as a collection of relational tables, catalog tables and recovery logs. A database may, for example, be restricted to being within a single OS/2.TM. file system. This restriction ensures all catalog and recovery functions can be performed on any data under its control at any time. Databases may be stored on hard files or on diskettes. Multiple databases can be placed on a single OS/2.TM. file system.
Another term which is necessary to understand for the present invention is the so-called "codepage" environment of a database. A "codepage" may be defined as a translation table or mapping of the set of bytes (of which there are 256) into character representations produced by a printer, display, or other input/output device. Well known codepages include the ASCII (American National Standard Code for Information Interchange) IBM National and International codepages 437 and 850, respectively, as well as the EBCDIC character set (Extended Binary-coded Decimal Interchange Code).
In interchanging data between one database and another, it is necessary to know, and keep track of the codepage of the data so that data integrity may be maintained from one codepage to another, if required. However, the interchange method should also allow the structure that exists on the source database to be created on the target database. This means that the table name, column names, column types and indices must be carried as part of the interchange.