Today, much information is digitized, stored on disk, and accessed by database. These databases include on-line transaction processing (OLTP) databases and data warehouses. OLTP databases are used for handling transactions, such as bank transactions for deposits. Data warehouses are used for decision support, such as analyzing market trends in retail stores. Data warehouses can be huge, more than a few million megabytes. For example, a huge data warehouse can store all of a company's financial transactions and employment records so that an analyst can use the data warehouse to answer potentially any question about the company.
On the other hand, a data warehouse can be a small "data mart," holding just enough information for answering questions in a specific area. A salesperson, for example, may use a data mart stored on a lap top computer in making sales. The salesperson's data mart in this example would contain information, copied from a larger data warehouse, about that salesperson's customers.
When building data warehouses and data marts, there is often a need to transfer a large amount of data between databases. For example, a data warehouse storing information about a company's finances periodically needs data from an OLTP database. When "feeding" data from an OLTP database into a data warehouse, the data is usually first moved from the OLTP database into a temporary database and cleaned up. One kind of clean up operation is to transform all dates into a common format. As another example, it is desirable to build departmental data warehouses and integrate them later into a corporate data warehouse. In yet another example, it is desirable to transfer information from a data warehouse to a data mart on a lap top computer or onto a CD-ROM. Moving data between databases is useful for other purposes as well.
Copying an entire database is reasonably fast by using standard operating system utilities to make an exact, "binary" copy of all the files in the database. However, making an exact copy of a database is not so useful for building data warehouses, because OLTP databases and data warehouses are typically not identical. As a result, database owners prefer to incorporate new information into their existing databases, letting that information become a subset of the existing database, not a separate database.
Moving subsets of data between databases is a slow process. One cannot simply copy a subset of files in one database into another database and expect all the data to be integrated into the other database automatically because of the internal structure of databases. Specifically, two aspects of the internal structure of a database create difficulties: absolute database pointers in datafiles and metadata information in the database dictionary.
Some databases, such as relational databases, store their information in indexed tables. Thus, a relational database contains two fundamental kinds of objects: tables, which contain the user data, and non-table objects. A non-table object can be an index, which contains database pointers to the table upon which the index was built.
An object is stored in a single operating system file, called a datafile. When a datafile is added to a database, it is assigned an "absolute file number" which is unique throughout the entire database. The absolute file number of a datafile is an index of the corresponding entry in a control file, which contains operating system specific information about that datafile, such as the operating system file name.
A table object, which houses user data, is divided into one or more data blocks, containing one or more records (called "rows"). Rows contain one or more columns, which contain the specific information, such as a customer's name or automobile part number, the user stores. Since every data block belongs to some datafile, it is natural to identify each block by the absolute file number to which the data block belongs and by the file offset of the data block. The absolute file number provides an easy way to locate the operating system specific information necessary to open the datafile, and it is efficient to access a data block within an opened datafile by the file offset.
As a result, the combination of the absolute file number and the file offset as an absolute data block address ("absolute DBA") is a natural way to make a reference to any particular data block within the database system. An "absolute disk pointer" is a data structure that stores an absolute DBA as a pointer to a data block. Thus, an absolute disk pointer comprises an absolute file number and a file offset.
With reference to FIG. 2, database 200 contains ten objects: data dictionary 240; control file 242; five tables of user data, 250 to 258; and five index files, 260, to 268, built upon tables 250 to 258, respectively. Each object is stored a datafile, 210 to 232, and assigned a unique absolute file number (AFN) of 1 to 12, respectively. Data item 270 is found in table 250, stored at offset 300 in datafile 212 having an AFN of 2. Index file 260 built on table 250 contains an absolute disk pointer 280 pointing to data item 270 with an absolute DBA of 2:300. Given a disk pointer with an absolute DBA of 2:300, the corresponding data item is fetched by looking up the AFN in control file 242 to find the name of datafile 212 and other operating system specific information. With that information, datafile 212 is opened, and the block at offset 300 is retrieved. Similarly, absolute disk pointer 282 in index 268 has an absolute DBA of 11:200, pointing to data item 272 in table 258.
Two indexes may be built on the same table. For example, index 266 is also built on table 250 and includes disk pointer 284 pointing to data item 270. Disk pointer 280 and disk pointer 284 contain the same absolute DBA of 2:300 and consequently point to the same data item, data item 270.
A drawback to the use of absolute disk pointers is that it complicates the transfer of a group of datafiles between two databases. Every absolute file number within a database is unique, but absolute file numbers are not unique between two different computer databases. Thus, an absolute disk pointer in one computer database may contain the same absolute file number as an absolute disk pointer in another database, but the datafiles referenced by the two absolute disk pointers are completely different. As a result, the copied disk pointer must be fixed (or "patched") to point to the correct location.
For example, with reference to FIG. 2, database 200 is a database containing absolute disk pointer 270 having an AFN of 2, which according to control file 242 of database 200 is datafile 212. However, an AFN of 2 in database 300, shown in FIG. 3, is also in use, in absolute disk pointer 380, pointing to datafile 312. Therefore, the meaning of absolute DBA 2:300 is different in database 300 than in database 200. If table 250 and index file 260 are copied from database 200 to database 300, to produce database 400 shown in FIG. 4(a), then unchanged absolute disk pointer 280 incorrectly points to data item 370 instead of data item 270. Datafile 212 is assigned a new AFN, in order to be unique, since 2 was already in use for datafile 312. Consequently, absolute disk pointer 280 must be patched to be 5:200 to properly point to data item 270, or absolute disk pointer 280 would be aliased to the wrong data item.
Another reason why absolute disk pointers need patching is that it is possible for a copied absolute disk pointer to be invalid in the destination database. For example, copying datafiles 230 and 232 from database 200 to database 300 results in database 402 of FIG. 4(b). Absolute disk pointer 282 has an AFN of 11, but there is no datafile with that AFN in control file 342. Therefore, copied absolute disk pointer 282 is invalid in database 402. In this time, absolute disk pointer 282 must be modified to have an AFN of 5.
Since it is typical to have a very large number of absolute disk pointers within a datafile, it is problematic to use binary copies of datafiles within another database. Within the transferred datafile, every absolute disk pointer must be patched to avoid re-using an active absolute file number in the other database. Patching absolute disk pointers involves inspecting every part of every datafile for disk pointers and fixing each disk pointer. For large databases, this procedure becomes very time-consuming and may even be impossible if a database cannot recognize or enumerate all the disk pointers in datafiles. Thus, it is desirable to avoid absolute disk pointer patching when transferring a group of datafiles between databases.
Another issue that complicates the transfer of a subset of a database is the problem of integrating the metadata from a source database for that subset to a target database. Metadata is information about data in a database, such as data dictionary information about what tables are in the database, what columns a table has, what indexes have been built for a table, and so forth. Metadata describes objects in the database, and these objects may contain disk pointers.
A subset of a database may be self-contained; i.e. all the objects in the subset refer to only other objects in the same subset. For example, indexes in a self-contained subset point only to tables in the same subset. Thus, in the first example, datafiles 212 and 214 constitute a self-contained subset of database 200 because index 260 in datafile 214 was built upon table 250 in datafile 212. When a subset is self-contained, the metadata can be reconstructed without pointing to non-existent data.
On the other hand, if a database subset is not self-contained, there are "dangling" disk pointers in the transferred database objects. For example, there may be an index in the subset pointing to a table outside of the subset. Referring to FIG. 4(c), database 404 is the result of copying a database subset consisting of datafiles 224 and 226 from database 200 into database 300. Since index 266 in datafile 226 is an index file built on a table outside the subset, namely table 250 in datafile 212, index 266 contains dangling disk pointers. Naively reconstructing metadata for the objects of a non-self-contained subset and enabling all indexes in the subset will lead to having objects that point to non-existent data.
As a result, the conventional solution is to avoid patching absolute disk pointers and integrating metadata by exporting tables in their entirety into a high-level, data description language (DDL) format that does not employ pointers or separate metadata at all. In exporting a table, the database system generates a series of high-level statements that describe how to recreate the entire table and the associated metadata in the destination database.
When the exported data is imported by a destination database, the tables and metadata are reconstituted by executing the exported DDL statements through a high-level interface, such as a SQL layer. Export and import are two common, operations nearly universally supported. While importing and exporting groups of tables are functional, they are very slow. For example, importing a gigabyte (a thousand megabytes) of exported data can take a few days to complete. Therefore, there is a need for a means to rapidly transfer data between databases.
Export and import procedures also make publishing data for databases on CD-ROM less useful. If a CD-ROM is manufactured with a group of exported tables, the data stored on the CD-ROM must be imported into the target database in a time consuming operation. Moreover, the import procedure makes a copy of the information, reducing the limited amount of persistent memory storage, such as hard disk space. It is desirable to be able to use the data on a CD-ROM directly without having to import it.