1. Field of the Invention
This invention relates generally to databases. More particularly, this invention relates to a method for loading data into a database.
2. Background and Related Art
Every database management system is based on a general database model. The following are examples of well-known database models: the hierarchical model, the network model, and the relational model. A database management system based on the relational model may be referred to as a relational database management system (RDBMS). An RDBMS is a system of computer programs that facilitates the creation, management, and manipulation of relational databases.
Every relational database is based on the relational model. The relational model is familiar to one of skill in the art. The book "An Introduction to Database Systems", by C. J. Date (Addison-Wesley Publishing Company) provides an in-depth guide to the relational model, and hereby is incorporated in its entirety by reference. An example of an RDBMS is DB2, which commercially is available through International Business Machines Corporation.
According to the relational model, data is perceived to exist as a collection of relational tables. A relational table expresses a relation between things. Relational tables are characterized by rows and columns. Although the rows and columns of relational tables may be employed in many ways, the relational model provides that columns pertain to entities or attributes of entities, and that rows pertain to specific instances of entities or specific instances of attributes of an entity.
The rows and columns of a relational table intersect to define data cells. In this discussion, the term record may be used to refer to a row; the terms attribute and field may be used to refer to a column.
Although the structure of the relational model provides for tables, rows, columns, and cells, a certain hierarchy may be observed within the model. That is, a relational database comprises one or more tables; each table comprises one or more rows; each row comprises one or more cells. Thus, the relational model defines four adjacent layers of hierarchy: databases, tables, rows, and cells. The tables layer is the next higher layer of the rows layer. The cells layer is the next lower layer of the rows layer. The tables layer is adjacent the rows layer, but is not adjacent the cells layer. Moreover, a given table may be referred to as an instance of the table layer, a given row as an instance of the row layer, and so on.
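The hierarchy of adjacent layers described above can be pictured with a minimal sketch. The class names, the LAYERS list, and the are_adjacent helper are illustrative assumptions and are not taken from any particular RDBMS.

```python
from dataclasses import dataclass, field
from typing import Any, List

# Hypothetical model of the four layers: database > table > row > cell.

@dataclass
class Cell:
    value: Any

@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)

@dataclass
class Table:
    name: str
    rows: List[Row] = field(default_factory=list)

@dataclass
class Database:
    name: str
    tables: List[Table] = field(default_factory=list)

# Layers listed from highest to lowest; neighbors in this list are adjacent.
LAYERS = ["database", "table", "row", "cell"]

def are_adjacent(layer_a: str, layer_b: str) -> bool:
    """Return True when the two named layers are next to each other."""
    return abs(LAYERS.index(layer_a) - LAYERS.index(layer_b)) == 1

# One instance of each layer, nested as the text describes.
db = Database("library", [Table("books", [Row([Cell("Date"), Cell(1975)])])])
```

Under these assumptions, are_adjacent("table", "row") is true, while are_adjacent("table", "cell") is false, which matches the adjacency relationships stated in the text.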
Although the relational terminology of tables, rows, columns, and cells is used throughout this description, one of skill in the art will appreciate that the concepts presented herein may be applied outside of the relational model to great advantage. In particular, the concepts are applicable in any database environment in which the data model similarly includes a hierarchy of adjacent layers.
Each column of a relational table has a respective datatype. The datatype of a column restricts the values that the cells of that column may hold. For instance, a traditional datatype for a column of a relational table is the integer datatype. If a column has the integer datatype, the cells of that column may have only integer values. Variations on the integer datatype include the small and the large integer datatypes. The small integer datatype is so named because it conventionally is limited in length to half of a word. The large integer datatype, by contrast, may be allocated two words.
Other traditional datatypes include packed decimal, floating point, fixed length character, and variable length character datatypes. As is the case with the integer datatype, variations exist with respect to the other datatypes. Some special purpose variations of the traditional datatypes include logical, money, date, and time.
RDBMSs recently have been improved to provide support also for some nontraditional datatypes. Some supported nontraditional datatypes include images, video, fingerprints, large objects (LOBs), and audio. In other words, a cell of a relational table may now contain data that is an image, a video segment, a fingerprint, text of great length (such as a book), or an audio segment. Thus, the columns of a relational table now may have nontraditional datatypes as their respective datatypes. Other nontraditional datatypes either presently are or soon will be supported. Examples of other nontraditional datatypes are spreadsheets, lists, and tables, to name but a few.
Applications programs access the data of relational tables by making calls to a database server. Used in this sense, the term "applications programs" may refer to several separate programs, only one program, a module of a program, or even a particular task of a module. An applications program may be written by an applications programmer. Applications programmers develop applications programs using any of a number of programming languages. During development and design of applications programs, applications programmers may adhere to a programming methodology. A programming methodology is a set of principles by which analysis is performed and by which design decisions are made. Programming methodologies may be referred to as programming paradigms. Examples of widely-known programming paradigms include the top-down, the data-driven, and the object oriented (OO) programming paradigms.
Turning now to consider the data, instead of the database, it may be observed that information in many organizations is held in digital form in repositories which are not part of the same data library, the same computing systems or even the same administrative domain. This has hampered access to the information held in those separate repositories, even though the information held separately may be related. For example, an organization may have information residing in completely different data processing systems. These different data processing systems may be in place as a result of combining previous projects, or because of mergers or acquisitions of companies having different data processing systems. It is a common occurrence that valuable data resides and is used in separate and distinct libraries, computing systems or administrative domains.
A problem many such organizations face is that information held in such heterogeneous data stores may, in the minds of people within the organization, be related conceptually. Such data, however, remains unrelated at a data processing level. In other words, the information in one database is not accessible along with the information in another database. Hence, that information can be difficult to handle, and its full value remains unrealized until the unrelated data is joined. Collected into carefully managed records, such information is at the core of what it means to have a library. If the collection is held in digital format, it is known as a digital library.
A digital library is described in U.S. Pat. No. 5,649,185 to Antognini et al., which is incorporated herein by reference. A digital library uses a database, but also allows application programs, residing on a library client, to interact with the underlying digital library services and hence the underlying database, to store and retrieve information.
One way to add information to a digital library is to incorporate the source information from wherever it occurs into this specialized repository. This way of adding information is the primary subject of the invention.
For the sake of clarity, certain terms will now be discussed. The term target digital library means a digital library or a database that a user is using or desires to use. The target digital library requires data to be in one of a plurality of target data formats. The target digital library typically has many target formats, and might have one target format for each table defined within it.
The term unusable data, or source data, refers to data that is stored in a form not directly usable by the target digital library because it is in a form that does not match one of the plurality of target data formats. Source data is typically available from a source database or a source data store (e.g., a magnetic tape, disk, or the like). The unusable data is said to be in a source form, to have a source format, or to have a source data format. To be usable by the target digital library, the source data may be converted from the source data format to one of the plurality of target formats of the target digital library.
In loading data into a target digital library, a preliminary step is usually to create a dump file. A dump file is often produced by an ASCII dump of the source data from the source database. It will be understood that an ASCII dump is a feature commonly available in nearly every database management system, and in nearly every computer system. For example, data preserved on reels of tape may commonly be dumped to a dump file in ASCII. It will be appreciated that ASCII is here used merely as an example, and that EBCDIC or any other manner of representing data may instead be used. It also will be understood that a dump file need not necessarily be a file stored on a disk, but may include a stream of electronic impulses which are generated and provided to a process without any intermediate storage per se of the data.
A dump file can be of many different formats vis-a-vis how the data is logically separated. In one example of a dump file format, records are separated by one or more separator characters. In another example of a dump file format, there is one record per line. In yet another example of a dump file format, there are multiple records per line. Likewise, fields may be distinguished one from another by separator characters, lines, or the like, and may be fixed or variable in length.
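The dump file layouts just described can be illustrated with a short sketch. The separator characters, field widths, sample data, and function names below are assumptions chosen only for illustration; real dump files vary widely.

```python
from typing import List

def split_records(dump: str, record_sep: str) -> List[str]:
    """Split dump text into records on a separator, dropping empty pieces."""
    return [r.strip("\r\n") for r in dump.split(record_sep) if r.strip()]

def split_fields_delimited(record: str, field_sep: str) -> List[str]:
    """Variable-length fields distinguished by a separator character."""
    return [f.strip() for f in record.split(field_sep)]

def split_fields_fixed(record: str, widths: List[int]) -> List[str]:
    """Fixed-length fields distinguished by position within the record."""
    fields, pos = [], 0
    for width in widths:
        fields.append(record[pos:pos + width].strip())
        pos += width
    return fields

# Hypothetical sample dumps, one per layout mentioned in the text.
one_per_line = "1,Ann\n2,Bob\n"          # one record per line
many_per_line = "1,Ann;2,Bob;3,Cy\n"     # several records per line, ';' between them
fixed_width = "001Ann  \n002Bob  \n"     # fixed-length fields: 3 chars, then 5 chars

records_a = [split_fields_delimited(r, ",") for r in split_records(one_per_line, "\n")]
records_b = [split_fields_delimited(r, ",") for r in split_records(many_per_line, ";")]
records_c = [split_fields_fixed(r, [3, 5]) for r in split_records(fixed_width, "\n")]
```

Each layout needs its own choice of record separator and field rule, which is why a loader must know the format of the dump file before it can read any fields.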
In the target digital library, there are a plurality of target data formats. This plurality of target data formats may number in the hundreds. For the sake of clarity, the target data format that the source data must be converted into shall be referred to as a desired target data format. The selection of the desired target data format will depend on how, logically, the source data is to be included in the target digital library. A term which may be used interchangeably with target data format is the term index class. A digital library thus may be said to include a plurality of index classes.
One approach to working with source data in a source data format that is not one of a plurality of target data formats is to write a custom loader application. In other words, to load the data from a dump file, an application programmer writes a custom loader application. Such a custom loader application must understand the format of the dump file, must read the fields from the dump file, and then must assign the right value from the dump file to that of the desired target data format corresponding to the desired data structure. This assignment must be based on knowledge of the record structure of the target digital library.
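A custom loader of the kind described above might look like the following minimal sketch. The pipe-delimited dump layout, the function name load_customers, and the target field names CustomerId, FullName, and City are all hypothetical; the target is represented as a list of dictionaries rather than a real digital library.

```python
from typing import Dict, List

def load_customers(dump_text: str) -> List[Dict[str, object]]:
    """Hypothetical custom loader for one source format and one target format.

    Knowledge of the dump file format (one record per line, '|' between
    fields) and of the target record structure is written into the code.
    """
    target_rows: List[Dict[str, object]] = []
    for line in dump_text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the dump
        # Read the fields from the dump file record.
        cust_id, name, city = line.split("|")
        # Assign each value to the corresponding field of the target format.
        target_rows.append({
            "CustomerId": int(cust_id),
            "FullName": name.strip(),
            "City": city.strip(),
        })
    return target_rows

rows = load_customers("7|Ann Lee|Oslo\n8|Bob Roy|Rome\n")
```

In this sketch, a dump file with a different separator, a different field order, or a different target record structure would require changes to the function itself.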
A problem with the use of custom loader applications is that many different formats are possible for the source data, there typically are many input files of source data, all in different source data formats, and there are many different target data formats. More particularly, the problem is that many custom loader applications must be written. Writing custom loader applications may be time-consuming, and such applications often are not reusable.
Another approach to working with source data which is in a source data format that is not one of a plurality of target data formats is described in U.S. Pat. No. 5,421,001 to Methe. Methe suggests an improved method of writing custom loader programs. According to Methe, there must be provided a common interface between all of the multiple foreign file formats (i.e., the source data format and the plurality of target data formats). This common interface is to be achieved by translating the elements of the source data format and the plurality of target data formats (which must be known a priori) into what amounts to a third, common format. The Methe approach allows an application programmer to use this common interface and common format for reading and writing in the multiple foreign file formats. In other words, the Methe approach applied to the problem of creating a suitable loader program is to write the software so as to translate the source data format and the plurality of target data formats into a predetermined common format upon opening the dump file, to write statements that manipulate the fields of the records in this common format, and then write statements that translate the data from this common format into the desired target data format(s) for writing into the target digital library.
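The three phases of a loader written along these lines (translate into a common format, manipulate, translate into the target format) can be sketched as follows. This is only an illustration of the general idea under stated assumptions; it does not reproduce the interface of the Methe patent, and the CommonRecord type, the function names, and the field names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical common format: field name mapped to a text value.
CommonRecord = Dict[str, str]

def source_to_common(line: str) -> CommonRecord:
    """Translate one record of an assumed pipe-delimited source format."""
    cust_id, name, city = line.split("|")
    return {"id": cust_id.strip(), "name": name.strip(), "city": city.strip()}

def manipulate(rec: CommonRecord) -> CommonRecord:
    """Manipulation statements written with only the common format in mind."""
    out = dict(rec)
    out["name"] = out["name"].title()
    return out

def common_to_target(rec: CommonRecord) -> Dict[str, object]:
    """Translate from the common format into an assumed target data format."""
    return {"CustomerId": int(rec["id"]), "FullName": rec["name"], "City": rec["city"]}

def load(dump_text: str) -> List[Dict[str, object]]:
    """Run the three phases over every non-blank record of the dump."""
    return [
        common_to_target(manipulate(source_to_common(line)))
        for line in dump_text.splitlines()
        if line.strip()
    ]

loaded = load("7|ann lee|Oslo\n")
```

The manipulation step no longer depends on the source or target layout, yet the choice of which source fields feed which target fields is still fixed in source_to_common and common_to_target.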
The Methe approach allows an application programmer to reduce development time by being less concerned about differing file formats. The application programmer can be less concerned about differing file formats because he can write the data manipulation statements with the predetermined common format in mind. Although the use of a predetermined common format thus may be advantageous over the approach of writing a custom loader application from scratch, the approach is not without its shortfalls.
One problem with the Methe approach is that the application programmer must decide, as he writes the loader program, what component or components of the source data in the source data format are to be read. Likewise, the application programmer must also decide the location or locations of the target digital library (and, correspondingly, the desired target data format or formats) to which the source data, after conversion to the common format, is to be written. These data correspondence decisions thus are statically bound upon the compilation of the program. Thus, adopting the Methe approach makes it impossible to alter these decisions without rewriting the loader program.
The custom loader application approach and the Methe approach both suffer from the drawback that the data correspondence decisions are coded into the loader applications.