Information integration is a challenging area in information technology. A problem while integrating different schemas is the lack of information about the schemas. Schemas of tables and databases used in applications and/or products can change significantly over a period of time. Product architects often change, and the continuity in schema design is often lost due to such changes.
Schema information is generally documented in design documents which are lost in the maze of documents generated over a period of time. For example, schema information can include multiple design documents per release (one for each phase of development cycle), and multiple types of design document for each cycle. As a result, there is a need for a tool that automatically discovers all relevant design documents and presents the lineage information of a schema to a data administrator.
Finding the correct lineage is difficult because a database schema undergoes significant changes over time. For example, very little similarity exists between the existing schema and the older schemas, as the names of product/component also change over time. For instance, finding a document PES SDD V1R2 (Policy Editor Storage) is a follow up of SM SDD V1R1 (Storage Manager), and is non-trivial. Performing this task manually by reading each document is tedious and disadvantageous.
Finding the correct lineage is also difficult because each component and/or product has hundreds of tables. Each schema can be documented in a different design document of a different release/version. For example, POLICY_STORE can be documented in PES_SDDV2R3, whereas NOTIFICATION_STORE can be documented in AM_SDDV1R2. Also, a person integrating two different schemas cannot be expected to know the location of design information for each of the potentially hundreds of tables.
A simple search of each column and table name in the design document is not enough to find the relevant information, as the name of the table and column name can be written in different ways in the design document. For example, consider a column entitled: Social_Security_Code or S_S_Code. In the design document, the following paragraph may be present for the column:
Social Security Code:Field is fixed, 1 byte field.Values are:1 = Domestic TIN2 = SSN3 = Foreign Business ID4 = NRA5 = Awaiting TINUtilize, if needed, to facilitate handling of SSN field.
Consequently, there is a need for a tool that is able to handle changes in a name of the column which could include, for example, addition of spaces, change in spelling, etc. Existing approaches include, for example, a regular text search will not work effectively as schema name and format changes significantly over time.