1. Field
Embodiments of the invention relate to data processing systems. More specifically, embodiments of the invention relate to correction of erroneous data in a database.
2. Background
In data warehousing, data is typically copied from transaction systems and restructured to facilitate querying and reporting of the data. When different systems are involved, errors such as incomplete, incorrect or inconsistent data may occur. Identifying and cleaning up such errors requires significant manual work performed on the database using structured query language (SQL) statements. This requires users to have significant knowledge of SQL and tends to be both time and labor intensive.
Outside the database domain, some tools exist for large scale maintenance of incomplete or inconsistent data. In particular, semantic nets are structures that, in contrast to database structures, are rather flexible and do not generally enforce compliance with predefined schemes. Instances of the same entity type can have different kinds of attributes, and the cardinalities of relationships can vary. Therefore, semantic nets can be considered an extreme example of potentially incomplete and inconsistent data. A semantic net is composed of nodes and relationships between these nodes. The maintenance tool for semantic nets uses the following concepts for mass manipulation of data:

a "bag" (set) of nodes in the semantic net. The bag object has two special features: it can evaluate which relations all nodes in the set have in common, and it can determine which relations differentiate subgroups of nodes in the set from each other ("dynamic grouping"). Furthermore, the bag allows the mass maintenance of relationships for all nodes in the bag;

a "filter," which selects from a set those nodes that have certain relations;

a related entity processor that, starting from a set of nodes, collects the set of nodes that are of a certain category and are related to the nodes in the set by a given relation type; and

a macro recorder that allows sequences of manipulations of bags and node relationships to be stored.
These features make it possible to detect and correct the inconsistencies and gaps that inevitably occur in a manually maintained semantic net.
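The four concepts listed above can be illustrated with a minimal Python sketch. It is not taken from the semantic net tool itself; the names `Node`, `Bag`, `related_entities`, and `MacroRecorder`, as well as the sample data, are hypothetical and chosen only to make the concepts concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical semantic net node, identified only by its name."""
    name: str
    category: str
    # relation type -> set of target node names
    relations: dict = field(default_factory=dict)

    def relation_pairs(self):
        return {(rel, tgt) for rel, tgts in self.relations.items() for tgt in tgts}


class Bag:
    """A set of nodes that supports mass evaluation and mass maintenance."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def common_relations(self):
        # Relations that every node in the bag has.
        pair_sets = [n.relation_pairs() for n in self.nodes]
        return set.intersection(*pair_sets) if pair_sets else set()

    def differentiating_relations(self):
        # "Dynamic grouping": relations held by only some nodes,
        # mapped to the names of the subgroup that holds them.
        common = self.common_relations()
        groups = {}
        for n in self.nodes:
            for pair in n.relation_pairs() - common:
                groups.setdefault(pair, set()).add(n.name)
        return groups

    def add_relation(self, rel, target):
        # Mass maintenance: add one relation to every node in the bag.
        for n in self.nodes:
            n.relations.setdefault(rel, set()).add(target)

    def filter(self, rel, target):
        # Filter: keep only the nodes that have the given relation.
        return Bag(n for n in self.nodes if target in n.relations.get(rel, set()))


def related_entities(bag, net, rel, category):
    # Related entity processor: follow `rel` from each node in the bag
    # and collect the targets that belong to `category`.
    names = {t for n in bag.nodes for t in n.relations.get(rel, set())}
    return Bag(net[t] for t in sorted(names) if t in net and net[t].category == category)


class MacroRecorder:
    """Stores a sequence of bag manipulations so it can be replayed."""

    def __init__(self):
        self.steps = []

    def record(self, method, *args):
        self.steps.append((method, args))

    def replay(self, bag):
        for method, args in self.steps:
            result = getattr(bag, method)(*args)
            if isinstance(result, Bag):  # filters yield a new bag
                bag = result
        return bag


# Hypothetical sample net.
net = {
    "Berlin": Node("Berlin", "city", {"located_in": {"Germany"}}),
    "Munich": Node("Munich", "city", {"located_in": {"Germany"},
                                      "capital_of": {"Bavaria"}}),
    "Germany": Node("Germany", "country"),
    "Bavaria": Node("Bavaria", "state"),
}
cities = Bag([net["Berlin"], net["Munich"]])
common = cities.common_relations()
diff = cities.differentiating_relations()
countries = related_entities(cities, net, "located_in", "country")

recorder = MacroRecorder()
recorder.record("add_relation", "uses_currency", "Euro")
recorder.record("filter", "capital_of", "Bavaria")
result = recorder.replay(cities)
```

In this sketch every operation works on in-memory node objects looked up by name, which mirrors the behavior described for the existing tool.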
Unfortunately, these features cannot be directly applied in a data warehousing context. Specifically, the semantic net tool does not distinguish between a key and a human-readable name. Nodes are identified only by name, which fails as soon as two or more nodes have the same human-readable name. Additionally, semantic nets have no concept for evaluating attribute values. Finally, the existing semantic net tool performs all mass operations on nodes within main memory. This is memory intensive and is likely to fail in almost all large data warehousing cases due to insufficient memory.
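The name collision problem can be shown with a short Python sketch. The record keys and names below are hypothetical; the sketch only illustrates why identification by name alone loses records that a separate key would keep apart.

```python
# Hypothetical warehouse records: (key, human-readable name).
records = [("C001", "John Smith"), ("C002", "John Smith")]

# Identification by name only: the second record overwrites the first.
nodes_by_name = {}
for key, name in records:
    nodes_by_name[name] = key

# Identification by key: both records remain distinguishable.
nodes_by_key = {key: name for key, name in records}

name_count = len(nodes_by_name)
key_count = len(nodes_by_key)
```

Here `name_count` is 1 while `key_count` is 2, so one of the two records is no longer addressable when the name serves as the only identifier.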