1. Field of the Invention
The present invention generally relates to database management, and more particularly to a database management system and method that cleanse enterprise data stored in a plurality of databases.
2. Brief Description of the Related Art
Storage of information in a storage medium is generally facilitated by using a database in conjunction with a database management system (DBMS). A database is a collection of related data that may be stored on a nonvolatile memory medium. Data in the database may be organized in a two-dimensional row-and-column form called a table. Databases typically include multiple tables.
A table is an object in the database having at least one record and at least one field within each record. Thus, a table may be thought of as an object having a two-dimensional record-and-field organization. A record is a row of data in the table that is identified by a unique numeric called a record number. A field is a subdivision of a record to the extent that a column of data in the table represents the same field for each record in the table. Each field in a record is identified by a unique field name, which remains the same for the same field in each record of the table. Therefore, a specific datum in a table is referenced by identifying a record number and a field name.
A database management system (DBMS) is a control system that supports database features including, but not limited to, storing data on a memory medium, and retrieving data from the memory medium. Data in the database is typically organized among a plurality of objects that include tables and queries. An individual table or query may be referred to as a record source since it is a source of data or records from the database. A query object is an executable database interrogation statement, command, and/or instruction that communicates to the database management system the identity and location of data being extracted from the database. The product of an executed query is called a result set. The result set may be stored and/or manipulated as a two-dimensional object similar to the table discussed previously.
A relational database is a commonly used database type that is managed by a database management system. Data in a relational database is distributed among multiple record sources that are typically related, or normalized, in a manner designed to minimize redundant data in the database, minimize the space required to store data in the database, and maximize data accessibility. Record sources in the database may be related to each another via key fields. A normalized database is one in which each record source in the database is directly related to at least one other record source in the same database by key fields.
The key field can be a primary key or a foreign key. A primary key is a field or combination of fields in a record source that includes unique data for each record in the table. A foreign key is any non-primary key in a record source that is the basis for a direct relation with any other record source. A database remains a relational database regardless of the degree of normalization that exists. Record sources in a normalized relational database are typically related. However, a relational database may be normalized even if the database is disconnected; in that at least one record source in the database is not related to any other record source by a key field.
Relationships between any two record sources in a relational database may be either direct or indirect. Such a relationship may also be referred to as a relation or join. A direct relationship exists between two record sources if there is no intervening record source in the relationship path between them. An indirect relationship exists if there is at least one intervening record source in the relationship path between two record sources.
The record sources in a relational database and the relationships between them define the geography of a database, which may be called the database schema. A sub-schema of the database is any subset of the full database schema, which is defined by a query, a result set of a query, or any other subset of record sources from the database. A database schema and database sub-schema may be displayed visually in graphic form as a graph having edges or arrows representing relationships between record sources, and vertices, also known as nodes or tables, representing the record sources at either end of a relationship.
Queries are used to access data in a database. The query may be constructed in accordance with the Structured Query Language (SQL), which may or may not be based on the American National Standards Institute (ANSI) standard SQL definition. To access data in a database, a user may construct a query using an SQL. Executing a query is called a join or joining wherein each relation identified in the query is joined during execution to retrieve the desired data from the database.
Metadata provides information concerning data in the database, such as how, when, and by whom the data is collected, and/or how the data is formatted. Metadata is important for understanding information stored in data warehouses and XML-based Web applications.
Data warehouses are collections of data that are adapted to support management decisions, such as an assortment of data providing a coherent view of business conditions at any given time. The development of a data warehouse includes development of systems to extract data in operating systems and installation of a warehouse database system to provide flexible access to the data. Data warehousing refers to the linking of different databases across an enterprise. An enterprise generally refers to any large business organization.
Enterprises are often forced to use and maintain a multitude of independent databases that may represent different views of the same or similar information. For instance, in the telecommunications industry, information concerning a particular network is often stored in various databases, which provide different organizations or views of the data taken at different times. Each of these databases is generally tailored to be most useful to the individual or department, such as sales, billing, ordering, or provisioning, intended to access them
Over a period of time, information in these databases loses synchronization with, for instance, the physical network, which may be represented by the network database. The databases may also become inconsistent with each other through, for instance, failing to perform updates in a timely manner or simply not having the information available to revise a particular database at any given time. In the past, these inconsistencies were primarily corrected by manually browsing and fixing individual errors. However, this has proven to be a very inefficient process that is impractical to perform at regular intervals.