1. Field of the Invention
Embodiments of the invention described herein pertain to the field of computer systems. More particularly, but not by way of limitation, one or more embodiments of the invention enable modeling of a master data repository to evaluate the efficiency of the repository schema and to evaluate the accuracy and validity of the data and metadata within the repository.
2. Description of the Related Art
The accuracy and validity of data in a large commercial data repository are of critical importance. However, working with large data repositories often means working with data of questionable accuracy and validity. Such repositories may involve hundreds of tables and millions of records. Data may sometimes be imported from one or more sources into a master repository. Data may originally have been input by many different operators, over varying periods of time, using different input programs with different levels of quality control and error checking. The overall cost to an operation of determining the accuracy and validity of the data in such repositories can be significant, assuming such a determination is possible at all. But the overall cost of relying on inaccurate or invalid data can also be significant.
Furthermore, data imported or input into a large data repository is almost always associated with metadata. Metadata is generally defined as “data about data.” That is, metadata is information that describes the meaning or significance of data. “Columbus” may be data in a database or data repository. While we may guess that “Columbus” is a proper name, we cannot be sure without its associated metadata whether Columbus is the last name of an explorer, a holiday, the name of a book, the subject of a painting, or a city in Ohio, Indiana, or Georgia, just to name a few possibilities. Without accurate metadata the data “Columbus” is at best meaningless and at worst detrimental. For example, if Columbus is a city, but the data is erroneously imported such that it ends up in a database field for “Street Name,” an organization relying on this entry may ship a product to a non-existent address because of invalid data. Or a critical part might be shipped to Columbus, Ohio instead of Columbus, Georgia, if the “City” field held “Columbus” and the “State” field was blank, all because of incomplete data in a master data repository.
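The incomplete-data case described above can be illustrated with a minimal sketch. The record layout, the field names `city` and `state`, the lookup table `AMBIGUOUS_CITIES`, and the function `find_incomplete_addresses` are all hypothetical and chosen only for illustration; they are not taken from any particular repository or from the embodiments described herein.

```python
# Hypothetical lookup of city names that exist in more than one state.
# The contents are illustrative assumptions, not real repository metadata.
AMBIGUOUS_CITIES = {"Columbus": {"Ohio", "Indiana", "Georgia"}}


def find_incomplete_addresses(records):
    """Return the records whose city is ambiguous and whose state is blank."""
    flagged = []
    for record in records:
        # Treat a missing key, None, or whitespace-only value as blank.
        city = (record.get("city") or "").strip()
        state = (record.get("state") or "").strip()
        if city in AMBIGUOUS_CITIES and not state:
            flagged.append(record)
    return flagged


# Hypothetical sample records.
records = [
    {"id": 1, "city": "Columbus", "state": "Ohio"},
    {"id": 2, "city": "Columbus", "state": ""},
    {"id": 3, "city": "Dayton", "state": ""},
    {"id": 4, "city": "Columbus", "state": None},
]

flagged_ids = [record["id"] for record in find_incomplete_addresses(records)]
```

In this sketch only the records that pair an ambiguous city with a blank state are flagged. A record with a blank state but an unambiguous city is left alone, which is a simplifying assumption made for the example.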
Finally, large commercial data repositories are often changed and revised over their lifetimes. Tables and columns may be added, field meanings and purposes may be expanded, and data dictionaries may be edited. As new data sources are married to the repository, the efficiency of the originally well-designed database schema may degrade as well-intentioned maintenance engineers introduce duplicate fields and poor choices in dependencies between fields.
There are currently no known systems that support profiling the generalized hierarchy of large commercial data repositories to evaluate the efficiency of the schema as well as the validity and accuracy of the data and metadata within a master data repository.
For at least the limitations described above, there is a need for a tool for modeling a master data repository to evaluate the efficiency of the repository schema and to evaluate the accuracy and validity of the data and metadata within the repository.