The invention disclosed herein relates to a data repository in which metadata stored in the repository is reused, for example, not replaced by identical metadata when the repository is updated or populated, and in which the criteria governing reuse are accessible and modifiable by a repository user.
In today's environment, most large organizations suffer from too much data and not enough information. An enterprise may have vast information assets in many different applications. These assets often were developed by different teams and may be scattered throughout the enterprise. For example, an enterprise may have a number of heterogeneous RDBMSs (DB2, IMS, Sybase, Microsoft SQL Server, Oracle, Teradata, Informix). This makes it difficult to determine where information is located and how it is being used. Without sufficient accounting, information redundancy, inconsistency and underutilization occur or grow.
Large legacy systems (e.g., COBOL, PL/I, Assembler, C, JCL, SQL) represent still another information asset management problem. These systems were developed over many years; the code is often undocumented, lacks modularity and is replete with complex dependencies. As a result, maintaining legacy systems is typically resource intensive and, in the context of the overall enterprise, the return on the use of the resources is low.
In large enterprises, it is not unusual for CASE analysts and programmers to work simultaneously on similar projects, sometimes without knowledge of one another's work, which can result in costly duplication of effort and a lack of universal standards. This is yet another information management problem across an enterprise.
The data repository addresses all of these problems.
Data warehousing similarly presents the need for information about an organization's vast stores of data in order to make the data more accessible and usable. Through scrubbing, migrating, recalculating, translating, and other data manipulation efforts, these projects can offer information that was previously unavailable. In many cases, however, data warehousing efforts lack a metadata directory that allows information workers to find and understand this new data. While a business may spend thousands or millions of dollars creating a data warehouse, without descriptions of the available corporate data this effort is futile. The information workers must understand what data is available and where they can find it. The data repository provides this capability.
A data repository is a shareable collection of information supporting an organization's or enterprise's data and data processing functions. The data repository stores metadata, i.e., data about data, including data containing descriptions and definitions of the structure, content, keys, indices, etc. of other data, applications and systems components. A data repository may also be a place of record for other information assets, such as queries, reports, and documents that benefit from sharing. The metadata for generally maintaining, running, analyzing, modifying and updating data and applications may be used and shared for many purposes, including, to name a few, data modeling, tool development, application development, CASE support, data warehousing support, impact analysis, and data resource management.
Among the benefits that a repository provides are the following.
(1) Makes the application development process more efficient through: better utilization of existing resources (no one person needs to be skilled at all implemented databases and tools in an organization); identification of data redundancies (data rationalization); establishment of naming standards (data rationalization); identification of opportunities for reuse of standard definitions across disparate tools; provision of impact analysis (assessing where data is used before implementing a change); support of data sharing (one place to share multiple modeling efforts (e.g., CASE) and consistent definitions); provision of an inventory of existing data structure models (for example, those that contain year 2000 affected fields); and leveraging of the investment in legacy systems through the ability to inventory and utilize existing application information.
(2) Offers an opportunity for a central point of control for data through: providing a place of record for data about an organization's information assets (like a card catalog for all data); providing a universal relational model for heterogeneous RDBMSs to interact and share information, thus providing an unlimited opportunity for data documentation (far more than a typical DBMS); supporting model extensibility to include metadata definitions from non-traditional sources; and recording data ownership/stewardship.
(3) Leverages data across the enterprise so an organization can be more competitive by: understanding data properties, uses, locations, characteristics, owners, and relationships; and providing organization and documentation of a data audit trail (where data comes from and how it is changed) so information can be repeatedly derived with consistent results.
(4) Aids in the downsizing and/or migration to client/server architecture by: assessing and documenting what data is available today before initiating future data architectures and migrations; facilitating the migration process by generating physical data structures from existing data definitions.
(5) Assists with data warehousing by acting as a guide to warehouse sources, targets, and data translations (relationships between elements), and by capturing versions of warehouses, i.e., snapshots of a moment in time.
A data repository may be populated with database metadata by scanning the database's code, identifying data to be imported into the repository, preparing metadata and loading the metadata into the repository. Since data existing in the repository prior to a load may be identical to data obtained from a database scan, the identical data was reused, i.e., the scanned identical data was not loaded into the repository. Without such data reuse, (1) a data load took much more time than necessary and, in a repository for a large number of databases and applications, could make the repository unworkable; and (2) identical data could be stored multiple times, which spawned still further problems related to memory waste and version control.
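By way of illustration only, the load-with-reuse step described above may be sketched as follows. The record layout (`MetadataRecord`), the use of an in-memory set as the repository, and the function name `load_with_reuse` are assumptions made for this sketch and are not taken from any particular repository product; here "identical" is assumed to mean that every field of the record matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """Hypothetical metadata record produced by a database scan."""
    entity_type: str  # e.g. "TABLE" or "COLUMN"
    name: str
    definition: str


def load_with_reuse(repository: set, scanned: list) -> tuple:
    """Load scanned records, skipping any identical to one already stored.

    Returns (number loaded, number reused).
    """
    loaded = 0
    reused = 0
    for record in scanned:
        if record in repository:
            # Identical metadata already exists: reuse it, do not reload.
            reused += 1
        else:
            repository.add(record)
            loaded += 1
    return loaded, reused


# Example: one record already exists in the repository, one is new.
repo = {MetadataRecord("TABLE", "CUSTOMER", "id, name")}
scan = [
    MetadataRecord("TABLE", "CUSTOMER", "id, name"),
    MetadataRecord("TABLE", "ORDERS", "id, total"),
]
result = load_with_reuse(repo, scan)
```

In this sketch the identical `CUSTOMER` record is counted as reused and only the `ORDERS` record is added, so the repository holds each definition once.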
Data reuse in other applications by caching is described in U.S. Pat. Nos. 5,511,208 and 5,404,488, UK patent publication GB 2,294,132A and International Publication WO 95/16956.
While a data repository provides many benefits, for example as described above, prior repositories, without the invention disclosed herein, suffered from a serious drawback: hard-coded reuse criteria that were not accessible by a repository user. This made data reuse in the repository inflexible and impractical, since an experienced programmer was needed to change the reuse criteria. This drawback drastically diminished the effectiveness of a repository, and without user access to the reuse criteria, repositories still had essentially the same serious problems discussed above.
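The distinction between hard-coded and user-accessible reuse criteria may be illustrated with the following minimal sketch. It assumes, purely for illustration, that the criteria are held as data (a mapping from entity type to the fields that must match) rather than in program logic; the names `reuse_criteria`, `reuse_key` and `find_reusable`, and the field names, are hypothetical and do not describe the form the criteria take in any actual repository.

```python
# Hypothetical reuse criteria held as data: for each entity type, the
# fields that must match for a stored record to be reused.
reuse_criteria = {
    "TABLE": ["name", "schema"],
    "COLUMN": ["name", "data_type"],
}


def reuse_key(record: dict, criteria: dict) -> tuple:
    """Build the comparison key for a record from the current criteria."""
    fields = criteria[record["entity_type"]]
    return (record["entity_type"],) + tuple(record[f] for f in fields)


def find_reusable(record: dict, stored: list, criteria: dict):
    """Return a stored record that satisfies the reuse criteria, or None."""
    key = reuse_key(record, criteria)
    for candidate in stored:
        if (candidate["entity_type"] == record["entity_type"]
                and reuse_key(candidate, criteria) == key):
            return candidate
    return None


stored = [{"entity_type": "TABLE", "name": "CUSTOMER", "schema": "SALES"}]
incoming = {"entity_type": "TABLE", "name": "CUSTOMER", "schema": "HR"}

# Under the initial criteria the schemas differ, so nothing is reused.
strict_match = find_reusable(incoming, stored, reuse_criteria)

# A user relaxes the criteria for tables without changing any code.
relaxed_criteria = dict(reuse_criteria, TABLE=["name"])
relaxed_match = find_reusable(incoming, stored, relaxed_criteria)
```

Because the criteria are data in this sketch, changing which fields govern reuse requires editing a mapping rather than the matching logic, which is the kind of flexibility that hard-coded criteria lack.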