Metadata in its simplest form is data about data. Metadata exists as a secondary component of every data processing operation, and has been generated since the first program was written and the earliest file was designed. As programs are written, as data is loaded, and as users execute reports, metadata is created as a natural by-product of processing.
FIG. 1 shows metadata that exists to describe different aspects of programs, files, and other components of the information processing environment. Some simple examples of metadata include: the structure of a file; the definition of an element of data; the count of the records in a database; the data model describing the higher structure of several databases, and so forth.
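As an illustration only, the following minimal Python sketch derives three of the kinds of metadata listed above (the structure of a file, the definition of a data element, and a record count) from a small delimited file. The function name `describe_csv`, the sample data, and the element definitions are hypothetical and are not taken from FIG. 1.

```python
import csv
import io


def describe_csv(text, definitions):
    """Return simple metadata describing a CSV file held in a string.

    `definitions` maps a field name to a human-readable definition;
    both the mapping and this function are illustrative assumptions.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)  # reading the rows also populates reader.fieldnames
    fields = reader.fieldnames or []
    return {
        # Structure of the file: the ordered list of field names.
        "structure": list(fields),
        # Definition of each data element (empty if none was supplied).
        "definitions": {name: definitions.get(name, "") for name in fields},
        # Count of the records in the file, excluding the header row.
        "record_count": len(rows),
    }


sample = "customer_id,balance\n1,10.50\n2,0.00\n"
metadata = describe_csv(sample, {"customer_id": "Unique customer key"})
```

Note that the metadata is produced as a by-product of reading the file; nothing in the file itself had to be changed to obtain it.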
In the 1960's very technical metadata was collected and placed in a catalog. In the 1970's technical metadata was placed in a data dictionary. In the 1980's and early 1990's metadata was placed in a centralized repository. Whether in a catalog, a dictionary, or a repository, the collection and management of metadata has been done on a centralized basis.
Typically metadata is collected and attached to a database (i.e., either a proprietary or a commercial database) residing on a server as a set of data, as seen in FIG. 2. Once the metadata has been collected centrally, it can be accessed by distributed parties. However, once distributed parties have accessed the metadata, no controls or discipline are placed on it after it leaves the centralized repository.
There were two reasons for the centralization of metadata: (1) most organizations that wanted to collect metadata operated in a centralized mainframe environment, and (2) there was the notion that data needed to be managed centrally if there was to be uniformity and consistency of metadata across the organization.
The notion of a centralized approach to the management of metadata persisted until the 1990's. But in the 1990's there was a strong shift away from centralized systems. With data warehousing and DSS processing came truly decentralized processing. As data warehousing matured, different types of data warehouses emerged. There were enterprise data warehouses. There were data marts. There were operational data stores. There were exploration warehouses, and so forth.
Soon the world of computing was anything but centralized. There were attempts to apply the old centralized models of metadata management to a distributed world, but the centralized models and the distributed environment were structurally out of phase with each other.
There was an attempt at distributing centralized metadata. This attempt was called the "hub and spoke" approach to distributed metadata. In the hub and spoke approach, metadata was stored and managed in a central location. Then, as the metadata was needed outside of the centralized environment, the requested metadata was sent (replicated) to the outlying requestor.
In the hub and spoke approach, metadata could be transported from a centralized location to a distributed location. But once having arrived at the distributed location, the metadata entered a nebulous and undisciplined world. The metadata, once distributed, simply became another unit of data at the outlying location, with no special properties, privileges or discipline there. Once at the outlying location, the metadata could be modified, could be passed to anyone else, could be renamed, and so forth. In short there was little or no discipline for the use and management of the metadata once the metadata left the confines of the central metadata management facility.
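The lack of discipline described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of any particular repository product: the `Hub` and `Spoke` classes, their methods, and the sample entry are all hypothetical.

```python
import copy


class Hub:
    """Hypothetical central repository that stores and manages metadata."""

    def __init__(self):
        self._entries = {}

    def define(self, name, definition):
        self._entries[name] = {"name": name, "definition": definition}

    def replicate(self, name):
        # The hub sends an independent copy to the requestor and keeps
        # no record of what happens to that copy afterwards.
        return copy.deepcopy(self._entries[name])

    def lookup(self, name):
        return self._entries[name]


class Spoke:
    """Hypothetical outlying site; replicated metadata is ordinary data here."""

    def __init__(self):
        self.local = {}

    def receive(self, entry):
        self.local[entry["name"]] = entry


hub = Hub()
hub.define("cust_id", "Unique customer key")

spoke = Spoke()
spoke.receive(hub.replicate("cust_id"))

# At the spoke, nothing prevents the entry from being renamed and altered.
entry = spoke.local.pop("cust_id")
entry["name"] = "client_no"
entry["definition"] = "Client number (local meaning)"
spoke.local["client_no"] = entry

# The hub's copy is unchanged and the hub is unaware of the divergence.
diverged = hub.lookup("cust_id")["definition"] != spoke.local["client_no"]["definition"]
```

In this sketch the hub and the spoke now hold two different definitions of what was originally one element, and neither side has any means of detecting the difference.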
Another approach is the "spoke" approach. In the spoke approach (where a site or a server is the spoke) there is no hub, but otherwise everything else is as described in the "hub and spoke" approach. Metadata is sent from a variety of sources to the outlying sites/servers, but there is no attempt to otherwise manage the metadata once it has arrived at one of the spokes. Once at the spoke, metadata can be altered, can be sent to another spoke, can be renamed, etc. Furthermore, there is no comparison or management of the ownership of metadata across different sites/servers once the metadata has arrived at the spoke.
In order to create a world in which metadata is distributed and is managed with discipline at both the hub AND the spoke, some innovative approaches are required. In order to establish enterprise integrity of metadata for distributed metadata, there need to be some conditions that are created and are reinforced by software across the enterprise.
Several U.S. patents describe aspects of data warehousing that involve tables and indices of metadata. U.S. Pat. No. 5,727,197 discloses a method and apparatus for segmenting a database. The database is divided into multiple data segments, each of which may be independently stored on one of a variety of storage devices. The disclosure is directed to the problem of indexing an incoming data stream.
U.S. Pat. No. 5,721,903 discloses a system and method for generating reports from a computer database. A data warehouse is shown in FIG. 1 of the '903 patent, along with associated metadata. The disclosure does not address the problem of maintaining the integrity and synchronization of distributed metadata.
U.S. Pat. No. 5,706,495 discloses encoded-vector indices for decision support and warehousing. The disclosure relates to the structure of the metadata, rather than the issues of maintaining a system of record.
U.S. Pat. No. 5,675,785 discloses a data warehouse which is accessed by a user using a schema of virtual tables. An intelligent warehouse hub interfaces the user and the warehouse database.
An organization called the Meta Data Coalition has been working to establish a format for metadata interchange between software tools from different vendors. The Coalition has published Version 1.1 of the Metadata Interchange Specification, dated Aug. 1, 1997. Certain formats are specified in the interchange files to include information on versioning, data elements, and a "configuration profile" which describes the legal flow of metadata. The specification does not disclose how to use the header information to carry out the "system of record" according to the present invention.
The present invention addresses the problem of maintaining integrity and synchronicity of metadata in a distributed computing environment, and particularly in data warehouses, wherein servers can gain access to metadata stored in different sites within the network.