1.0 Field of the Invention
This invention relates to the discovery of relationships in data; and in particular, this invention relates to tolerant and extensible discovery of relationships in data using structural information and data analysis.
2.0 Description of the Related Art
The wide adoption of an Internet-based business model over the past decade has caused an explosion of information. Some estimates suggest that two hundred fifty megabytes of information are generated every year for every living man, woman and child. As a result, the volume and heterogeneity of available data are increasing at a pace that far exceeds our ability to absorb it. In such an environment, the advantage lies with enterprises that can turn this information into a valuable asset in the marketplace.
This is no easy task. An enterprise typically acquires a multitude of disparate databases, content repositories and file systems as it matures. Because such systems are usually developed independently, these information sources may reside in different locations, be managed by different departments, and even store similar or related data according to different data models and schemas. Often, the high-level objective of a new application is not to start from scratch, but to find, integrate and reuse existing assets. For example, one such application may be to develop a single view of customer data that is spread across an enterprise in different databases, departments and locations, so as to synchronize that customer's information for various purposes: to avoid contacting the same customer multiple times, to propagate changes to a name and address to each system, and to have all information about a customer available when that customer places a service call. Another application may be to locate all of the information sources that reference a particular business concept, such as a purchase order or bill of materials. Yet another application may be to identify independently defined and redundant information assets across an enterprise, such as multiple XML (Extensible Markup Language) Schemas or relational schemas that define customer information, or multiple Web Services Description Language (WSDL) files that describe the same or similar Web services.
An XML Schema, also referred to as an XSD (XML Schema Definition) model, is defined using the XML Schema language and typically exists as a file with an extension of “xsd”. The World Wide Web Consortium (W3C) describes the XML Schema language in “XML Schema Part 0: Primer Second Edition, W3C Recommendation 28 Oct. 2004.” A WSDL file is a document, written in XML, which specifies the location of a Web service and how to access it.
A significant challenge to building the aforementioned applications is to discover if and how information from multiple data sources may be related. Data is typically stored in multiple sources, such as relational databases, files, applications and queues, and the structure of the data can be described in multiple formats such as relational tables, XSD models, comma separated text files and WSDL files, to name a few. Oftentimes, the structure of the data can be quite complex. An XSD model may have hundreds of XML elements, and a relational table definition may have hundreds of columns. Discovering relationships among information sources may involve multiple levels of analysis such as structural comparison, data comparison and semantic analysis.
In structural comparison, metadata describing the structure of the data in a data source is analyzed. For instance, in a relational database management system, data is stored in tables which have rows and columns; and, in some relational database management systems, a schema contains one or more tables. The database management system contains metadata, in a schema definition, which specifies the names of the tables belonging to a schema. The database management system also contains metadata, in a table definition, which describes a table. The table definition specifies the names of the columns of the table and the type of data stored in the columns. The similarity of two columns can be determined by comparing the names and the data types of the columns. The similarity of two tables can be determined based on the similarity of the columns of the tables and a comparison of the table names. The similarity of two schemas can be determined based on the similarity of the tables of the schemas and the similarity of the schema names.
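As a rough illustration of the column and table comparison described above, the following Python sketch blends name similarity with a data-type match. It is a minimal example under stated assumptions, not a method taken from any cited algorithm: the function names, the weights, the use of the standard library's difflib for string similarity, and the sample tables are all hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def column_similarity(col_a, col_b, name_weight=0.7):
    """Blend name similarity with a data-type match.

    Each column is a (name, data_type) pair. The 0.7 weight is an
    arbitrary illustrative choice, not a value from the text.
    """
    (name_a, type_a), (name_b, type_b) = col_a, col_b
    type_score = 1.0 if type_a.upper() == type_b.upper() else 0.0
    return name_weight * name_similarity(name_a, name_b) + (1 - name_weight) * type_score


def table_similarity(table_a, table_b, name_weight=0.3):
    """Combine table-name similarity with the average best column match.

    Each table is a (name, [columns]) pair. The 0.3 weight is assumed.
    """
    (name_a, cols_a), (name_b, cols_b) = table_a, table_b
    if cols_a and cols_b:
        # For every column of A, keep the score of its best match in B.
        best = [max(column_similarity(a, b) for b in cols_b) for a in cols_a]
        col_score = sum(best) / len(best)
    else:
        col_score = 0.0
    return name_weight * name_similarity(name_a, name_b) + (1 - name_weight) * col_score


# Hypothetical table definitions used only for illustration.
customers = ("CUSTOMER", [("custId", "INTEGER"), ("creditClass", "CHAR")])
clients = ("CLIENT", [("customerId", "INTEGER"), ("creditRating", "CHAR")])
orders = ("ORDERS", [("orderDate", "DATE"), ("total", "DECIMAL")])

print(table_similarity(customers, clients))
print(table_similarity(customers, orders))
```

With these sample definitions, the CUSTOMER and CLIENT tables score higher than CUSTOMER and ORDERS. Schema similarity could be built the same way, one level up, from table scores and schema names.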
In data comparison, the data contained in the data sources is analyzed. For example, the similarity of various columns can be determined by analyzing the overlap in data values, the distribution of data values and/or the signature of the data values contained in the columns. The signature of the data values refers to the characteristics of the data values in the columns, for example, the position and/or grouping of alphabetic, numerical and special characters within each data value.
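The signature and overlap measures described above can be sketched in Python as follows. This is an illustrative example only: the mapping of letters to "A" and digits to "N" follows the NNN-NN-NNNN notation used later in this section, while the function names, the choice of a Jaccard ratio for overlap, and the sample values are assumptions.

```python
from collections import Counter


def signature(value: str) -> str:
    """Replace letters with 'A' and digits with 'N'; keep special characters."""
    return "".join(
        "A" if ch.isalpha() else "N" if ch.isdigit() else ch for ch in value
    )


def dominant_signature(values):
    """Return the most frequent signature in a column sample, or None if empty."""
    counts = Counter(signature(v) for v in values)
    return counts.most_common(1)[0][0] if counts else None


def value_overlap(values_a, values_b):
    """Jaccard overlap of the distinct values in two columns (assumed measure)."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Made-up sample values for two columns that share a name but not a format.
column_one = ["123-45-6789", "987-65-4321", "555-12-3456"]
column_two = ["AB12345", "CD67890", "EF24680"]

print(dominant_signature(column_one))
print(dominant_signature(column_two))
print(value_overlap(column_one, column_two))
```

Comparing the dominant signatures of two columns gives a quick check that is independent of the column names, and the overlap ratio shows whether the columns share any actual values.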
In semantic analysis, dictionaries, term expansion rules and other domain-specific knowledge are used to analyze various aspects of metadata. For example, the term “FVT” can be expanded into “functional verification test” based on information in a dictionary.
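Dictionary-based term expansion of this kind can be illustrated with a short Python sketch. Only the “FVT” entry comes from the text; the tokenizing rules, the function names and the sample identifiers are hypothetical, and a real dictionary would hold many more domain-specific entries.

```python
import re

# Only the "fvt" entry is taken from the text above.
DICTIONARY = {"fvt": "functional verification test"}


def split_identifier(name: str):
    """Split an identifier on underscores, hyphens, spaces and camelCase."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [token.lower() for token in re.split(r"[\s_\-]+", spaced) if token]


def expand(name: str, dictionary=DICTIONARY) -> str:
    """Expand each token that has a dictionary entry; keep the others as-is."""
    return " ".join(dictionary.get(token, token) for token in split_identifier(name))


print(expand("FVT_status"))
print(expand("creditClass"))
```

Expanded names can then be passed to a name-matching step, so that an abbreviation in one source and its full form in another are compared on equal terms.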
A number of heuristics and algorithms have been developed over the years to analyze the structural definitions of data as well as instance data, that is, the data contained in a data source. Examples of heuristics and algorithms include structural and semantic name matching; signature algorithms such as described in “Attribute Classification Using Feature Analysis” by F. Naumann, C. T. Ho, X. Tian, L. Haas and N. Megiddo, in the 18th IEEE International Conference on Data Engineering (ICDE), San Jose, Calif., Feb. 26-Mar. 1, 2002; statistical analysis such as described by Ashraf Aboulnaga, Peter J. Haas, Sam Lightstone, Guy M. Lohman, Volker Markl, Ivan Popivanov and Vijayshankar Raman in “Automated Statistics Collection in DB2 UDB,” Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004; and others such as described by Jayant Madhavan, Philip A. Bernstein and Erhard Rahm in “Generic Schema Matching with Cupid,” Proceedings of the 27th VLDB Conference, Roma, Italy, 2001, pp. 49-58, and by D. Caragea and T. Syeda-Mahmood in “Semantic API matching for Services Composition,” Proceedings of the 13th International World Wide Web Conference on Alternate track papers & posters, New York, N.Y., 2004.
Algorithms which run independently on the same sets of metadata can reinforce one another's results. For example, suppose that a structural matching algorithm suggests a relationship between a column named “creditClass” on one source and a column named “creditRating” on another source. A second algorithm that performs statistical analysis on the data in the columns may find that the columns contain the same values, which also suggests a relationship between the columns.
Conversely, the results of one algorithm may invalidate the results of another algorithm. For example, suppose that a name matching algorithm indicates that a column named “SSN” of a first table is a match to a column named “SSN” of a second table. This algorithm indicates that the columns are the same, and may also make the additional inference that both columns contain social security numbers. However, another algorithm which analyzes the signature of the data in the columns may find that the format of the data in the column in the first table is NNN-NN-NNNN, while the format of the data in the column in the second table is AANNNNN. Thus, the signature of the data indicates that the columns probably do not represent the same data.
Since algorithms that perform discovery are developed independently and for different purposes, combining their execution and their results is a difficult task. Each algorithm typically requires a structural description of the data using its own model, such as relational tables, the XSD model or a comma-separated text file, and may only analyze data of a specific format. In addition, each algorithm typically computes its own measurement which indicates the strength of a relationship among metadata. For example, in some algorithms a measurement may have a value between zero and one, while in another algorithm a measurement may have a value between zero and one hundred. Furthermore, the direction of the scale may differ: in one algorithm a value of zero may indicate a strong relationship, while in another algorithm a value of zero may indicate a weak relationship.
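To make the measurement mismatch concrete, the following Python sketch maps scores from differently scaled algorithms onto a common range before averaging them. It is one possible approach shown for illustration, not the technique of any particular algorithm; the ScoreScale class, the function names, the simple averaging rule and the sample scores are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreScale:
    """Describes how one algorithm reports relationship strength (hypothetical)."""
    low: float
    high: float
    higher_is_stronger: bool = True


def normalize(raw: float, scale: ScoreScale) -> float:
    """Map a raw score onto [0, 1], where 1 always means a strong relationship."""
    fraction = (raw - scale.low) / (scale.high - scale.low)
    fraction = min(1.0, max(0.0, fraction))  # clamp out-of-range scores
    return fraction if scale.higher_is_stronger else 1.0 - fraction


def combine(results):
    """Average the normalized scores of several (raw_score, scale) pairs."""
    scores = [normalize(raw, scale) for raw, scale in results]
    return sum(scores) / len(scores)


# Three hypothetical algorithms scoring the same pair of columns.
unit_scale = ScoreScale(0.0, 1.0)
percent_scale = ScoreScale(0.0, 100.0)
distance_scale = ScoreScale(0.0, 1.0, higher_is_stronger=False)  # zero is strong

print(normalize(75, percent_scale))
print(combine([(0.8, unit_scale), (75, percent_scale), (0.1, distance_scale)]))
```

A plain average treats every algorithm as equally reliable; a weighted scheme, or one that lets a strong negative result override the others, would be equally plausible choices.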
Therefore, there is a need for an improved technique that allows multiple algorithms to be used to discover relationships in metadata. The technique should also be extensible, allowing additional algorithms to be incorporated and the results of the algorithms to be combined. In addition, the technique should allow relationships to be discovered over disparate sources having multiple data formats. The technique should also provide for data sampling access across distributed, heterogeneous sources.