1. Field of the Invention
The present invention relates to a method for data correlation, and, more specifically, to a method for correlation of information across distinct domains.
2. Description of the Related Art
In information-rich environments, knowledge of the relationships between information artifacts (such as software applications, datasets, websites, news articles, links, or many other types of information and data) is necessary to ensure that relevant information is made available when and where it is needed. The World Wide Web provides a readily available example, with its vast collection of documents and the familiar task of creating search strings to locate desired documents. Similar information retrieval and organization tasks arise in any scenario that involves the production and consumption of information, including government intelligence communities, where collectors and reporters produce information artifacts that must be disseminated to the necessary consumers; criminal investigation and legal services, where vast amounts of documentation must be organized and searched to discover information relevant to a given case; news services, where stories must be categorized and linked based on topical similarity; and customer service management, where customer requests and complaints must be routed to the relevant representative or directed to relevant information.
When the environment contains a large number of information artifacts, manual encoding of the relationships between artifacts becomes difficult or impossible. To solve this problem, a number of methods, such as the Term Frequency-Inverse Document Frequency (“TF-IDF”) method, have been devised and implemented for automatically determining the relationships between artifacts.
Prior art methods typically determine similarity between two artifacts using features that the artifacts share. For example, correlation between two documents containing English language text (document A and document B) can be determined by comparing features such as words, phrases, or concepts contained in each document (often requiring some form of natural language pre-processing). The impact of any given feature on the document correlation metric would typically take one or more of the following into consideration:
1. Frequency of occurrence of the feature in document A;
2. Frequency of occurrence of the feature in document B;
3. Frequency of occurrence of the feature in the corpus;
4. Total number of features in document A;
5. Total number of features in document B;
6. Placement of the feature in document A;
7. Placement of the feature in document B; and/or
8. Domain or pragmatic knowledge about the feature (ontologies), among many others.
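As an illustration only, the following minimal sketch shows one way a feature-based correlation metric of this kind can be computed. It covers only the frequency and document-length factors listed above, not feature placement or ontologies. The function names (tokenize, tf_idf_vectors, cosine), the whitespace tokenizer, the smoothed inverse document frequency formula, and the sample documents are all assumptions chosen for the example and are not taken from any particular prior art system.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption); real systems would apply
    # natural language pre-processing such as stemming or phrase detection.
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return one sparse {feature: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    corpus_size = len(tokenized)

    # Number of documents in the corpus that contain each feature.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)  # total number of features in this document
        vectors.append({
            # Term frequency, normalized by document length, times a
            # smoothed inverse document frequency (one common variant).
            term: (count / total)
            * (math.log((1 + corpus_size) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical three-document corpus.
documents = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vectors = tf_idf_vectors(documents)
similar = cosine(vectors[0], vectors[1])    # documents share several words
unrelated = cosine(vectors[0], vectors[2])  # documents share no words
```

In this sketch only features present in both documents contribute to the dot product, so the score depends entirely on feature co-occurrence.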
Such methods have proven effective for identifying artifacts with a high degree of correlation to a given artifact in cases where information in the given artifact and the collection is conveyed using similar features. In cases where the given artifact and the collection use disparate sets of features, this methodology is ineffective due to the lack of feature co-occurrence (i.e., occurrence of a feature both in the given artifact and the target artifact) or the requirement to build and maintain large, complex, and dynamic ontologies.
This disparity between the features of a given artifact and those of a target collection can occur for a number of reasons, including differences in language or culture (for example, an English language query targeted at a collection of French language documents), differences in collection purpose (for example, a marketing brochure matched against a collection of detailed product specifications), differences in format (for example, entries from the sales tracking databases of two companies), differences in sub-language (for example, using a chemical research paper to find similar papers in a repository of medical research papers), or natural shifts in terminology over time (for example, news articles placed in predefined categories based on similarity to legacy news articles), among many others.
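The effect of missing feature co-occurrence can be seen in a small sketch of the English query and French collection case mentioned above. The count_cosine helper, the query, and the two French sentences are hypothetical and serve only to illustrate the point; the sketch uses raw word counts rather than any specific prior art weighting.

```python
import math
from collections import Counter


def count_cosine(a, b):
    """Cosine similarity over raw word counts of two text strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only words that occur in both texts contribute to the dot product.
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(
        sum(c * c for c in cb.values())
    )
    return dot / norm if norm else 0.0


query = "the dog chases the cat"
collection = [
    "le chien poursuit le chat",    # a French rendering of the query
    "la voiture rouge est rapide",  # unrelated: "the red car is fast"
]
scores = [count_cosine(query, doc) for doc in collection]
# → [0.0, 0.0]
```

Both documents receive the same score of zero, so a co-occurrence-based metric cannot rank the relevant French document above the unrelated one.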
As a result, there is a continued need for an improved methodology that correlates information artifacts across distinct domains, including where there is a lack of feature co-occurrence.