The disclosure relates generally to data retrieval systems and more specifically to a method, computer program and computer system for searching, navigating and combining large numbers of heterogeneous data sources with varying data characteristics. Examples of heterogeneous data sources may be found in U.S. patent application Ser. No. 13/070,193, entitled AGGREGATING SEARCH RESULTS BASED ON ASSOCIATING DATA INSTANCES WITH KNOWLEDGE BASE ENTITIES, filed on Mar. 23, 2011; U.S. patent application Ser. No. 13/070,238, entitled ANNOTATING SCHEMA ELEMENTS BASED ON ASSOCIATING DATA INSTANCES WITH KNOWLEDGE BASE ENTITIES, filed on Mar. 23, 2011; U.S. patent application Ser. No. 13/491,724, entitled LINKING DATA ELEMENTS BASED ON SIMILARITY OF DATA VALUES AND SEMANTIC ANNOTATIONS, filed on Jul. 8, 2012; and U.S. patent application Ser. No. 13/543,872, entitled LINKING DATA ELEMENTS BASED ON SIMILARITY OF DATA VALUES AND SEMANTIC ANNOTATIONS, filed on Jul. 8, 2012, each of which is hereby incorporated by reference.
Businesses accumulate massive amounts of data from a variety of sources and employ an increasing number of heterogeneous, distributed, and often legacy data sources to store that data. Although many data sources are available, navigating the large amounts of data in multiple data sources and correlating those heterogeneous sources with all the relevant data a user is interested in obtaining can be a difficult process. Searching and combining information across these heterogeneous data sources and varying data types requires users to be highly technical: they must understand how to use the relevant query language for each data source and then manually merge the results.
Keyword searches are a popular way of finding information on the Internet. However, a keyword search can be undesirable in business contexts. For example, a business analyst of a technology company may be interested in analyzing the company's records for customers in the healthcare industry. Given keyword search functionality, the analyst might issue a “healthcare customers” query over a large number of data sources. Although the search will return results that use the word “healthcare” or some derivative thereof, the search would not return, for example, “Entity A” even though Entity A is a company in the healthcare industry. The search would also fail to provide a connection between Entity A and Subsidiary B, even though the former acquired the latter. As data increases in size and complexity, and as the number of data sources grows, a simple keyword-based search will provide far more results than are easily managed.
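By way of illustration only, the keyword-search limitation described above may be sketched in Python as follows. The records, source names, and the `keyword_search` function are hypothetical and are not taken from the disclosure; the sketch assumes a simple case-insensitive substring match and shows only that such a match does not surface Entity A or its relationship to Subsidiary B.

```python
# Hypothetical records spread across several data sources.
# All names, fields, and values here are illustrative assumptions.
records = [
    {"source": "crm", "text": "Healthcare customers renewal list for Q3"},
    {"source": "crm", "text": "Entity A signed a three-year support contract"},
    {"source": "news", "text": "Entity A acquired Subsidiary B"},
    {"source": "billing", "text": "Subsidiary B invoice overdue"},
]


def keyword_search(query, records):
    """Return records whose text contains any query term.

    Matching is a case-insensitive substring test, so a derivative such as
    "healthcare-related" would also match the term "healthcare".
    """
    terms = query.lower().split()
    return [r for r in records if any(t in r["text"].lower() for t in terms)]


hits = keyword_search("healthcare customers", records)
for r in hits:
    print(r["source"], "|", r["text"])
```

Only the record that literally contains a query term is returned. The records mentioning Entity A and Subsidiary B are omitted because nothing in their text states that Entity A operates in the healthcare industry, and the match has no notion of the acquisition that links the two entities.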