Technical Field
The present invention relates to data annotation and, in particular, to systems and methods for annotating data elements based on heterogeneous knowledge bases.
Description of the Related Art
Every day, businesses accumulate massive amounts of data from a variety of sources and employ an increasing number of heterogeneous, distributed, and often legacy data repositories to store it. Existing data analytics solutions cannot keep pace with this explosion of data, so business insights not only remain hidden in the data, but also become increasingly difficult to find.
Keyword search is the most popular way of finding information on the Internet. However, keyword search is poorly suited to business contexts. Consider, for example, a business analyst at a technology company who is interested in analyzing the company's records for customers in the healthcare industry. Given keyword search functionality, the analyst might issue a “healthcare customers” query over a large number of repositories. Although the search would return results that use the word “healthcare” or some derivative thereof, it would not return, for example, records for Entity A that do not contain that word, even though Entity A is a company in the healthcare industry. Even worse, the search would return many results with no apparent connection between them. In this case, it would fail to provide a connection between Entity A and Subsidiary B, even though the former acquired the latter.
Although many repositories are available, existing techniques for correlating these heterogeneous sources are inadequate to the task of linking information across repositories in a fashion that is both scalable and precise with respect to the users' intent. Extant techniques perform entity matching in a batch, offline fashion. Such methods generate every possible link between all linkable entities. Generating thousands of links not only requires substantial computation time and considerable storage space, but also requires substantial effort, because the links must be verified and cleaned owing to the highly imprecise nature of linking methods.