It has become common for users of computing devices to connect to the World Wide Web (the “web”) and employ web browsers and search engines to locate web pages (or “documents”) having specific content of interest to them. As a result, an ever-increasing amount of data is becoming accessible online, and today the web already contains tens of billions of web pages. Many structured repositories for these web pages are also widely available, and it is useful to link the information available in these various data sources. To this end, there are initiatives aimed at providing specifications for linking data objects from disparate data sources, and the corresponding research on composing information from multiple sources is already quite extensive. However, automating the task of linking information from various data sources is problematic, largely because this abundance of data is being created in an ever-growing variety of forms and formats.
Current solutions for composing data from multiple sources include identifying similar structured records, linking text documents, and matching structured records to text data. Some of these solutions may also employ techniques to identify the object that is the topic of a review: they hypothesize a language model underlying the creation of reviews and then use that model to find the object (or objects) most likely to match the topic of a review. To improve results, such solutions are often generalized to allow attributes in structured records to have different weights and to admit semantic translations of values, but such approaches are highly dependent upon good pre-categorization of both the documents and the structured records pertaining to them. Meanwhile, current work on linking text documents has focused on identifying mentions of phrases representing concepts (i.e., “named entities”) in one document and linking them to other documents that have in-depth information about those concepts. For such approaches, concept phrases are identified and disambiguated using rules, machine learning, or other information extraction techniques. As for matching text to structured records, most known approaches generally involve extracting structured data from the text and then matching the extracted records against the existing structured records.
The foregoing techniques have been widely used to build structured databases algorithmically, but methods for later extracting information from these structured databases often provide only limited accuracy unless also coupled with a substantial labeling effort. One proposed solution is to match concise text snippets to structured records, where the text snippets correspond to brief descriptions of merchant offers that need to be matched to the structured specifications in a product catalog. This solution identifies pieces of text that are identical to values in structured records and tags those pieces with the corresponding property names. The text is thus reduced to a tuple of pairs, each pair comprising a value and a set of plausible property names. A match between this representation of a text and a structured record is then scored by choosing the optimal property name for each value and checking, for each property name the two have in common, whether the values are the same or different. However, these proposals tacitly assume that the text and structured records have been accurately classified in accordance with some taxonomy, and building good taxonomies and accurate classifiers is very difficult to achieve in practice.
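By way of illustration only, the snippet-to-record matching described above might be sketched as follows. The function names (`tag_text`, `score`), the sample catalog, and the scoring weights (+1 for agreement, -1 for disagreement, 0 when a record has none of the plausible properties) are assumptions introduced for this sketch and are not taken from any particular prior proposal. Values are found by simple case-insensitive whole-word matching, which is likewise a simplification.

```python
import re
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]                   # property name -> value
TaggedText = List[Tuple[str, Set[str]]]   # (value, plausible property names)


def tag_text(text: str, records: List[Record]) -> TaggedText:
    """Reduce a text to (value, property-name set) pairs.

    A value is kept when it occurs verbatim (ignoring case) in the text;
    it is tagged with every property name under which it appears in the
    catalog.
    """
    lowered = text.lower()
    tags: Dict[str, Set[str]] = {}
    for record in records:
        for prop, value in record.items():
            value = value.lower()
            if re.search(r"\b" + re.escape(value) + r"\b", lowered):
                tags.setdefault(value, set()).add(prop)
    return [(value, tags[value]) for value in sorted(tags)]


def score(tagged: TaggedText, record: Record) -> int:
    """Score a tagged text against one structured record.

    For each value, the most favourable plausible property name is chosen.
    Weights are hypothetical: +1 agree, -1 disagree, 0 if the record has
    none of the plausible properties.
    """
    total = 0
    for value, props in tagged:
        outcomes = [
            1 if record[prop].lower() == value else -1
            for prop in props
            if prop in record
        ]
        total += max(outcomes, default=0)
    return total


# Hypothetical product catalog and merchant offer.
catalog = [
    {"brand": "Acme", "color": "black", "storage": "64 GB"},
    {"brand": "Acme", "color": "silver", "storage": "128 GB"},
    {"brand": "Zenith", "finish": "silver", "storage": "128 GB"},
]
offer = "Acme phone, silver, 128 GB"

tagged = tag_text(offer, catalog)
scores = [score(tagged, record) for record in catalog]
best_index = max(range(len(catalog)), key=lambda i: scores[i])
```

In this sketch the value “silver” is tagged with two plausible property names (`color` and `finish`), and the score for each record uses whichever of them is most favourable. Note that the sketch, like the proposals it illustrates, presumes the offer and the catalog records already belong to the same product category.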