Data cleansing and gathering software is well known. Applications such as person-merging software can determine household statistics from lists of names and other information, and can merge records of persons who appear under different name variations but are in fact the same person. Such applications can also be applied to business and company information. Various internet websites also attempt to collect information from a number of other websites and present the gathered data. However, these sources are typically used only in a general context and for general informational purposes, because the relevance and accuracy of the gathered information are not considered beyond a superficial level.
As the volume of obtainable information grows and the sources from which such content can be obtained become more varied, the need to normalize such content increases. Such sources can include newspapers, magazines, blogs, social media, etc. Because information from these sources can be incomplete and inconsistent, determining the relevance of content to specific contexts becomes imperative. There is also a need to determine when content can be merged.
It is an object of this invention to provide a novel method and system for content extraction and association.