1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular, to determining the veracity of data in a repository using a semantic network.
2. Description of the Related Art
The Internet is a globally accessible network of computers that collectively provide a large amount and variety of information to users. Through services of the Internet such as the World Wide Web (or simply, the “web”), users may retrieve or “download” data from Internet network sites and display that data, which may include information presented as text in various fonts, graphics, images, and the like, with an appearance intended by the publisher. As the volume of information available through the Internet has grown, however, finding a particular piece of information among the millions of “web sites” available can be daunting.
One way of sorting through this mass of information to find what is of interest to a particular user is through the use of “search engines”. Search engines are software programs that search, among millions of web sites or large document repositories, for keywords or search criteria entered by a user, and return to the user a list of links (such as references to HTML pages) to the sites or documents that the search engine determines to be most relevant to the criteria entered. Different search engines use different methods of determining the relevance of web sites or documents, but most use some form of quantitative method that scores a site or document based on how many times the search words entered by the user appear within it.
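The count-based relevance method described above can be illustrated with a minimal sketch. The function names (`tokenize`, `term_frequency_score`, `rank`) and the sample documents are hypothetical and chosen only for illustration; no particular search engine is asserted to work exactly this way.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency_score(query, document):
    """Score a document by how many times the query words occur in it."""
    counts = Counter(tokenize(document))
    # Each distinct query word contributes its number of occurrences.
    return sum(counts[term] for term in set(tokenize(query)))


def rank(query, documents):
    """Return links to matching documents, highest occurrence count first.

    `documents` maps a link (hypothetical, e.g. "a.html") to its text.
    Documents containing none of the query words are omitted.
    """
    scored = [(term_frequency_score(query, text), link)
              for link, text in documents.items()]
    matches = [(score, link) for score, link in scored if score > 0]
    # Sort by descending score, breaking ties alphabetically by link.
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [link for _, link in matches]
```

In this sketch, a document that merely repeats a search word many times outranks one that uses the word once, regardless of which document better addresses the user's query.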
Search engines typically return only a list of links to sites or documents that contain one or more of the search terms entered by the user. Often, this list does not contain sites or documents that are actually relevant to the search query. A user may have difficulty finding a relevant site or document because existing search engines classify web pages and documents based on raw statistical analysis of the words in a page. This technique is often called the “bag of words” model. Under the “bag of words” model, existing search engines do not take into consideration the meaning of the words or the significance of the relationships between concepts. While such search models are adequate for merely locating web sites or documents that contain one or more terms in a user's search query, they lack the ability to determine which of the located documents is most relevant to the search query.
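The limitation of the “bag of words” model can be seen in a short sketch. The helper name `bag_of_words` and the sample sentences are hypothetical; the sketch assumes a simple word-count representation with no weighting.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce text to an unordered multiset of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Two sentences with opposite meanings but exactly the same words.
first = "The dog bit the man"
second = "The man bit the dog"

# Word order, and with it the relationship between the concepts, is lost,
# so the model cannot tell the two sentences apart.
same_representation = bag_of_words(first) == bag_of_words(second)
```

Here `same_representation` is true: because only word counts are kept, any analysis built on this representation treats the two sentences as identical.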
In addition, search engines typically return results based largely on keyword matching and ranking algorithms, and take no account of whether a document (or part of a document) contains out-of-date information. For example, if a geographic area that was previously represented by the ZIP code 11111 has been divided by the postal service into two smaller areas, one retaining ZIP code 11111 and the other receiving new ZIP code 22222, the postal service will assign the new ZIP code 22222 to an address that is now located in the new area. However, if that address appears in multiple places on the web, searches for the address will likely return web pages that contain the out-of-date ZIP code 11111, as the owner of the address may not have the ability to update all occurrences of the address with the new ZIP code information.
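The ZIP code scenario above can be sketched as follows. The street address, page links, and function names are hypothetical, and the check in `stale_hits` assumes the superseded ZIP code is already known, which is precisely the knowledge a keyword-based search engine lacks.

```python
# Hypothetical web pages mentioning the same address; only one was updated.
SAMPLE_PAGES = {
    "directory.html": "Acme Corp, 12 Example Street, Anytown 11111",
    "homepage.html": "Visit us at 12 Example Street, Anytown 22222",
    "review.html": "I went to 12 Example Street, Anytown 11111 last year",
}


def keyword_search(query, pages):
    """Return links to every page whose text contains the query string."""
    needle = query.lower()
    return sorted(link for link, text in pages.items()
                  if needle in text.lower())


def stale_hits(hits, pages, old_zip):
    """Return the hits that still carry the superseded ZIP code."""
    return [link for link in hits if old_zip in pages[link]]


hits = keyword_search("12 Example Street", SAMPLE_PAGES)
outdated = stale_hits(hits, SAMPLE_PAGES, "11111")
```

In this sketch the keyword search returns all three pages, and two of them still show ZIP code 11111. Nothing in the keyword match itself distinguishes the current page from the out-of-date ones.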