1. Field of the Invention
This invention generally relates to the field of computer-based search systems, and more particularly relates to a system and method for improving data quality in large hyperlinked text databases using pagelets and templates, and to the use of the cleaned data in hypertext information retrieval algorithms.
2. Description of Related Art
The explosive growth of content available on the World Wide Web has led to an increased demand and opportunity for tools to organize, search, and effectively use the available information. People are finding it increasingly difficult to sort through the great mass of content available. A new class of information retrieval algorithms, link-based information retrieval algorithms, has been proposed and shows increasing promise in addressing the problems caused by this information overload.
Three important principles (or assumptions), collectively called Hypertext IR Principles, underlie most, if not all, link-based methods in information retrieval.
1. Relevant Linkage Principle: Links confer authority; by placing a link from a page p to a page q, the author of p recommends q or at least acknowledges the relevance of q to the subject of p.
2. Topical Unity Principle: Documents co-cited within the same document are related to each other.
3. Lexical Affinity Principle: Proximity of text and links within a page is a measure of the relevance of one to the other.
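The way a link-based method might rely on these three principles can be illustrated with a minimal sketch. The toy data, the function names, and the linear 50-word window below are all assumptions made for illustration only; they are not taken from any particular algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each page maps to the links it contains,
# recorded as (target page, word offset of the link within the page).
PAGES = {
    "p1": [("q1", 10), ("q2", 12)],
    "p2": [("q1", 5), ("q3", 200)],
    "p3": [("q1", 7), ("q2", 9)],
}

def authority_by_inlinks(pages):
    """Relevant Linkage: each distinct linking page counts as one recommendation."""
    counts = Counter()
    for links in pages.values():
        for target in {t for t, _ in links}:
            counts[target] += 1
    return counts

def cocitation_counts(pages):
    """Topical Unity: count how often two targets are linked from the same page."""
    counts = Counter()
    for links in pages.values():
        targets = sorted({t for t, _ in links})
        for pair in combinations(targets, 2):
            counts[pair] += 1
    return counts

def lexical_affinity(term_offset, link_offset, window=50):
    """Lexical Affinity: weight falls off linearly with distance, zero beyond the window."""
    distance = abs(term_offset - link_offset)
    return max(0.0, 1.0 - distance / window)
```

Each function takes its principle at face value, so any link that violates the principle feeds directly into the resulting score.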
Each of these principles, while generally true, is frequently and systematically violated on the web. Moreover, these violations have an adverse impact on the quality of results produced by link-based search and mining algorithms. As a result, such algorithms must rely on several heuristic methods to cope with unreliable data, which degrades the performance and overall quality of searching and data mining.
Therefore, a need exists to overcome the problems with the prior art as discussed above, and particularly for a method of cleaning the data prior to a search and eliminating violations of the hypertext information retrieval principles.