To return results, search engines and directory services must first index web pages. Improving the accuracy of the indexing process has long been a focus of these services. In particular, they aim to improve the signal-to-noise ratio of the web pages that they index.
One problem in the field of search indexing is how to separate extraneous material, such as advertisements, site navigation components, headers, footers, and copyright notices, from the page being indexed. In other words, the problem is how to remove the context of a web page so that only its content is considered for indexing purposes. Eliminating noisy context improves search results by removing from consideration material that is irrelevant or that could produce erroneous matches. For example, a search for the term “navigation” in the sense used in orienteering may return many web pages in which the term appears only as part of the phrase “navigation bar”, which could amount to a very large number of irrelevant search results.
One current solution to this problem is to parse the document being indexed into discrete semantic units. These semantic units are then analyzed independently to separate content that is relevant for indexing purposes from content that is not. This solution, however, may not be economically feasible for processing large numbers of documents, because walking through each individual document semantically to identify irrelevant context takes time and resources. Furthermore, depending on the criteria used to identify irrelevancy, it may classify content as irrelevant when it is actually relevant.
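By way of illustration only, the per-document approach described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular search engine: the class `BlockSplitter`, the functions `is_content` and `extract_content`, the tag lists, and the 0.5 link-density threshold are all hypothetical choices made for this example. It uses only the Python standard library.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "semantic units" for this sketch.
BLOCK_TAGS = {"div", "p", "nav", "header", "footer", "section",
              "article", "aside", "ul", "ol", "table"}
# Assumption: units nested inside these tags are treated as context (noise).
NOISE_TAGS = {"nav", "header", "footer", "aside"}


class BlockSplitter(HTMLParser):
    """Splits an HTML document into text units at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._stack = []        # currently open block-level tags
        self._link_depth = 0    # greater than 0 while inside an <a> element
        self._new_block()

    def _new_block(self):
        self.blocks.append({"path": tuple(self._stack), "text": [],
                            "words": 0, "link_words": 0})

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append(tag)
            self._new_block()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            if tag in self._stack:
                # Pop through the matching tag, tolerating unclosed children.
                while self._stack.pop() != tag:
                    pass
            self._new_block()
        elif tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        words = data.split()
        if not words:
            return
        block = self.blocks[-1]
        block["text"].extend(words)
        block["words"] += len(words)
        if self._link_depth:
            block["link_words"] += len(words)


def is_content(block, max_link_density=0.5):
    """Hypothetical relevance criterion: non-empty, outside noise tags,
    and not dominated by link text."""
    if block["words"] == 0:
        return False
    if any(tag in NOISE_TAGS for tag in block["path"]):
        return False
    return block["link_words"] / block["words"] <= max_link_density


def extract_content(html):
    """Returns the text of the units judged relevant for indexing."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    kept = (" ".join(b["text"]) for b in parser.blocks if is_content(b))
    return " ".join(kept)


page = ('<html><body><nav><a href="/">Home</a> '
        '<a href="/maps">Navigation bar</a></nav>'
        '<div><p>Orienteering relies on map and compass navigation.</p></div>'
        '<footer>Copyright 2004 Example Corp</footer></body></html>')
print(extract_content(page))
```

Even in this small form, every document must be fully parsed and every unit scored, which reflects the cost concern noted above. The fixed threshold also shows how the chosen criteria can misclassify relevant material: a content paragraph consisting mostly of links would be discarded.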
Accordingly, a mechanism that removes the noise introduced by context that is not relevant to the content of a document, and that does so in an efficient and economically feasible way, would be beneficial.