When a search engine searches web pages on the internet, it uses an index of web pages to determine which pages match the search terms entered by a user. A problem with current indexing methods is that indexing is typically performed across an entire document (i.e., a web page), so that a web page is determined to be a match if the entered search terms appear anywhere in the document. Often, when a user enters multiple search terms, the search results include documents in which all of the terms are found, but in unrelated parts of the document. This problem is exacerbated because the raw markup content of a document, as observed by the search engine, may place words from unrelated areas, such as side menus and the like, in proximity to the primary material of the document, thereby reducing the effectiveness of search term proximity searches. As a result, content can be erroneously related in the index, and non-meaningful content, such as link lists, can be included in the index.
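By way of illustration only, the whole-document matching problem described above can be reproduced with a minimal sketch. The function names `build_document_index` and `search_all_terms`, the page identifiers and the sample page text are all hypothetical and chosen for this example; they do not represent any particular search engine's implementation.

```python
import re
from collections import defaultdict


def build_document_index(documents):
    """Map each lowercase word to the set of ids of documents containing it.

    The index is built over the entire text of each document, so it keeps
    no record of which part of the page a word came from.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical pages: in page_a the word "rentals" appears only in a side
# menu, while "bicycle" appears only in the primary article.
documents = {
    "page_a": "Side menu: Holiday rentals. "
              "Main article: how to repair a bicycle chain.",
    "page_b": "Main article: bicycle rentals for your holiday.",
}

index = build_document_index(documents)
matches = search_all_terms(index, "bicycle rentals")
print(sorted(matches))
```

In this sketch both pages are returned for the query "bicycle rentals", even though only page_b relates the two terms in its primary content; page_a matches solely because a side-menu word and an article word are indexed against the same document.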
What is required is a system, method and computer-readable medium that provides improved document indexing.