Basic web-based content searching techniques are well known. Common examples are readily visible in publicly available Internet search portals. With the organic growth of content on the Internet, searching techniques are only as good as their ability to prioritize or sort document identifiers (e.g., the hyperlink together with description data such as an abstract). Additionally, the vast breadth of searchable content is typically searched using a limited number of relatively basic search terms, which compounds the relevance concerns when returning search results.
Existing search result generation techniques recognize and incorporate generalized relevance aspects when sorting and prioritizing search results. The sorting and prioritizing is typically a precursor operation to the generation of a search results page, where the search results page includes, among other things, hyperlinks and abstracts briefly describing the documents found in the search results. For example, a first search results page may contain the first twenty-five document identifiers as sorted and prioritized by the search engine, with each hyperlink accompanied by an abstract. Various engines may use different techniques for sorting and prioritizing the content. The search results page may be one of any number of pages, with the total limited either by the number of search results or by a system limit on the number of results shown, for example the first 500 results.
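The paging behavior described above can be illustrated with a minimal sketch. It assumes the search engine has already produced (identifier, score) pairs; the function names `paginate` and `page_count` are hypothetical, and the page size of 25 and cap of 500 are taken from the examples in the text rather than from any particular engine.

```python
from typing import List, Tuple

PAGE_SIZE = 25     # results per page, as in the example above
MAX_RESULTS = 500  # system-imposed cap, as in the example above


def paginate(scored: List[Tuple[str, float]], page: int,
             page_size: int = PAGE_SIZE,
             max_results: int = MAX_RESULTS) -> List[str]:
    """Return the document identifiers shown on a 1-based results page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Sort by descending score; ties are broken by identifier so the
    # ordering is deterministic. Then apply the system cap.
    ranked = sorted(scored, key=lambda item: (-item[1], item[0]))[:max_results]
    start = (page - 1) * page_size
    return [doc_id for doc_id, _ in ranked[start:start + page_size]]


def page_count(total_results: int, page_size: int = PAGE_SIZE,
               max_results: int = MAX_RESULTS) -> int:
    """Number of pages, limited by the result count or by the system cap."""
    shown = min(total_results, max_results)
    return -(-shown // page_size)  # ceiling division
```

Under these assumptions, a query that matches 1,000 documents still yields only 20 pages, because the cap is applied before the results are split into pages.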
In existing techniques, the relevance score of a document is calculated solely based on attributes of the document and the query, such as term statistics, site authority, document-query similarities, etc. The term "documents," as used herein, refers generally to any suitable type of content that is accessible and viewable through the Internet, including HTML-encoded documents, proprietary-encoded documents (e.g., PDFs), audio and/or video files, images, etc.
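As an illustration of relevance scoring that depends only on the document and the query, the following sketch combines the three kinds of attributes named above. The weighted sum, the weight values, the use of Jaccard similarity, and the names `Document`, `term_frequency`, `jaccard`, and `relevance_score` are all assumptions made for illustration; the text does not specify how any existing engine combines these attributes.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Document:
    doc_id: str
    text: str
    site_authority: float  # hypothetical precomputed value in [0, 1]


def term_frequency(query_terms: Set[str], doc_terms: List[str]) -> float:
    """Fraction of document tokens that match a query term."""
    if not doc_terms:
        return 0.0
    hits = sum(1 for term in doc_terms if term in query_terms)
    return hits / len(doc_terms)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap, used here as a simple document-query similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance_score(query: str, doc: Document,
                    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted sum of term statistics, site authority, and similarity."""
    q_terms = set(query.lower().split())
    d_terms = doc.text.lower().split()
    w_tf, w_auth, w_sim = weights  # illustrative weights, not from the source
    return (w_tf * term_frequency(q_terms, d_terms)
            + w_auth * doc.site_authority
            + w_sim * jaccard(q_terms, set(d_terms)))
```

Note that the only inputs to the score in this sketch are the query and the document itself; neither the abstract shown to the user nor the user's selections play any role.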
Existing techniques fail to take into consideration abstracts included with the hyperlinks. The existing systems make the implied connection that a user's selection of the hyperlink relates to the underlying document, but in fact the user selection may more appropriately relate to the text of the abstract. The user may be making a hyperlink selection based on the content of the abstract indicating that the subsequent document contains the information the user is seeking.
Attempts have been made to automate text recognition and categorization as may be applied to the abstract, but these attempts have mostly failed or have performed poorly. For example, one approach is a technique based on the Metadata Object Description Schema (MODS). This bibliographic schema was originally developed by the Library of Congress and has since been applied as an XML schema. Even using this defined schema is problematic, however, because the schema defines relationships between various terms that may be found in an abstract but fails to account for the underlying search term. In other words, the MODS technique may find relationships between different terms, but these relationships are not put into any usable context for a search engine because they are not associated with search terms. Furthermore, the MODS technique is, at best, a schema and lacks specifics for implementation with search techniques.
As such, there exists a need for enhancing search results based on the relationship of terms in the abstracts of the document identifiers, relative to the user selection activities of the corresponding hyperlinks and also the corresponding search term used in the search.