Although the origins of the Internet trace back to the late 1960s, the more recently developed World Wide Web (“Web”), together with the long-established Usenet, has revolutionized access to untold volumes of electronically stored information for a worldwide audience, including written, spoken (audio) and visual (imagery and video) information, in both archived and real-time formats. The Web provides information via interconnected Web pages that can be navigated through embedded hyperlinks. In short, the Web provides desktop access to a virtually unlimited library of information in almost every language. The Web has proven particularly helpful in facilitating on-line shopping by providing easy access to helpful information and to resources often unavailable in a conventional “brick and mortar” store.
Search engines have evolved in step with the increased usage of the Web to enable users to find and retrieve relevant Web content in an efficient and timely manner. As the amount and types of Web content have increased, the sophistication and accuracy of search engines have likewise improved. Search engines strive to provide responsive and quality search results. Determining quality is difficult, though, as the relevance of retrieved Web content is inherently subjective and dependent upon the interests, knowledge and attitudes of the user.
News messages available via the Usenet are cataloged into specific news groups, and finding relevant content involves straightforward searching of news groups and message lists. Web content, however, is not organized in a structured manner, such as by providing labels, clusters or categories that map Web content by shared property or meta characteristic. Search engines have evolved to help users find and retrieve relevant Web content, as well as news messages and other content types. Existing methods used by search engines are based on matching search query terms to terms indexed from Web pages. More advanced methods determine the importance of retrieved Web content using, for example, a hyperlink structure-based analysis, such as described in S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Search Engine,” (1998) and in U.S. Pat. No. 6,285,999, issued Sep. 4, 2001 to Page, the disclosures of which are incorporated by reference.
Despite improvements in Web content searching, not all Web content is equally retrievable. For instance, some types of Web content are esoteric and may be referenced so infrequently that relatively few hyperlinks are available for a search engine to identify and exploit. Similarly, other types of Web content, such as advertisements, are short-lived and can change frequently, often making retrieval a matter of timing rather than of quality of match. Still other types of Web content, especially advertisements, are highly repetitive and duplicate a significant amount of content between individual Web pages.
One approach to searching poorly retrievable Web content resorts to basic text matching. Those types of Web content that tend to yield poor quality search results due to few hyperlink references, short duration or highly repetitive content are grouped into a separate search corpus. Search results are then identified from the search corpus based on the quality of matching of search query terms to individual documents. The search results having the most text matches can be scored or ranked in quantitative terms by relative goodness of match.
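The basic text matching approach described above can be illustrated with a minimal sketch. The function names (`tokenize`, `score_by_text_match`), the sample corpus, and the choice of counting query-term occurrences as the score are assumptions made for illustration only; the source does not specify how goodness of match is computed.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_by_text_match(query, corpus):
    """Rank documents in a separate search corpus by literal term matches.

    The score is the number of times any query term occurs in the document
    (an assumed, purely quantitative measure of goodness of match).
    """
    terms = set(tokenize(query))
    scored = []
    for doc_id, text in corpus.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document identifier.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical corpus of short-lived, advertisement-like documents.
corpus = {
    "ad1": "Used bicycle for sale, bicycle helmet included",
    "ad2": "Bicycle repair stand",
    "ad3": "Garden hose, 50 ft",
}
ranked = score_by_text_match("used bicycle", corpus)
print(ranked)
```

In this sketch the ranking reflects only how many query terms appear in each document, not whether the document is substantively about what the user wants.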
Although text matching may yield relevant results, basic text matching suffers from several drawbacks. First, the search query terms are treated in literal fashion and other relevant Web content may be overlooked or omitted. Similarly, search query terms or phrases may have different senses, which can result in an ambiguous search query. Finally, the score or rank only quantitatively reflects goodness of match and not quality of match. For example, a search engine could identify several documents in response to a search query requesting, “35 mm Camera.” However, documents substantively relating to particular camera models would be qualitatively better matches than documents relating to camera accessories or film supplies.
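The drawbacks above can be made concrete with a short sketch built on the “35 mm Camera” query. The helper `literal_match_count`, the sample documents, and the product name “Acme X100” are hypothetical and are not drawn from the source; the counting rule is an assumed stand-in for a quantitative match score.

```python
import re


def literal_match_count(query, text):
    """Count the document tokens that literally equal some query term."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(1 for token in tokens if token in terms)


# Hypothetical documents: a camera model, an accessory, and a camera model
# described with different wording ("35mm", "SLR") than the query uses.
docs = {
    "camera_model": "Acme X100 35 mm camera",
    "accessory": "Padded strap for any 35 mm camera",
    "reworded_model": "Acme X100 35mm SLR body",
}
scores = {
    name: literal_match_count("35 mm Camera", text)
    for name, text in docs.items()
}
print(scores)
```

Under these assumptions the accessory scores the same as the camera model, so the score does not capture quality of match, and the reworded camera listing scores zero because the query terms are treated literally.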
Therefore, there is a need for an approach to qualitatively scoring Web content identified through text matching based additionally on associated and weighted categories. Preferably, such an approach will score both the identified content and the individual search query for quality of match to the categories.