The amount of information that is available on the Internet is massive, and is growing at an ever-increasing rate. In fact, by some estimates more than 670 Exabytes (670,000,000,000 Gigabytes) of accessible data was stored on the Internet at the end of 2013. This information is largely unstructured, being stored in the form of Web pages of the World Wide Web, blog posts, micro-blog posts, etc., and being created by different entities at different times and for different purposes. Unfortunately, the sheer volume of the information that is available makes it exceedingly difficult to locate specific information that is of interest to a particular user at a particular time. Typically, the user must submit a search query to a search engine, review a list of search results that is returned in response to the search query, view the search results that appear to be relevant, and then review in greater detail those search results that are judged to be most highly relevant. Of course, at some point the user may decide to form a new search query if the previous search results do not appear to be particularly relevant, which leads to a time-consuming trial-and-error approach to information retrieval. It would be beneficial to the user if a search query returned a set of search results including only the information sources that are most relevant to the user.
Search engines do not perform full text searching of Web pages, blogs, etc., every time a search query is received from a user, since the massive amount of information that would need to be searched makes this approach infeasible. Instead, search engines maintain an index of keywords and of the locations where those keywords can be found. Such indexes are created using “spiders” or “webcrawlers” to search the text of Web pages for occurrences of the keywords, as well as following links to other Web pages that are referenced in each Web page, etc. Subsequent searching of the indexed information becomes very fast, because performing the search merely involves looking up the search terms that are provided in a search query, and then retrieving a list of all the information sources that contain the search terms. Unfortunately, since any given combination of search terms is likely to be found in a very large number of Web pages, even “narrow” search queries can result in a very long list of information sources.
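The indexed lookup described above may be illustrated with a minimal sketch of an inverted index. The document identifiers and text below are invented for illustration, and real search engines employ far more elaborate tokenization, storage, and query processing.

```python
from collections import defaultdict

def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query.

    This is the fast lookup step: no document text is scanned at
    query time, only the precomputed index is consulted.
    """
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical corpus for illustration.
documents = {
    "page1": "rotten apple found in the orchard",
    "page2": "apple releases a new phone",
    "page3": "orchard tour this weekend",
}
index = build_index(documents)
```

Note that a query such as "apple" matches both page1 and page2, even though the two pages concern entirely different subjects; the index alone cannot distinguish between them, which foreshadows the ambiguity problem discussed below.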
As is apparent, in order to be of any practical use to the user, the list of information sources must also be presented in an order that is based on some measure of relevance. Early attempts to improve the relevance ranking of search engine results utilized the metadata that is part of a Web page source file. Metadata includes descriptive terms that are provided by the creator of a Web page, but that are not displayed as part of the Web page. Unfortunately, metadata is susceptible to abuse (spamdexing) by those wishing to improve the ranking of their Web page by including keywords that are likely to be included in popular searches, although the keywords may have little or no relevance to the content of the Web page. Such techniques undermine attempts to provide the most relevant information sources to the user.
Modern search engines typically rely on parameters other than Web page metadata to assign relevance rankings to search engine results. For instance, the frequency and location of keywords within a Web page, how long the Web page has existed and the number of other Web pages that link to the Web page in question all factor into the relevance determination. This approach assigns higher relevance rankings to pages that are deemed more relevant or more popular, and is based on the assumption that if other users found a particular Web page to be relevant then it is more likely that future users will also find the same Web page to be relevant and/or authoritative. That being said, one problem with this approach is that individual search terms, in aggregate, do not imply a specific or intended context or meaning of the search query itself, and as a result the search query is inherently ambiguous. For instance, a search term such as “apple” may be intended to refer to the fruit in one search query but intended to refer to the computer and consumer electronics maker Apple® Inc. in another search query. Even a combination of search terms such as “rotten” and “apple” is ambiguous, since one search query may be intended to refer to the rotten fruit and another search query may be intended to refer to rotten customer service experiences at Apple® Inc. retail stores. Due to search “ambiguity,” it is common for search engine results to include information sources that are relevant to each of the different interpretations of the meanings of the search terms. Even if information sources that are very relevant to the user are ranked relatively high in the search result list, it is generally the case that these results are intermixed with other information sources that are not relevant to the user. 
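The combining of ranking signals described above may be sketched as a weighted sum of per-page factors. The `Page` structure, the particular signals, and the weights below are illustrative assumptions only, and are not the actual formula of any search engine.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    keyword_count: int   # occurrences of the query terms in the page
    age_days: int        # how long the page has existed
    inbound_links: int   # number of other pages linking to this page

def relevance_score(page, w_freq=1.0, w_age=0.01, w_links=0.5):
    """Combine several signals into one score (illustrative weights)."""
    return (w_freq * page.keyword_count
            + w_age * page.age_days
            + w_links * page.inbound_links)

# Hypothetical pages: an older, well-linked page versus a newer,
# keyword-dense but obscure page.
pages = [
    Page("a.example", keyword_count=3, age_days=400, inbound_links=10),
    Page("b.example", keyword_count=5, age_days=30, inbound_links=1),
]
ranked = sorted(pages, key=relevance_score, reverse=True)
```

Under these assumed weights, the older and more heavily linked page ranks first despite containing fewer keyword occurrences, reflecting the popularity-based assumption described above. Note that no term in the score captures which *meaning* of the keywords the user intended, which is the ambiguity problem discussed next.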
Search ambiguity therefore complicates the user's task of locating the most relevant information sources, increases the time and effort that must be expended to find the most relevant information sources, and generally frustrates the user's attempt to locate the information that he or she wishes to find.
A different strategy for improving the relevance of search results that are provided to a user is based on the use of hashtags. Hashtags are widely used in social media services such as Twitter®, allowing users to tag posts and thereby facilitate the grouping and retrieval of posts relating to a specific topic. The Twitter® hashtag appears in the body of a tagged post as a word containing no spaces and preceded by the pound symbol, e.g. “#recycled.” Once a hashtag has been created, it may then be used by anyone to tag any post; however, community ire and usage policies tend to discourage the improper use of hashtags. Nevertheless, as evidenced by recent examples of hashtag campaigns that have gone awry, such as for instance McDonald's® Corporation's #McDStories hashtag campaign, the public nature of hashtags limits their usefulness for assigning relevance to the information that is tagged therewith. The #McDStories campaign was intended to encourage users to talk about their past good experiences at the company's restaurants. Although the company itself posted a series of positive messages including the hashtag #McDStories, other users tended to post negative messages and the company lost the ability to control the overall message of the campaign.
Hashtags are public and cannot be “retired” from public usage, meaning that hashtags can be used in theoretical perpetuity depending upon the longevity of the word or set of characters used. They also do not carry any set definitions, meaning that a single hashtag can be used for any number of purposes by those who make use of it. Thus the hashtag #apple can be used for posts about apples, for posts about Apple® Inc., and again for posts about Apple Records, etc. This inability to control the usage of a hashtag after it is created often limits the usefulness of public hashtags for the purpose of identifying the context or meaning of information that is stored on the Internet. Either a hashtag is used predominantly for a single reference and becomes useful, or it is used over time for diverse references and becomes difficult to disambiguate. Presently, hashtags are better suited for facilitating short-term discussions in social media applications, in which posts all relate to a common general topic but may focus on different aspects of that topic or provide different opinions and views.
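The grouping behavior described above can be sketched briefly: posts are grouped purely by the literal tag string, so a single tag such as “#apple” collects posts about entirely unrelated subjects. The posts below are invented examples.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")

def group_by_hashtag(posts):
    """Group posts under each hashtag they contain (case-insensitive)."""
    groups = defaultdict(list)
    for post in posts:
        for tag in HASHTAG.findall(post.lower()):
            groups[tag].append(post)
    return groups

# Three posts about different topics, all sharing one literal tag.
posts = [
    "Picked fresh fruit at the orchard today #apple",
    "The new laptop was just announced #apple",
    "Remastered album out now on the label #apple",
]
groups = group_by_hashtag(posts)
```

All three posts fall into the single "apple" group, with no mechanism for separating the fruit, the computer maker, and the record label, which illustrates why a public hashtag used for diverse references becomes difficult to disambiguate.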
It would be advantageous to overcome at least some of the above-mentioned disadvantages of the prior art.