1. Field of the Invention
The present invention relates to searching for content on computer systems and networks. More particularly, the present invention relates to a system and method for searching for Internet-accessible content.
2. Related Art
The Internet represents the largest interconnected network of computer systems and smaller networks presently in existence. Since its inception, the number of nodes connected to the Internet has increased dramatically. As a result, a tremendous amount of content is currently available on the Internet, and is hosted by a multitude of websites. Collectively, these websites comprise the worldwide web (a.k.a., the “web”).
The ability to quickly locate relevant content on the Internet is a paramount concern for many users. To address this need, various web search engines have been implemented, each utilizing proprietary search methodologies and algorithms. One example is a search engine which includes a web crawler and associated technology to traverse, collect, parse, index, compress, and store content from the web. Users can query the search engine using one or more search terms, and are presented with links to websites offering content that the search engine determines to be relevant. The search results are generated using the “PageRank” algorithm, wherein candidate results are identified by simple text matching and are then prioritized according to the number of websites that link to a given website, which is taken as an indication of the popularity of that website. In the PageRank algorithm, pages of content on a website, and, sometimes, intra-page content, are represented as nodes in the network forming the web. The more incoming links (or connections) that a node has, the higher the rank that is associated with the node. The search engine also allows a user to view its cache of content for a particular website. Using the cache, the search engine highlights portions of retrieved content that match the user's search string query. This allows users to quickly scroll down through the page and to find areas of interest.
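By way of illustration only, the link-count prioritization described above may be sketched as follows. This is a minimal sketch of the simplified description given in this section (text matching followed by ordering on the number of incoming links), and is not the iterative, weighted computation used by the published PageRank algorithm. The function names `inlink_counts` and `search`, the sample pages, and the sample link graph are all hypothetical and are introduced solely for the example.

```python
from collections import defaultdict


def inlink_counts(links):
    """Count the distinct pages that link to each page.

    `links` maps a source page to the pages it links to (hypothetical format).
    Self-links and duplicate links from the same source are ignored.
    """
    counts = defaultdict(int)
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def search(query, pages, links):
    """Return pages containing every query term, most-linked first."""
    terms = query.lower().split()
    counts = inlink_counts(links)
    # Simple text matching: keep pages whose text contains all search terms.
    matches = [
        url for url, text in pages.items()
        if all(term in text.lower() for term in terms)
    ]
    # Prioritize by incoming-link count; ties are broken alphabetically.
    return sorted(matches, key=lambda url: (-counts[url], url))


# Hypothetical sample data (not taken from any real website).
pages = {
    "a.example": "Solar panel reviews",
    "b.example": "Solar panel installation guide",
    "c.example": "Gardening tips",
}
links = {
    "a.example": ["b.example"],
    "b.example": [],
    "c.example": ["a.example", "b.example"],
}

results = search("solar panel", pages, links)
```

In this sample, "b.example" has two incoming links and "a.example" has one, so "b.example" is listed first. A page that matches the query but has no incoming links would be ordered last, which reflects the disadvantage faced by newly published content under a purely link-based ranking.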
As web page sizes have grown, it has become increasingly difficult to locate a search string in the search results returned by existing search engines. Often, a user is forced to do a secondary search using the local web browser's search capability. Further, content cached by search engines is often outdated by several days or weeks. Since existing search engines rely on a repository of cached content which may be out of date, search results often do not accurately represent all relevant content available on the web. Moreover, existing search engines cannot adequately access “dynamic” content, i.e., content that is not stored on a website (as “static” content is) but is instead created when a user visits the website. Still further, existing search engines cannot adequately access “deep” content, e.g., database content that is accessible on the web, but is not stored in hypertext markup language (HTML) format. The content in such databases is typically accessed by manual generation of a user query. In response to the user query, an on-demand web page is dynamically generated and presented to the user. This page may or may not have HTML links to other content deep in the database. Since these pages do not exist prior to the query, nor do they survive long thereafter, they present special problems for the automated web “crawling” algorithms implemented in most search engines. As a result, the content is essentially invisible to popular search engines.
Another shortcoming of existing search engines is their inability to adequately track user feedback, and in particular, user satisfaction with search results. One attempt to track user feedback is a toolbar that can be installed for use with a web browser, and which allows for “click-through” measurement of user activities. This system also allows users to vote on the page being viewed. In particular, toolbars track user browsing habits by tracking the websites that users select after being presented with a webpage of search results by a search engine. However, existing toolbars suffer from a number of drawbacks, such as a lack of secure communications when votes are cast, as well as inadequate communication and interaction with the user as to the subject being voted on.
The collection of user information by search engines is known in the art. However, what is often not made clear to users is the specific data being collected, when it is collected, how it is transmitted, where and to whom it is transmitted, how it is used, and how long it is stored for potential future reference. Among other drawbacks, each user search query is recorded along with the user's IP address. Increasingly, users around the world are concerned about privacy while surfing the Web (including, for example, the collection of personally-identifiable information).
Other techniques for tracking user feedback are known in the art. One example is the “cookie,” which consists of a file generated by a website and stored locally on the user's machine after the user visits the site. The cookie stores information about a user's web browsing activities, and can later be accessed by the same website that stored the cookie on the user's system. Unfortunately, cookies do not allow a search engine provider to adequately gauge users' satisfaction and feedback regarding search results. Another technique for tracking user feedback involves allowing a user to save or bookmark sites that have been visited, and tracking such bookmarked information as an indication of the user's satisfaction with certain types of content. Still another technique relates to “click-through” measurement, wherein a user selects a specific search result, clicks on it, and this action is measured as an indication of relevance. However, these approaches rely on inferences as to the user's judgment based on browsing behavior, and such inferences are often inaccurate and incomplete.
Existing peer-to-peer networks also suffer from significant drawbacks. In particular, these networks require better search query routing abilities, and they presently lack the ability to permanently cache content at a local server (to help improve content availability and to avoid excessive file transfer times and bandwidth consumption). Additionally, there is a general need to provide an integrated search capability across the worldwide web, structured query language (SQL) based relational databases (e.g., the “deep” web), and peer-to-peer networks. At present, there is an excessive reliance in existing search engines on a single indication of relevance (e.g., PageRank), combined with a nearly myopic view of the worldwide web, which causes new, quality content to stay hidden from search engines for too long. As a result, it is almost necessary for webmasters to “game” the system to gain visibility for new content, while, at the same time, it is easier for low-quality content to spam the Internet. Meanwhile, older, existing content that has had time to collect incoming links is given an advantage (of several orders of magnitude) over new, potentially higher-quality content that no one has seen. Because no one has linked to such newer content, it has no discernible PageRank. This self-fulfilling characteristic of PageRank offers a skewed, non-quality-based view of the web to users. This, combined with the difficulty that search engines have in adequately crawling the rapidly growing web, leaves room for improvement.
Accordingly, what would be desirable, but has not yet been provided, is a system and method for searching Internet-accessible content, which addresses the foregoing limitations of existing search engines.