The World Wide Web is structured as a “two-party” system, in which a first party, the computer user, receives content from a second party, the Web server. The user typically requests the content in the form of mark-up language documents, such as Web pages written in HTML. In order to retrieve the desired Web page, the user submits a particular URL (uniform resource locator) to the Web server, which retrieves and transmits the desired Web page to the computer of the user. However, the user must know the correct URL, or else the Web page cannot be retrieved.
Since there are many Web pages available through the World Wide Web, search engines have evolved to assist the user in the search for a particular Web page. These search engines index Web pages according to one or more keywords, such that when the user submits a query, those Web pages whose keywords are the same as or similar to the keywords of the query are retrieved. Search engines may receive Web pages (or pointers to those Web pages, such as URLs) by submission from the author of the page(s), but the search engines also actively search for new Web pages. Typically, such active searches are performed automatically by autonomous software programs called “spiders” or “crawlers”.
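The keyword indexing described above can be illustrated with a minimal sketch. The function names `build_index` and `search`, the sample URLs and the sample text are assumptions made for illustration only. The sketch matches exact keywords and does not attempt the "similar keyword" matching that a real search engine may perform.

```python
from collections import defaultdict


def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every keyword of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the pages matching the first keyword, then narrow down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, keyed by URL.
pages = {
    "http://example.com/a": "cheap flights to Paris",
    "http://example.com/b": "Paris hotel deals",
}
index = build_index(pages)
print(search(index, "paris hotel"))
```

The URL serves as the key under which each page is recorded, so a page that has no stable URL cannot be represented in such an index.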
These autonomous software programs search through the World Wide Web by extracting links from known Web pages in order to locate the new Web pages to which the links point. As each new Web page is located, it is indexed and added to the database of the search engine, and new links are extracted from that Web page. Search engines use the URL as a unique identifier of the indexed page. Thus, the autonomous software programs depend upon two assumptions. The first assumption is that Web pages exist as static entities, to which links remain stable. The second assumption is that Web pages have incoming links pointing to them.
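The link-following process described above can be sketched as follows. This is a simplified illustration, not the implementation of any particular search engine: the `crawl` function, the `LinkExtractor` class and the in-memory `FAKE_WEB` dictionary (which stands in for real network retrieval) are all assumptions.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Follow links breadth-first; the URL is the unique key of each page."""
    index = {}
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        if url in index:
            continue  # already indexed under this URL
        html = fetch(url)
        if html is None:
            continue  # the link no longer resolves
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(link for link in parser.links if link not in index)
    return index


# Hypothetical three-page Web; the "orphan" page has no incoming links.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a>',
    "http://example.com/orphan": "<p>no incoming links</p>",
}
index = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(index))
```

In this sketch the page at the hypothetical "orphan" URL is never indexed, because no known page links to it, which illustrates the second assumption.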
However, many Web pages today are provided as dynamic Web pages, which are created in real time or “on the fly” from a plurality of components stored in a database. Dynamic Web pages are created upon submission of a query by a user, which determines the identity of the components to be retrieved and assembled into the Web page. For example, a URL for a dynamic Web page, if it exists, may appear as follows: http://domain.com/search.asp?p1=v1&p2=v2. The term “search.asp” is the name of the application to be invoked; it is followed by a “?” sign and a list of parameters and their values. Many autonomous software search programs are designed to ignore such links, since automatically following this type of link may cause an infinite recursion which the autonomous software program cannot properly handle. Thus, dynamic Web pages are often not indexed (because filters reject such Web pages automatically during the indexing process), or are even “un-indexable”, because the only way to generate such a page is by submitting a query through a form, rather than by following a regular hyperlink of the kind used by search engines to locate new pages.
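The structure of the example URL, and the kind of filter that rejects such links, can be illustrated with the Python standard library. The helper names `describe` and `is_dynamic` are hypothetical, and treating any URL that carries a query string as dynamic is an assumed, deliberately crude rule.

```python
from urllib.parse import parse_qs, urlsplit


def describe(url):
    """Split a URL into the invoked application name and its parameters."""
    parts = urlsplit(url)
    application = parts.path.rsplit("/", 1)[-1]
    return application, parse_qs(parts.query)


def is_dynamic(url):
    """Assumed filter rule: any URL with a query string is treated as dynamic."""
    return bool(urlsplit(url).query)


url = "http://domain.com/search.asp?p1=v1&p2=v2"
app, params = describe(url)
print(app, params, is_dynamic(url))
```

A crawler that applies a rule such as `is_dynamic` before following a link would skip the example URL entirely, so the page behind it would never reach the index.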
Content from Web pages may be extracted for direct submission to a search engine, for example through a direct feed mechanism. Various search engines, such as AltaVista™, now receive data through such a direct mechanism. Typically, each such search engine has a specification for determining the format in which the data should be received. Most search engines require the data or “feeds” to be transferred as an XML file, but other formats could also be used. Typically, the feeds include two kinds of information per Web page: information that will be displayed in the search results, namely the title, a short description, the link URL (the link behind the title) and the display URL, which appears under the description; and information that will be indexed but not displayed, such as meta keywords and the content of the page.
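By way of illustration only, a feed carrying the fields listed above might be assembled as follows. Each search engine defines its own format, so the element names used here (`feed`, `page`, `title`, `description`, `link_url`, `display_url`, `meta_keywords`, `content`) and the sample values are assumptions and do not reflect the specification of any actual search engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; a real feed specification would dictate these.
DISPLAYED_FIELDS = ("title", "description", "link_url", "display_url")
INDEXED_ONLY_FIELDS = ("meta_keywords", "content")


def build_feed(pages):
    """Serialize a list of page dictionaries into a single XML feed string."""
    root = ET.Element("feed")
    for page in pages:
        entry = ET.SubElement(root, "page")
        for field in DISPLAYED_FIELDS + INDEXED_ONLY_FIELDS:
            ET.SubElement(entry, field).text = page[field]
    return ET.tostring(root, encoding="unicode")


feed_xml = build_feed([
    {
        "title": "Flights to Paris",
        "description": "Fares found for the submitted query.",
        "link_url": "http://domain.com/search.asp?p1=v1&p2=v2",
        "display_url": "domain.com",
        "meta_keywords": "flights, Paris",
        "content": "Full text of the dynamically generated page.",
    }
])
print(feed_xml)
```

Note that the "&" character in the link URL is escaped automatically when the XML is serialized, and is restored when the feed is parsed.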