The present invention relates to a system and method for maintaining up-to-date Web page link information in a metadata repository that is part of a Web search engine.
The World Wide Web service (Web) of the Internet is an increasingly popular tool for communicating information. The volume of information available on the Web is so great that users seeking information require help, in the form of search engines. Conventional search engines examine available information, such as Web pages and files, and generate indexes relating search terms to links (URLs) which point to the information. All search engines face the challenge of keeping their indexes current. Indexes go out of date because Web authors often move or remove files from locations that have already been indexed. Attempts to follow links to these pages or files result in errors.
Because such changes occur constantly, search engines must continually update their indexes to avoid returning, in response to a user query, URLs that reference data sources (documents) that are missing or have been moved. Conventional methods of updating search engine indexes tend to be time-consuming and costly to perform. A need arises for a faster and less expensive technique for updating search engine indexes.
An additional problem arises when a search engine maintains link structure information. Link structure information may be maintained for several purposes, such as for generating rank information of search results. Rank information is useful to the user because it allows the search engine to present query results ordered so that the links more likely to be relevant to the query appear first. Typically, link structure information is stored in a metadata repository. In this situation, the search engine must also update all of the metadata for pages which contain links to or were linked from the outdated URL. This additional required updating is even more time-consuming and costly. A need arises for a technique by which search engine indexes and link structure information can be updated more efficiently than with prior techniques, which will provide time and cost savings.
The present invention is a system and method for updating search engine indexes and link structure information that is more efficient, less time-consuming and less costly than prior techniques. The present invention provides improved updates to metadata link information for Web pages which have been permanently moved or have been deleted. In addition to using the database architecture present in the search engine, the present invention takes advantage of the RDF format of the metadata. For a search engine which indexes structural as well as textual information, the present invention provides an efficient way to keep the metadata repository up-to-date without having to download and recrawl all the pages. Only the response code and location are required from the server. The database link table serves as a solid reference to the parent-child structure of the engine's domain. The present invention takes advantage of this resource to maintain and update rich summaries of Web data. The present invention is also extensible to handle other URL status changes.
The present invention is a system and method for updating search engine information. In order to carry out the method, a uniform resource locator indicating a Web page for which the search engine information is to be updated is selected. A server on which the indicated Web page is located is contacted to obtain the Web page. A response code indicating a status of the Web page is received, and the search engine information is updated based on the response code.
The response code may indicate that the Web page cannot be found, and the updating step may comprise the step of deleting information relating to the Web page from the search engine information. The response code may indicate that the Web page has been moved, and the updating step may comprise the step of modifying information relating to the Web page that is included in the search engine information.
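The mapping from response code to update action described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names fetch_status and plan_update are hypothetical, the use of an HTTP HEAD request is an assumption (the text only says the server is contacted and a response code is received), and treating 404 and 410 as "cannot be found" and 301 as "has been moved" is likewise an assumption about which HTTP status codes are meant.

```python
from http import HTTPStatus
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlsplit


def fetch_status(url, timeout=10.0):
    """Return (status_code, location_header) for url.

    Assumption: a HEAD request is used so that only the response code and
    the Location header are transferred, not the page body.
    """
    parts = urlsplit(url)
    conn_cls = HTTPSConnection if parts.scheme == "https" else HTTPConnection
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def plan_update(status_code, location=None):
    """Map a response code to an update action (hypothetical action names)."""
    # Assumption: 404 Not Found and 410 Gone both mean "cannot be found".
    if status_code in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):
        return ("delete", None)
    # Assumption: 301 Moved Permanently with a Location header means "moved".
    if status_code == HTTPStatus.MOVED_PERMANENTLY and location:
        return ("move", location)
    # Any other status leaves the search engine information unchanged here.
    return ("keep", None)
```

Keeping plan_update separate from the network call means the decision logic can be exercised without contacting a server, and further status codes could be added in one place if other URL status changes are to be handled.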
In order to carry out the deleting step, a plurality of parent uniform resource locators related to the selected uniform resource locator are received. All instances of the selected uniform resource locator are deleted from the search engine information. Metadata summarizing Web pages that reference the selected uniform resource locator is updated, and metadata summarizing the Web page indicated by the selected uniform resource locator is deleted.
In order to carry out the step of updating metadata summarizing Web pages, existing RDF summaries for each parent uniform resource locator in the search engine information may be modified to remove references to the selected uniform resource locator, and the associated annotation information, from each parent's list of out-links. Alternatively, the step of updating metadata summarizing Web pages may be carried out by resummarizing metadata information for each parent uniform resource locator in the search engine information to create new RDF summaries with updated information.
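The deleting step can be sketched with simple in-memory stand-ins. Everything named here is an assumption made for illustration: link_table is a set of (parent, child) URL pairs standing in for the database link table, summaries is a dictionary standing in for the RDF summaries, and the keys "out_links" and "annotations" are hypothetical. The sketch follows the first alternative (modifying existing summaries) rather than resummarizing.

```python
def delete_url(link_table, summaries, dead_url):
    """Remove dead_url from the link table and from its parents' summaries.

    link_table: set of (parent_url, child_url) pairs (hypothetical stand-in).
    summaries:  dict url -> {"out_links": [...], "annotations": {...}}
                (hypothetical stand-in for RDF summaries).
    Returns the sorted list of parent URLs whose summaries were considered.
    """
    # 1. Receive the parent URLs related to the selected URL.
    parents = sorted(p for (p, c) in link_table if c == dead_url)

    # 2. Delete all instances of the selected URL from the link table.
    stale_rows = {row for row in link_table if dead_url in row}
    link_table.difference_update(stale_rows)

    # 3. Update each parent's summary: drop the URL and its annotation
    #    from the parent's list of out-links.
    for parent in parents:
        summary = summaries.get(parent)
        if summary is None:
            continue
        summary["out_links"] = [u for u in summary["out_links"] if u != dead_url]
        summary["annotations"].pop(dead_url, None)

    # 4. Delete the summary of the page that no longer exists.
    summaries.pop(dead_url, None)
    return parents
```

Note that no page is downloaded at any point: the parents are found from the link table alone, which is the saving the text attributes to the approach.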
In order to carry out the modifying step, a uniform resource locator indicating a new location of the Web page indicated by the selected uniform resource locator is received. A plurality of parent uniform resource locators and child uniform resource locators related to the selected uniform resource locator are received. All instances of the selected uniform resource locator are replaced with the uniform resource locator indicating the new location of the Web page. Metadata summarizing Web pages that reference the selected uniform resource locator is updated. The uniform resource locator indicating the new location of the Web page is crawled to update metadata summarizing the Web page, and metadata summarizing the Web page indicated by the selected uniform resource locator is deleted.
In order to update metadata summarizing Web pages, existing RDF summaries in the search engine information may be modified by replacing the selected uniform resource locator with the uniform resource locator indicating the new location of the Web page in the summary of each parent uniform resource locator and child uniform resource locator related to the selected uniform resource locator. Alternatively, metadata summarizing Web pages may be updated by summarizing each parent uniform resource locator and child uniform resource locator among the received parent uniform resource locators and child uniform resource locators to create new RDF summaries.
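The modifying step can be sketched in the same spirit. The names are again assumptions for illustration only: link_table is a set of (parent, child) URL pairs, summaries is a dictionary standing in for the RDF summaries with hypothetical "out_links" and "in_links" keys, and crawl is a hypothetical callback that returns a fresh summary for a URL. The sketch uses the first alternative (in-place replacement in existing summaries).

```python
def move_url(link_table, summaries, old_url, new_url, crawl):
    """Replace old_url with new_url in the link table and related summaries.

    link_table: set of (parent_url, child_url) pairs (hypothetical stand-in).
    summaries:  dict url -> {"out_links": [...], "in_links": [...]}
                (hypothetical stand-in for RDF summaries).
    crawl:      hypothetical callable returning a new summary for a URL.
    Returns (parents, children) of the moved URL as sorted lists.
    """
    # 1. Receive the parent and child URLs related to the selected URL.
    parents = sorted(p for (p, c) in link_table if c == old_url)
    children = sorted(c for (p, c) in link_table if p == old_url)

    def swap(url):
        return new_url if url == old_url else url

    # 2. Replace all instances of the old URL in the link table.
    updated_rows = {(swap(p), swap(c)) for (p, c) in link_table}
    link_table.clear()
    link_table.update(updated_rows)

    # 3. Update summaries of parents (out-links) and children (in-links).
    for parent in parents:
        summary = summaries.get(parent)
        if summary is not None:
            summary["out_links"] = [swap(u) for u in summary.get("out_links", [])]
    for child in children:
        summary = summaries.get(child)
        if summary is not None:
            summary["in_links"] = [swap(u) for u in summary.get("in_links", [])]

    # 4. Delete the old page's summary and crawl only the new location.
    summaries.pop(old_url, None)
    summaries[new_url] = crawl(new_url)
    return parents, children
```

Only the new location is crawled; the parents and children are updated by substitution, so their pages do not need to be fetched again.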