1. Field
Embodiments of the invention relate to handling error documents in a text index.
2. Description of the Related Art
The World Wide Web (also known as WWW or the “Web”) is a collection of Internet servers that support Web pages that may include links to other Web pages. A Uniform Resource Locator (URL) indicates a location of a Web page (which is a type of document). Also, each Web page may contain, for example, text, graphics, audio, and/or video content. For example, a first Web page may contain a link to a second Web page. Thus, the Web may be described as a series of interconnected Web pages with links connecting the Web pages from different Web sites together. A Web site may be described as a related set of Web pages.
A Web browser is a software application that is used to locate and display Web pages. Currently, there are billions of Web pages on the Web.
Web search engines are used to retrieve Web pages on the Web based on some criteria (e.g., entered via the Web browser). That is, Web search engines are designed to return relevant Web pages given a keyword search request (also known as a search request). For example, the search request “HR” issued against a company intranet search engine is expected to return relevant pages in the intranet that are related to Human Resources (HR). The Web search engine uses indexing techniques that relate search terms (e.g., keywords) to Web pages.
In a text indexing system, which fetches and indexes documents (e.g., Web pages from the Web) using a text index, there is potential for encountering documents with errors (also referred to as error documents). That is, many documents on the Web have syntax errors that may cause a parser to ignore certain parts of those documents. Also, sometimes an incorrect data format is specified for a document, such as a binary word file masquerading as a plain text file. These errors could cause the documents in question to be indexed incorrectly or not indexed at all. When a document is not indexed because of such an error, an administrator needs a quick and easy way to find out what happened during the processing of that document.
In particular, each document fetched is identified by a unique string called a URL. All URLs are assumed to be unique throughout the Web. If a document with the same URL is received later, it is considered an update of the document received earlier. So, assume the text indexing system received four URLs: A, B, C, and D. Assume also that URLs A, C, and D could be parsed and indexed properly, while URL B contains an error that prevents it from being indexed. In a typical text index processing system, URLs A, C, and D are added to the text index, while URL B's error is written out into a log.
To find the status of a URL, an administrator would go to the text index to see whether that URL has been indexed. If the URL is not in the index (as would be the case for URL B), then the administrator would go to the log file to see whether there is any error for the URL. The drawback of this approach is that the log file may grow large and requires maintenance to purge it of old records that are no longer applicable. For instance, if, at a later time, URL B is received and can be indexed without error, then the log file should be updated to remove the now obsolete error entry for URL B. On the other hand, if, at the later time, URL B is received again and a new error appears, then the log file should be updated to reflect the new error. Also, to save space, log files are often overwritten after a few days, so an error record may be lost while it is still relevant. Thus, the traditional method of logging the errors in an error file does not work well.
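The index-plus-log arrangement described above can be illustrated with a minimal Python sketch. It is not taken from any particular indexing system: the class name `LogBasedIndexer`, its methods, and the error strings are hypothetical, and the "index" is reduced to a dictionary. The sketch shows only how a status lookup has to consult two places and how an append-only log accumulates obsolete entries.

```python
from typing import Dict, List, Optional, Tuple


class LogBasedIndexer:
    """Toy model (hypothetical) of a text index paired with an error log."""

    def __init__(self) -> None:
        self.index: Dict[str, str] = {}        # URL -> indexed content
        self.log: List[Tuple[str, str]] = []   # append-only (URL, error message)

    def process(self, url: str, content: str = "",
                error: Optional[str] = None) -> None:
        """Index the document, or append its error to the log."""
        if error is not None:
            self.log.append((url, error))
        else:
            self.index[url] = content

    def status(self, url: str) -> str:
        """Look in the index first, then fall back to scanning the log."""
        if url in self.index:
            return "indexed"
        errors = [msg for logged_url, msg in self.log if logged_url == url]
        if errors:
            return "error: " + errors[-1]      # most recent error wins
        return "unknown"

    def stale_log_entries(self) -> List[Tuple[str, str]]:
        """Log entries for URLs that have since been indexed successfully."""
        return [(u, msg) for u, msg in self.log if u in self.index]


indexer = LogBasedIndexer()
indexer.process("A", "text of A")
indexer.process("B", error="parse failure")
indexer.process("C", "text of C")
indexer.process("D", "text of D")
status_before = indexer.status("B")    # found only by scanning the log

indexer.process("B", "text of B")      # a later, error-free copy of B
status_after = indexer.status("B")
stale = indexer.stale_log_entries()    # the old error for B is still logged
```

After the second copy of B is indexed, the log still holds the earlier error for B until some separate maintenance step purges it, which is the drawback noted above.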
One alternative to a log file is to make use of a relational table for storing either all the processing results or just the errors. With a relational table, modifications of the error data for documents that may have been updated may be handled easily because a relational database provides update capabilities. This approach, on the other hand, requires the presence of a relational database, and special code needs to be written for interfacing with the relational tables, which are distinct and separate from the text index lookup. In addition, use of a relational table may have a negative impact on performance.
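As a rough illustration of the relational-table alternative, the following Python sketch uses the standard `sqlite3` module. The table name `doc_status`, its columns, and the helper functions are assumptions made for this example only; a real deployment would likely use a different database and schema. The point is that an update for a URL replaces its earlier row, and that this code path is separate from any text index lookup.

```python
import sqlite3
from typing import Optional


def open_status_table(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) a hypothetical per-URL status table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_status ("
        " url TEXT PRIMARY KEY,"   # one row per URL
        " error TEXT)"             # NULL means the last attempt succeeded
    )
    return conn


def record_result(conn: sqlite3.Connection, url: str,
                  error: Optional[str] = None) -> None:
    """Insert the latest result for a URL, replacing any earlier row."""
    conn.execute(
        "INSERT OR REPLACE INTO doc_status (url, error) VALUES (?, ?)",
        (url, error),
    )
    conn.commit()


def lookup_error(conn: sqlite3.Connection, url: str) -> Optional[str]:
    """Return the recorded error, or None if there is none or the URL is unknown."""
    row = conn.execute(
        "SELECT error FROM doc_status WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None


conn = open_status_table()
record_result(conn, "B", "parse failure")
first_error = lookup_error(conn, "B")

record_result(conn, "B", "bad encoding")   # a new error replaces the old one
second_error = lookup_error(conn, "B")

record_result(conn, "B")                   # B is later processed without error
final_error = lookup_error(conn, "B")
row_count = conn.execute("SELECT COUNT(*) FROM doc_status").fetchone()[0]
```

The table never holds more than one row per URL, so no purge step is needed, but every status query now goes through the database rather than the text index.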
Thus, there is a need in the art for improved handling of error documents in an index.