Current web crawlers analyze websites without any notion of a particular user owning the website. Therefore, the information gathered from crawling a website is limited to how a web crawler is configured to examine the website. For example, a web crawler may be configured to identify certain key words and/or phrases on websites. A web crawler may also be configured to analyze the structure of the text and graphics on websites in order to obtain a more accurate understanding of the content of the websites.
In addition to understanding a website, it is desirable for a web crawler to be able to authenticate a site as special, such as by associating the site with an owner. In such a situation, the entity operating the web crawler may provide the owner of a website with useful and private information about the website, such as where website traffic is coming from, what sites link to the website, the “health” and errors of the website, etc. However, it is important that such private information is not shared with a competitor of the owner or some other imposter, because the private information in the possession of others may put the owner of the website at an unfair disadvantage. In this particular case, and in the more general case of authenticating the site as having special attributes, it is important to have a secure authentication mechanism to prevent malicious spoofing.
It is possible to verify that a user is the owner or an authorized representative of a website by adhering to the following procedure. As an example, a user might want confidential information pertaining to XYZ.com. First, the user initiates a session, e.g., via a browser, with the entity that owns a particular web crawler. In the session, the user claims to own, or at least to be authorized to modify, XYZ.com. Second, the entity provides a filename to the user, such as “filename314159265”. Third, the user creates a file on the website with that filename and then notifies the entity. Fourth, the website XYZ.com is searched (e.g., by the web crawler) and the web crawler determines whether a file with the filename “filename314159265” exists on the website. The web crawler knows that the file was not found if a 404 error message is sent to the web crawler. The 404 or Not Found error message is an HTTP standard response code indicating that a client (i.e., the web crawler in this example) was able to communicate with the server hosting the website, but the server either could not find the file that was requested or was configured not to fulfill the request and not to reveal the reason why. If a 404 error message is returned to the web crawler, then the entity does not trust the user and will not provide confidential information about XYZ.com to the user.
If the web crawler does not encounter a 404 error message, then that may be interpreted as an indication that the file with “filename314159265” as the filename is stored on the website. Consequently, the entity is confident that the user owns the website and/or is authorized to make modifications to the website. As a result, the entity may provide confidential information to the user about the website.
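The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not a definitive implementation: the names `issue_filename`, `is_verified`, and `make_server` are hypothetical, and the network request is replaced by a stand-in function so that the example runs without contacting any real site.

```python
import secrets
from typing import Callable, Set

# Hypothetical type: given a URL, return the HTTP status code the server sent.
FetchStatus = Callable[[str], int]


def issue_filename() -> str:
    """Second step: the entity hands the user a hard-to-guess filename."""
    return "filename" + secrets.token_hex(8)


def is_verified(site: str, filename: str, fetch_status: FetchStatus) -> bool:
    """Fourth step: trust the user unless the server answers 404 for the file."""
    status = fetch_status(f"http://{site}/{filename}")
    return status != 404


def make_server(existing_urls: Set[str]) -> FetchStatus:
    """Stand-in for a real web server (assumption): 200 for known URLs, else 404."""
    def fetch_status(url: str) -> int:
        return 200 if url in existing_urls else 404
    return fetch_status


# A user who really created the file passes; one who did not is rejected.
name = issue_filename()
owner_site = make_server({f"http://XYZ.com/{name}"})
impostor_site = make_server(set())
owner_ok = is_verified("XYZ.com", name, owner_site)
impostor_ok = is_verified("XYZ.com", name, impostor_site)
```

Note that the check relies entirely on the status code: any response other than 404 is treated as proof that the file exists.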
However, a problem exists when following the above approach. Many web servers are configured to not provide a 404 error message even if a file is not found on the website as long as the domain name in a URL is correct. Instead, such web servers return a 200 response code (which indicates that the request for the file has succeeded) with accompanying text that states that the requested page was not found. This web server response is known as a “soft 404”. Because the web crawler received a 200 response code, the entity may mistakenly believe that the user is authorized to modify the website and consequently provide confidential information about the website to the user.
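The failure can be illustrated with a small simulation. The sketch below assumes a hypothetical server, `soft_404_server`, that answers every path with a 200 response code and a "not found" page, and a hypothetical check, `naive_is_verified`, that looks only at the status code. No real network request is made.

```python
from typing import Callable, Tuple

# Hypothetical type: given a URL, return (HTTP status code, response body).
Fetch = Callable[[str], Tuple[int, str]]


def soft_404_server(url: str) -> Tuple[int, str]:
    """Simulated soft-404 server (assumption): every path yields 200 plus an error page."""
    return 200, "<html><body>Sorry, the requested page was not found.</body></html>"


def naive_is_verified(url: str, fetch: Fetch = soft_404_server) -> bool:
    """Status-code-only check: anything other than 404 counts as 'file exists'."""
    status, _body = fetch(url)
    return status != 404


# The user never created the file, yet the status-code-only check passes.
fooled = naive_is_verified("http://XYZ.com/filename314159265")
```

Here `fooled` ends up true even though the body of the response plainly says the page was not found, which is exactly the situation in which confidential information could be released to an unauthorized user.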
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.