Today, various services are provided via networks, especially the Internet. The client devices that can be connected to the network range from high-performance computers, such as personal computers and workstations, to simple terminals, such as mobile phones and personal digital assistants (PDAs), that are suitable mainly for receiving and displaying content and performing simplified processes. Therefore, many web sites (web servers) dynamically create different kinds of content to provide services according to the capabilities of connected client devices or user agents.
A web server that returns web content according to the type of the client device can identify the type of the client device by referring to a User-Agent HTTP header in the HTTP (Hypertext Transfer Protocol) request from the client device. Using this mechanism, the server can create and return web content that is tailored according to the capability of the client.
Tailoring the content according to the capability of the client device means that, for example, the content is segmented according to the client's memory capacity (i.e., the size of data that can be read at one time) and downloaded in multiple segments, or the content is altered to exclude images for client devices that cannot display images.
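By way of illustration only, the User-Agent-based tailoring described above might be sketched as follows. The device classes, the substring rules in `classify_client`, and the returned markup are hypothetical assumptions chosen for the example and are not taken from any particular server.

```python
def classify_client(user_agent):
    """Map a User-Agent header value to a coarse device class.

    The substring rules below are hypothetical; a real site would use
    its own table of known user agents.
    """
    ua = user_agent.lower()
    if "mobile" in ua or "pda" in ua:
        return "simple_terminal"
    return "desktop"


def build_response(headers):
    """Return content tailored to the capability of the requesting client."""
    device = classify_client(headers.get("User-Agent", ""))
    if device == "simple_terminal":
        # Assumed here: simple terminals cannot display images.
        return "<p>Product list</p>"
    return "<p>Product list</p><img src='catalog.png'>"


phone_page = build_response({"User-Agent": "ExamplePhone Mobile/1.0"})
desktop_page = build_response({"User-Agent": "ExampleDesktop/1.0"})
```

In this sketch a request whose User-Agent value is not recognized falls through to the desktop branch, which corresponds to a site that simply assumes a type for an unknown client.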
It is now commonplace to use a search engine or search site to select and acquire desired content from the enormous storehouse of information provided by the World Wide Web. The search site acquires in advance and holds information about gaining access to the content of the web and returns the information to the client upon request. One principal type of search site today uses a robot that automatically gains access to sites on the network by following hyperlinks (hereinafter referred to as links) to the content, collects the information for accessing the content at each site, and uses that information to respond to search requests from client devices. For example, such systems are described in Japanese Published Unexamined Patent Applications No. 2002-215642 and No. 2002-259432.
FIG. 17 is a block diagram showing the configuration of a robot type search site. As shown in FIG. 17, the robot type search site 1710 comprises a web-crawling robot 1711 for automatically gaining access to a site 1720 on a network to acquire information for accessing the content at the site, a database 1712 for registering and accumulating the information acquired from the web-crawling robot 1711, and a search engine 1713 for searching the database 1712 by accepting a search request from a client device 1730 and returning the search result.
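The three components of FIG. 17 can be sketched, under simplifying assumptions, as a link-following robot, an in-memory database, and a search function. The `SITES` dictionary stands in for the network, and all URLs, page texts, and function names are hypothetical; the sketch is not the implementation of any actual search site.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> (page text, outgoing links).
SITES = {
    "http://example.test/a": ("weather news", ["http://example.test/b"]),
    "http://example.test/b": ("sports news", ["http://example.test/a"]),
}


def fetch(url):
    """Return (text, links) for a URL from the stand-in network."""
    return SITES[url]


def crawl(start_url):
    """Robot (1711): follow links breadth-first and collect page text."""
    database = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in database:
            continue  # already visited
        text, links = fetch(url)
        database[url] = text  # register in the database (1712)
        queue.extend(links)
    return database


def search(database, term):
    """Search engine (1713): return URLs whose text contains the term."""
    return sorted(url for url, text in database.items() if term in text.split())


database = crawl("http://example.test/a")
hits = search(database, "sports")
```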
As described above, at the robot type search site, the web-crawling robot gains access to the site on the network, and acquires the information (e.g., URLs (uniform resource locators)) for accessing the content. However, when the web-crawling robot gains access to a site that dynamically creates content according to the capability of the client, the site recognizes the access as being from the web-crawling robot, and responds accordingly. Unfortunately, this means that the web-crawling robot cannot acquire all of the different adaptations of the information potentially provided by the site according to the clients' capabilities. Also, some sites may invoke an error processing routine for the unknown type of access, or just assume an appropriate type. These actions may degrade the reliability of the search site.
Assume, for example, that a site S1 has content C1 of size 5 KB, which may be sent in two ways. The first way is to send the 5 KB as a single block to a client device A that has a maximum readable data capacity of, for example, 6 KB. The second way is to segment the content and send the segments sequentially as c01, c02, and c03, of 2 KB, 2 KB, and 1 KB, respectively, to a client device B that has a maximum readable data capacity of 2 KB.
When the web-crawling robot accesses the site S1, the site S1 treats the web-crawling robot as the client device A. Thus, only the content C1 having a size of 5 KB is registered in the database at the search site S2.
Now suppose that the client device B searches for content associated with a predetermined search term at the search site S2, and hits the content C1. When the client device B sends an HTTP request to the site S1 in accordance with this search result, only the content c01, the first of the three segments, is returned from the site S1, as described above. At this time, if the matched search term exists in the content c02 or c03, the client device B cannot obtain the desired information, even though it has made its access on the basis of the search results at the search site S2.
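The scenario of the site S1, the content C1, and the client devices A and B can be worked through in a short sketch. The byte layout of `C1`, the placement of the search term, and the function names are assumptions made only for this illustration; 1 KB is taken as 1024 bytes.

```python
KB = 1024
TERM = b"KEYWORD"

# Hypothetical 5 KB content C1 with the search term placed 3 KB in,
# so that it falls inside the second 2 KB segment.
C1 = b"a" * (3 * KB) + TERM + b"b" * (2 * KB - len(TERM))


def segment(content, capacity):
    """Split content into pieces no larger than the client's capacity."""
    return [content[i:i + capacity] for i in range(0, len(content), capacity)]


def first_response(content, capacity):
    """What site S1 returns first to a client with the given capacity."""
    return segment(content, capacity)[0]


# The robot is treated as client device A (capacity 6 KB): it indexes all of C1.
indexed_by_robot = first_response(C1, 6 * KB)
term_in_index = TERM in indexed_by_robot

# Client device B (capacity 2 KB) follows the search hit and receives only c01.
c01, c02, c03 = segment(C1, 2 * KB)
term_in_first_segment_for_b = TERM in first_response(C1, 2 * KB)
```

Here `term_in_index` is true while `term_in_first_segment_for_b` is false: the term is registered at the search site S2 but lies in c02, so the first response received by the client device B does not contain it.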
The aforementioned Japanese Published Unexamined Patent Application No. 2002-259432 discloses a technique for analyzing the substance of the content to calculate and evaluate the degree to which the predetermined content is suitable for the client device, and choosing the search result that is most appropriate for the type of client device to the extent possible. However, the disclosed technique does not allow for the acquisition of content in consideration of the various types of client devices when the web-crawling robot collects the content.