1. Field of the Invention
The present invention relates to page information collection programs, page information collection methods, and page information collection apparatuses for collecting page information from a web site, and particularly to page information collection programs, page information collection methods, and page information collection apparatuses for collecting page information controlled by a web application.
2. Description of the Related Art
When a web site is built, each page in the web site must be verified to see whether it has been created as planned. It is hard to manually verify a massive web site, or a web site that has a complicated structure because of heavy use of web applications or the like. Therefore, systems have been designed to verify a web site automatically, for example by performing automatic input to the input fields in the web site.
An automatic web-site verification system automatically collects information on each page in a target web site. Page collection is important when a web site is tested without knowledge of the entire configuration of its data and its programs or web applications.
For instance, a page may be recognized as a concrete target of an entry test only after page collection. Pages are often organized in such a manner that page B can be acquired only by using certain data obtained on page A. A page that can be acquired only after log-in is an example. Each time page B is tested, page A must be acquired first, so a technology for automatically acquiring both page A and page B is desired.
A system has been provided to collect pages that can be referenced by following links from a given page in a web site. The user first enters information of a Hypertext Transfer Protocol (HTTP) request for acquiring a base page. The system issues the HTTP request, analyzes the HTTP response, and creates an HTTP request group only from the new link information found in the link information groups included in the HTTP response. The processing of issuing a request, analyzing a response, and creating a request group is repeated until all the HTTP request groups have been issued. The response analysis and the subsequent processing can be cancelled for a page which is reached by following a given number of links from the base page (refer to U.S. Pat. No. 6,584,569).
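The issue, analyze, and create loop described above can be sketched as follows. This is a minimal illustration, not the implementation of the referenced system: the names collect_pages, fetch, and SITE are hypothetical, the web site is replaced by an in-memory dictionary, and only href attributes of anchor tags are treated as link information.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags from a response body."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_pages(base_link, fetch, max_depth):
    """Issue requests, analyze responses, and queue only new links.

    fetch is a hypothetical callable that maps a link to a response body.
    A page reached after max_depth links is stored but not analyzed.
    """
    seen = {base_link}
    queue = deque([(base_link, 0)])
    collected = {}
    while queue:
        link, depth = queue.popleft()
        body = fetch(link)
        collected[link] = body
        if depth >= max_depth:
            continue  # response analysis is cancelled at the link limit
        parser = LinkExtractor()
        parser.feed(body)
        for new_link in parser.links:
            if new_link not in seen:
                seen.add(new_link)
                queue.append((new_link, depth + 1))
    return collected


# Hypothetical in-memory site used in place of real HTTP traffic.
SITE = {
    "/": '<a href="/a">A</a><a href="/b">B</a>',
    "/a": '<a href="/">home</a><a href="/c">C</a>',
    "/b": '<a href="/a">A</a>',
    "/c": '<a href="/d">D</a>',
    "/d": "",
}

pages = collect_pages("/", SITE.__getitem__, max_depth=2)
```

With a limit of two links, page /c is collected but its link to /d is never followed, which illustrates how the link limit bounds the processing load at the cost of missing pages.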
In page collection from a web site including web applications, the following trade-off must be considered: if a great number of links are followed, an enormous number of pages must be collected, imposing an excessively heavy processing load; if fewer links are followed, a great number of pages are missed, lowering the reliability of web-site verification.
Redundant page collection is avoided by following only links that appear for the first time in the link information included in the collected page information. Whether a certain link is found for the first time is determined by comparing a combination of a uniform resource locator (URL) and a parameter including a query parameter, for instance. If the URL and the common-gateway-interface (CGI) query parameter of the target link information match those of the link information of a page acquired before, it is determined that the link information has already been followed. The page indicated by such link information is not collected again.
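The first-appearance check can be illustrated with a short sketch. The function names exact_key and filter_new_links are hypothetical, and sorting the query parameters so that their order does not affect the comparison is an assumption made here, not something stated above.

```python
from urllib.parse import parse_qsl, urlsplit


def exact_key(link):
    """Build a comparison key from the URL path and every query name/value pair."""
    parts = urlsplit(link)
    # Assumption: parameter order is ignored, so the pairs are sorted.
    params = tuple(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.path, params)


def filter_new_links(links, followed):
    """Return only the links whose key is not yet in the followed set."""
    new_links = []
    for link in links:
        key = exact_key(link)
        if key not in followed:
            followed.add(key)
            new_links.append(link)
    return new_links


followed = set()
first = filter_new_links(["/foo.cgi?date=1", "/foo.cgi?date=1"], followed)
second = filter_new_links(["/foo.cgi?date=1", "/foo.cgi?date=2"], followed)
```

Here the repeated link /foo.cgi?date=1 is followed only once, while /foo.cgi?date=2 is treated as new because its parameter value differs.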
If the identity of link information is judged only by an exact match of the combination of a URL and a parameter including a query parameter, a great number of similar pages will be collected. Suppose that a scheduler web site uses link information such as /foo.cgi?date=1 and /foo.cgi?date=2 to display a user's timetable for a given day. The date is specified as the value of the query parameter in the link information. Because the pages for displaying the timetable of a day all have the same structure, the pages of all dates need not be acquired. However, if exact matching of the combination of a URL and a parameter including a query parameter is used to judge the identity of link information, the pages of all dates will be acquired. As a result, a great amount of unnecessary page verification decreases the processing efficiency of the system.
If the query parameter value is not compared, link information such as /foo.cgi?date=1 and /foo.cgi?date=2 is treated as identical, and only one of the date pages is acquired. This, however, can prevent a page whose structure differs depending on the query parameter value from being collected, even though such a page should be checked.
Suppose that the link information to a page for viewing a specified timetable is /bar.cgi?action=view and that the link information to a page for editing a timetable is /bar.cgi?action=edit. The view page and the edit page have different page structures and must be collected as different pages to be verified.
If the link information is compared not in terms of the query parameter value but in terms of the combination of the URL and the query parameter name, /bar.cgi?action=view and /bar.cgi?action=edit are assumed to be the same link information. Only one of the view page and the edit page is acquired, and the other page, which should also be verified, is missed.
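The two comparison strategies can be contrasted on the example links above. The following is a sketch for illustration only; the names exact_key, name_only_key, and links_followed are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def exact_key(link):
    """Key on the URL path plus every query parameter name and value."""
    parts = urlsplit(link)
    return (parts.path, tuple(sorted(parse_qsl(parts.query, keep_blank_values=True))))


def name_only_key(link):
    """Key on the URL path plus the query parameter names, ignoring values."""
    parts = urlsplit(link)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return (parts.path, tuple(sorted(name for name, _ in pairs)))


def links_followed(links, key_func):
    """Return the links that would be followed under the given key function."""
    seen = set()
    kept = []
    for link in links:
        key = key_func(link)
        if key not in seen:
            seen.add(key)
            kept.append(link)
    return kept


LINKS = [
    "/foo.cgi?date=1",
    "/foo.cgi?date=2",
    "/bar.cgi?action=view",
    "/bar.cgi?action=edit",
]

by_exact = links_followed(LINKS, exact_key)      # follows all four links
by_name = links_followed(LINKS, name_only_key)   # follows one link per script
```

Exact matching follows both date pages, which is redundant, whereas name-only matching drops /bar.cgi?action=edit, which should have been verified. Neither key alone gives the desired result for both scripts.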
Accordingly, a system which can collect all pages that should be checked while minimizing redundant collection of pages having identical data structures has been desired.
Reacquisition of an identical page may be required in automatic web-site verification. A system for reacquiring a page stores the HTTP request issued for page collection, for instance. When the user specifies a page by entering an item such as a URL, the system issues the request that was used to acquire the page. Then, the system receives a response to the issued request and outputs the response.
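The store-and-reissue behavior described above might look like the following sketch. The class and method names are hypothetical, and the transport is passed in as a callable so that the example runs without a network; a real system would issue HTTP requests instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredRequest:
    """The parts of an HTTP request kept for later reissue (illustrative)."""
    method: str
    url: str
    body: str = ""


class PageReacquirer:
    """Stores each request issued during collection, keyed by URL."""

    def __init__(self, send):
        self._send = send      # hypothetical callable: StoredRequest -> response
        self._requests = {}

    def collect(self, request):
        """Issue the request for page collection and remember it."""
        self._requests[request.url] = request
        return self._send(request)

    def reacquire(self, url):
        """Reissue the stored request for the page the user specified."""
        return self._send(self._requests[url])


def fake_send(request):
    """Stand-in transport that echoes the request line as the response."""
    return f"{request.method} {request.url} -> 200"


reacquirer = PageReacquirer(fake_send)
reacquirer.collect(StoredRequest("GET", "/foo.cgi?date=1"))
response = reacquirer.reacquire("/foo.cgi?date=1")
```

This approach works only when a single stored request is sufficient to reproduce the page.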
If an HTTP request is meaningful only when it is sent as part of a plurality of HTTP requests issued in a given sequence (transaction processing, for instance), the system for reacquiring a page cannot reacquire the correct page. The system can neither automatically recognize the failure of page reacquisition nor automatically locate the request causing the failure. Consequently, manual verification must be conducted, placing an excessive load on the user.
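The failure case can be reproduced with a small stateful stand-in for a server. Everything in this sketch is hypothetical, including the /login and /mypage paths and the response strings; it only illustrates why reissuing a single stored request is insufficient when the page depends on earlier requests.

```python
class FakeSession:
    """Hypothetical server-side session in which /mypage requires a prior /login."""

    def __init__(self):
        self.logged_in = False

    def send(self, url):
        if url == "/login":
            self.logged_in = True
            return "welcome"
        if url == "/mypage":
            return "my page" if self.logged_in else "please log in"
        return "not found"


def replay(urls):
    """Send the given URLs in order on a fresh session; return the last response."""
    session = FakeSession()
    response = None
    for url in urls:
        response = session.send(url)
    return response


single = replay(["/mypage"])              # only the stored request is reissued
sequence = replay(["/login", "/mypage"])  # the whole sequence is reissued
```

Reissuing only the request for /mypage yields the log-in prompt rather than the intended page, and nothing in the response itself tells the system which earlier request was missing.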