World Wide Web (WWW) sites are commonly known as "servers," while the computers that are used to access information stored on the servers are commonly known as "clients." A WWW site on the Internet produces a chronological log of requests from clients on the Internet. The protocol of the WWW is stateless, meaning that there is no sustained connection between the client and server. Typically, a WWW site contains numerous "pages" of information, and all the pages are interconnected by hypertext links to form a directed graph of pages.
In most instances, a client will first access the "home page" of a WWW site, and then will access a sequence of pages at the site via the hypertext links between those pages. In addition, each page at a WWW site can include references to numerous files, typically image files, that must be downloaded onto a client computer (i.e., a user's computer) before the page can be viewed. When a client computer requests a page by following a hypertext link on another page, the "browser" software used on the client computer to access the WWW site will first download that page to local memory. Next, it determines from the information in the requested page what additional files, if any, it needs from the server in order to complete the generation of the image for the page. It then downloads those additional files, and finally generates the page image for the user to view. Thus, to view a single page, the browser running on the client computer may request and download numerous files from a WWW site. Therefore, the number of object requests in the WWW site's access log file (often called "hits") will typically greatly exceed the number of distinct client sessions in which clients are accessing information from the WWW site.
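The gap between hits and client sessions can be illustrated with a minimal sketch. The log entries, host names, and the 30-minute idle timeout used to separate sessions are all hypothetical assumptions introduced for illustration; they are not taken from any particular server's log format.

```python
from collections import defaultdict

# Hypothetical access-log entries: (requesting host, time in seconds, object).
LOG = [
    ("client-a", 0, "/index.html"),
    ("client-a", 1, "/logo.gif"),
    ("client-a", 1, "/banner.gif"),
    ("client-a", 40, "/products.html"),
    ("client-a", 41, "/photo.gif"),
    ("client-b", 100, "/index.html"),
    ("client-b", 101, "/logo.gif"),
    ("client-b", 101, "/banner.gif"),
]

SESSION_TIMEOUT = 1800  # assumed idle gap (seconds) that ends a session


def count_hits_and_sessions(log, timeout=SESSION_TIMEOUT):
    """Return (hits, sessions).

    A session is approximated as a run of requests from one host in which
    no two consecutive requests are more than `timeout` seconds apart.
    """
    times_by_host = defaultdict(list)
    for host, ts, _obj in log:
        times_by_host[host].append(ts)

    sessions = 0
    for times in times_by_host.values():
        times.sort()
        sessions += 1  # the first request from a host opens a session
        for prev, cur in zip(times, times[1:]):
            if cur - prev > timeout:
                sessions += 1  # a long idle gap opens a new session
    return len(log), sessions


hits, sessions = count_hits_and_sessions(LOG)
print(f"hits={hits} sessions={sessions}")
```

In this toy log, two visitors viewing three pages between them produce eight logged object requests, because each page pulls in its image files as separate requests.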
The widespread use of gateways makes the access logs of WWW sites even less accurate, in that the access requests from numerous clients are routed through one or a small number of gateways to the WWW site. As a result, requests listed in a WWW site's access log where the gateway is the listed requestor may actually represent numerous distinct client sessions, even though the requests all come from the same client computer (the gateway) and even if those requests are received over a relatively short period of time that would otherwise be consistent with a single client session.
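As a rough illustration of this effect, the sketch below rewrites requests from several users so that each appears under a single gateway address, as the server would log them. The user names, the gateway host name, and the helper `as_logged` are hypothetical and exist only for this example.

```python
# Hypothetical requests: (true client, time in seconds, object). The server
# never sees the true client; it records only the gateway's address.
REQUESTS = [
    ("user-1", 0, "/index.html"),
    ("user-2", 5, "/index.html"),
    ("user-3", 9, "/news.html"),
    ("user-1", 12, "/news.html"),
]

GATEWAY = "gateway.example.com"  # assumed gateway host name


def as_logged(requests, gateway=GATEWAY):
    """Return the entries as the server's access log would record them."""
    return [(gateway, ts, obj) for _client, ts, obj in requests]


true_clients = {client for client, _ts, _obj in REQUESTS}
logged_hosts = {host for host, _ts, _obj in as_logged(REQUESTS)}

print(f"true clients: {len(true_clients)}, hosts in log: {len(logged_hosts)}")
```

Because all four requests arrive within seconds of one another from one address, any session count keyed on the requesting host would treat three users as a single session.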
A second phenomenon, called caching, further reduces the accuracy of WWW site log files. Caching is the temporary storage of the files for recently accessed WWW pages. Most gateways make extensive use of WWW page caching so as to reduce the number of object requests that need to be issued to WWW sites, especially during peak usage periods when many users of the gateway are requesting the same WWW pages.
Statistically, certain WWW pages tend to be much more popular than others, especially the home pages of popular Web sites, and those pages tend to be cached by gateways, as well as other computers, thereby greatly reducing the number of object request entries in the corresponding Web site log files compared to the number of client sessions actually accessing those popular pages. Note that many client sessions handled by gateways will access Web pages from gateways' caches, and thus the true entry point object requests for those sessions will often not appear in the corresponding Web site log files. Furthermore, for client sessions that access only the most popular pages at a Web site during peak usage periods, it is quite possible that all the accessed pages will be cached pages, and thus no object requests at all for such client sessions will appear in the corresponding Web site log files.
Web page caching is performed not only by gateways, but also by many local area networks as well as by individual desktop computers. As with caching by gateways, such caching tends on a statistical basis to reduce the number of log file entries for the most popular Web pages.
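The distortion caused by caching can be sketched with a small simulation. It assumes an idealized cache that never expires or evicts an object, so only the first request for each object reaches the Web site; real caches behave less simply, and the request counts below are invented for illustration.

```python
from collections import Counter


def serve_through_cache(requests):
    """Return the object requests that reach the Web site's log.

    Assumes an idealized cache with no expiry or eviction: only the first
    request for each object is forwarded; later ones are served locally.
    """
    cached = set()
    logged = []
    for obj in requests:
        if obj not in cached:
            cached.add(obj)
            logged.append(obj)
    return logged


# Hypothetical demand from many users behind one caching gateway.
REQUESTS = ["/index.html"] * 50 + ["/news.html"] * 10 + ["/rare.html"]

actual = Counter(REQUESTS)
logged = Counter(serve_through_cache(REQUESTS))

for obj in actual:
    print(f"{obj}: requested {actual[obj]} times, logged {logged[obj]} times")
```

Under these assumptions the home page is requested fifty times as often as the rarely visited page, yet both appear exactly once in the site log, so the log understates popular pages far more than unpopular ones.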
It is an object of the present invention to convert Web site log files into expanded log files that compensate, on a statistically accurate basis, for object requests not included in the site log files.
It is another object of the present invention to generate expanded Web site log files that accurately represent the relative distribution of actual client requests for the different objects at a Web site and to thereby overcome the log file inaccuracies caused by object caching by gateways, network servers and other computers.
It is a further object of the present invention to assign object requests in the expanded log files to synthesized client sessions so as to represent, in a statistically accurate manner, the number of client sessions accessing a Web site and the distribution of objects accessed by those client sessions.
Another object of the present invention is to generate analyses of Web site usage based on an expanded log file that represents in a statistically accurate manner the information access patterns of the clients of the Web site.