As the use of the internet continues to grow, and the provision and use of electronic business solutions rapidly increases, the requirement for organisations to understand the effectiveness of their websites is growing in importance. While there are a number of techniques available for analysing site usage, it remains difficult to track an individual visitor to a website through the pages they visited, particularly for websites consisting entirely or primarily of static HTML pages.
There are significant reasons for wanting to track individuals' navigation within a website, related to understanding the way the website is being used:
Firstly, by analysing the sequences of pages visited by each individual, a pattern of how visitors navigate through the site can be formed. This can be extremely useful in understanding why certain pages appear more popular than others. For example, it may be found that certain areas of the site are very rarely visited, and the visits to those pages are only made via tortuous navigation paths through other pages. This would indicate a problem with the website design which can be addressed to enable easier navigation to all parts of the site. Alternatively, it may be found that the rarely visited pages are found via a fairly direct route. This would tend to indicate either that the pages themselves are simply not of interest, or that the links to them are poorly worded or positioned, thereby failing to attract visitors.
By examining the common paths through the site, it may also be possible to identify different types of visitor. For example, expert users, casual browsers, people with a keen interest in a particular area, and electronic crawler agents might all visit the site and have very different navigation patterns. By identifying these different patterns, modifications might be made to the site design to attract primarily those with a keen interest, perhaps through new navigation links from top-level pages.
Secondly, by analysing the associations between pages visited within a browsing session on a website, a picture of the types of visit can be formed. This might indicate general browsing, in which many top-level pages are visited but few pages containing any detail are accessed; detailed browsing, in which detail pages are accessed across the whole site; or specific information gathering, in which a particular area of the website is visited, including much detailed information. Other patterns based on these may also be observed. By examining these patterns, together with the pages from which visitors exit the site, the website owners can gain valuable insight into the reasons for people visiting the site, and perhaps into whether those visits appear successful.
A more detailed examination of page associations might highlight interesting correlations between parts of the site. For example, a financial services organisation's website might contain separate areas for corporate finance, domestic insurance, general financial advice and personal banking. By examining the associations between pages visited in a single session, it would be possible to find out, for instance, what proportion of people using the personal banking services also accessed the general advice pages. Such insight into the way the site is used might both provide a better understanding of how the organisation should market its products and services, and enable improvements to the website design to allow better navigation between related areas.
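The proportion described in the financial services example can be sketched as follows. This is a minimal illustration only, and it assumes that requests have already been grouped into browsing sessions. The session identifiers, the URL paths and the function names `area_of` and `share_also_visiting` are all hypothetical and are not taken from any particular system.

```python
from typing import Dict, List

# Hypothetical data: each session maps to the URL paths requested in one visit.
sessions: Dict[str, List[str]] = {
    "s1": ["/banking/accounts.html", "/advice/glossary.html"],
    "s2": ["/banking/loans.html"],
    "s3": ["/insurance/home.html", "/advice/pensions.html"],
    "s4": ["/banking/accounts.html", "/advice/glossary.html", "/banking/loans.html"],
}


def area_of(path: str) -> str:
    """Return the top-level site area, e.g. '/banking/x.html' -> 'banking'."""
    return path.strip("/").split("/")[0]


def share_also_visiting(all_sessions: Dict[str, List[str]],
                        source_area: str, target_area: str) -> float:
    """Fraction of sessions touching source_area that also touch target_area."""
    areas_per_session = [{area_of(p) for p in pages} for pages in all_sessions.values()]
    with_source = [areas for areas in areas_per_session if source_area in areas]
    if not with_source:
        return 0.0
    return sum(1 for areas in with_source if target_area in areas) / len(with_source)


# Of the three sessions that used personal banking, two also read advice pages.
proportion = share_also_visiting(sessions, "banking", "advice")
```

The calculation is only as reliable as the grouping of requests into sessions.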
Put together with the analysis of navigation paths, it would even be possible to determine that, for example, a significant number of visitors repeatedly jumped between the personal banking services and the financial advice pages to find definitions of terms they did not understand. By providing quick links to this information, the website could be made much more accessible to these visitors, thereby improving the marketability of the services.
The most common mechanism currently available for analysing website usage is the examination of the server logs produced by a web server. These logs typically record the details of each request made on the server, in terms of where the request came from, what the request was and how it was responded to. This information would usually include:
• the IP address of the computer from which the request was received,
• the URL requested and the method of the request (usually HTTP GET or HTTP POST),
• the date and time of receipt of the request,
• a response code (indicating ‘page served’, ‘page already cached’, ‘page not found’, ‘unauthorised access’ etc.),
• the number of bytes served in response to the request,
• optionally, depending on the web server configuration, the URL of the page from which the request was referred, for example the page from which a hyperlink was followed to make this request,
• optionally, depending on the web server configuration, the characteristics of the computer on which the response will eventually be displayed, in terms of browser and operating system name and version.
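As an illustration of the fields listed above, the following sketch parses one log entry. The text does not name a log format, so this assumes the widely used Apache Combined Log Format. The sample line, the field names and the function `parse_line` are illustrative assumptions.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed layout: Apache Combined Log Format. The referrer and user-agent
# fields are optional, matching the "depending on configuration" items above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one log entry, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means no body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# Hypothetical sample entry using documentation-reserved addresses and names.
sample = (
    '192.0.2.10 - - [10/Oct/2000:13:55:36 +0000] '
    '"GET /banking/accounts.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/index.html" "Mozilla/4.08 [en] (Win98)"'
)
record = parse_line(sample)
```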
While these server logs provide a lot of useful information about the pages served, the number and type of failures and perhaps the computers being used to browse the Website, there are two major problems:
• Every single request coming into the Web server is logged, whether it be for an important page on the site, a minor page of little relevance or even an image to be displayed somewhere on an already served page. This means that the logs get very large and contain a lot of clutter which can hide the important information.
• More importantly, the IP addresses logged are those from which the Web server receives the request. In the vast majority of cases, people using the World Wide Web access it via a proxy server owned by their Internet Service Provider (ISP). Many other users access the Web via their employer's proxy server. When examining server logs, therefore, certain IP addresses are repeated very regularly, these being the proxy servers of the most popular ISPs and, to a lesser extent, large employers. This means that there is no information to tie any request to a particular computer on which the response will be displayed, and there is no means of tying any two requests together to say that they came from the same Website visitor. Any analysis of site navigation as described above is therefore at best based on guesswork to try to match requests together, and at worst impossible.
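The proxy problem can be made concrete with a small sketch. The entries below are invented for illustration. The `true_visitor` field is an assumption added only to show what the server cannot see, and it would not appear in a real server log.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical requests. Visitors A and B share one ISP proxy, so the server
# logs the same IP address for both of them.
requests = [
    {"ip": "198.51.100.7", "url": "/index.html", "true_visitor": "A"},
    {"ip": "198.51.100.7", "url": "/insurance/home.html", "true_visitor": "B"},
    {"ip": "198.51.100.7", "url": "/banking/accounts.html", "true_visitor": "A"},
    {"ip": "203.0.113.5", "url": "/index.html", "true_visitor": "C"},
]


def group_by(entries: List[dict], key: str) -> Dict[str, List[str]]:
    """Group requested URLs by the given field, preserving request order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        groups[entry[key]].append(entry["url"])
    return dict(groups)


by_ip = group_by(requests, "ip")                 # what the log allows
by_visitor = group_by(requests, "true_visitor")  # what the analysis needs
```

Grouping by IP address yields two apparent visitors where there are three, and the path reconstructed for the proxy address interleaves the requests of visitors A and B.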
There are two main ways of marking requests to identify them with a browser session.
• The first is to use ‘cookies’. These are server-generated identifiers stored on the computer of the person browsing the site, which are sent to the server with each request. However, many Web servers are only capable of serving static HTML pages and cannot make use of cookies. For the majority of current Websites, therefore, the use of cookies has not been possible.
• The alternative is to use URL rewriting, in which a unique session identifier is attached to each request as part of the URL. In order to make use of this technique, any hyperlink within a page must have a dynamic element which enables a session identifier, once generated, to be encoded within the URL of any request made by clicking on that hyperlink. For an existing Website which does not currently have this built in, this would involve considerable effort in modifying each page of the Website to enable it.
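The URL rewriting approach can be sketched as follows, to show the kind of dynamic element each hyperlink would need. This is a simplified illustration under stated assumptions: the query parameter name `sid`, the function names and the regular expression used to locate `href` attributes are all hypothetical, and a real implementation would need a proper HTML parser.

```python
import re
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAM = "sid"  # hypothetical parameter name
HREF = re.compile(r'href="([^"]*)"')  # simplistic; assumes double-quoted hrefs


def new_session_id() -> str:
    """Generate a unique session identifier."""
    return uuid.uuid4().hex


def add_session(url: str, session_id: str) -> str:
    """Attach the session identifier to a same-site URL as a query parameter."""
    parts = urlsplit(url)
    if parts.scheme or parts.netloc:
        return url  # leave links to other sites untouched
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != SESSION_PARAM]
    query.append((SESSION_PARAM, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def rewrite_links(html: str, session_id: str) -> str:
    """Encode the session identifier into every same-site hyperlink of a page."""
    return HREF.sub(lambda m: 'href="%s"' % add_session(m.group(1), session_id), html)


page = ('<a href="/advice/glossary.html">Glossary</a> '
        '<a href="http://other.example/">Other</a>')
rewritten = rewrite_links(page, "abc123")
```

Every page served must pass through a step like `rewrite_links`, which is why the technique is costly to retrofit to a site of static HTML pages.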
Therefore, for the majority of current Websites on conventional Web servers, no satisfactory solution is known for identifying and logging a sequence of requests to a Web server from the same browser. The available solutions require considerable effort to modify the Web site or the Web server.
U.S. Pat. Nos. 5,751,956 and 5,870,546 disclose a solution to the problem of tracking user selection of specific hyperlinks to remote servers, such as when a user clicks an advertising link within a displayed Web page to jump to the advertiser's Website, to measure the effectiveness of the advertisement. A significant problem when tracking links between different sites is that the server of a page which includes an advertisement hyperlink is typically not involved in a subsequent independent Browser transaction in which the advertised page is requested. Since no single server is involved in the full sequence of Website accesses, there is no server which is able to track the user's navigation between sites. This problem is solved by inserting specific modified hyperlinks into Web pages. A Web server provides to a client system a Web page which includes a hyperlink encoded with redirection and accounting data. When a user selects the hyperlink, the Web server receives from the client system a predefined URL reference including the encoded data. This is then decoded, the accounting data is stored and a redirection message is sent back to the client system.
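The general idea of a hyperlink carrying redirection and accounting data can be sketched as follows. This is not the encoding used in the cited patents, which is not reproduced here. The path `/redirect`, the parameter names `to` and `acct`, and both function names are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit

REDIRECT_PATH = "/redirect"  # hypothetical predefined URL on the tracking server

accounting_log: List[str] = []  # stands in for persistent accounting storage


def make_tracked_link(target_url: str, accounting_id: str) -> str:
    """Build a hyperlink that encodes the destination and the accounting data."""
    return REDIRECT_PATH + "?" + urlencode({"to": target_url, "acct": accounting_id})


def handle_request(request_url: str) -> Tuple[int, Dict[str, str]]:
    """Decode a tracked link, store the accounting data and issue a redirect."""
    parts = urlsplit(request_url)
    if parts.path != REDIRECT_PATH:
        return 404, {}
    params = parse_qs(parts.query)
    if "to" not in params or "acct" not in params:
        return 400, {}
    accounting_log.append(params["acct"][0])
    return 302, {"Location": params["to"][0]}


link = make_tracked_link("http://advertiser.example/offer.html", "banner-42")
status, headers = handle_request(link)
```

Because the click first returns to the server that supplied the page, that server can record it before the browser is sent on to the advertiser's site.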
Thus, U.S. Pat. No. 5,751,956 and U.S. Pat. No. 5,870,546 focus on the problems of tracking links between sites to enable measurement of advertising effectiveness, and solve this by means of a server process which creates a new form of encoded hyperlink and which subsequently decodes and processes encoded data for redirection and accounting. The only disclosure of tracking a user's navigation within a single site is a suggestion (in column 3) that access counters using CGI programs provide a reasonable manner of accounting for single-server Web page accesses. Although certain problems with CGI programs are described, there is no disclosure of the problems addressed by the present invention. Column 4 discloses a mechanism for URL redirection but it is suggested that this mechanism precludes tracking of the user's navigation, and additional problems are identified without a disclosure of solutions.
International patent application WO99/57865 similarly relates to tracking user selection of links to resources which are external to the tracking server system.
U.S. Pat. Nos. 5,712,979, 5,717,860 and 5,812,769 relate to tracking the navigation path of a user when linking from a first Web site to a second Web site. A URL received at the second Web site includes an identification of the first Web site. A destination Web page is determined for the user, and a code identifying the first Web site is attached to a Web page link associated with the destination Web page. The destination Web page including this code is transmitted to the user. This attaching of navigational history information allows determination of the previous Web site visited by the user.
None of the identified prior art discloses a solution to the problem of how to identify and log a sequence of requests to a specific Web server from a Web Browser, which differentiates between different users even if they access the Web via a common proxy server, and which does not require major modifications to the large number of current Web sites or servers which do not support cookies or dynamic encoding of URLs.