1. Field of the Invention
This invention relates generally to computer network communications and, more particularly, to analysis of Internet traffic to a website.
2. Related Art
The best-known computer network in use today is probably the Internet. When a network user begins a communication session over the Internet, the user can request data files from an Internet-connected computer called a file server or website server. The file server provides data files, typically comprising website pages, that are requested by the user. The web pages are typically written in a type of markup code called hypertext mark-up language (html), and can be viewed or displayed through a graphical user interface (GUI) program. The network connections and collection of html files are commonly referred to as the “World Wide Web” and the GUI program to view the files is called a web browser. A collection of related files under a common Internet network domain location is commonly referred to as a website. The files available at a website can include a mixture of text, image, video, and audio data.
The Internet is being used more and more often for commercial purposes. Thus, websites have become an important means for businesses and individuals to disseminate new product and service information, public relations news, advertising, and even to transact business by receiving orders and payments over the Internet. Many Internet users who maintain websites carefully monitor the pattern of web browser requests for files, or web pages, from their website. As with most forms of advertising, a goal of providing a website is to have a large number of visitors to the website to view the commercial presentation. As a result, various software programs and monitoring services have been developed to track such file requests, which are generally referred to as web traffic, and to produce website traffic analysis reports. The Internet users who maintain websites will be referred to generally as website owners.
Some website traffic analysis tools comprise software that collects traffic statistics from the computer at which the website files are stored, or hosted. Website owners typically have, at a minimum, access to the computer on which their sites are hosted. Some website owners maintain their website on their own computer systems, but many lease storage space on a remote computer system from a web-hosting provider. Web hosting providers typically impose restrictions on the use of their systems, so that installation of third party software on the website hosting computer, or webserver, is usually precluded. This limits the appeal of website traffic analysis software that must be installed at a website hosting computer.
There are generally two common types of website analysis tools available: log-based tools and Internet-based tools.
Log-Based Traffic Analysis
Most website file servers can be configured to store information in a log file for every website page request they receive. Statistics concerning every request for a page from the website are recorded in the log file, thereby creating a record that can be analyzed to produce a website traffic report. The statistics typically include time of day, browser location, referring link, and the like. The creation of the log file will occur automatically, as html documents are requested by browsers. The log file can be analyzed periodically to process and summarize the collected statistics. The steps for retrieving html documents from a website are summarized below in Table 1.
Table 1
Summary of steps for retrieving html documents from a website with logging.
1. A web browser sends a request to a web file server for an html document.
2. The file server receives the request from the browser.
3. The file server returns the requested html document to the web browser.
4. The file server logs the transaction to a log file.
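The steps of Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory document store, the function name handle_request, and the log entry fields are hypothetical and are not taken from any particular file server.

```python
from datetime import datetime, timezone

# Hypothetical in-memory "website": request path -> html document.
DOCUMENTS = {"/index.html": "<html><body>Home</body></html>"}


def handle_request(path, client_addr, referrer, log, now=None):
    """Steps 2-4 of Table 1: receive a request, return the document, log it."""
    now = now or datetime.now(timezone.utc)
    # Steps 2 and 3: look up the requested html document.
    body = DOCUMENTS.get(path)
    status = 200 if body is not None else 404
    # Step 4: record the transaction (time of day, browser location, referring link).
    log.append({
        "time": now.isoformat(),
        "client": client_addr,
        "path": path,
        "referrer": referrer,
        "status": status,
    })
    return status, body


# Step 1 is simulated by calling the handler directly.
access_log = []
status, body = handle_request(
    "/index.html", "192.0.2.10", "http://example.com/", access_log
)
```

Every call appends exactly one entry to the log, which is what makes the log file a complete record when all requests reach the file server.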
A common Internet device called a proxy server or web cache often complicates the process of creating a log file in response to html requests. A proxy server stores static web content (html data files) at a central location on the Internet to improve web performance by decreasing response times and also to decrease capacity requirements at individual web file servers by distributing the storage load. If a web browser is communicating via a proxy server, then the steps required to obtain an html document are as listed in the Table 2 summary below:
Table 2
Page request via proxy server.
1. A web browser sends a request for an html document to a website-hosting computer.
2. The html request is redirected to a proxy server.
3. The proxy server computer receives the request.
4. The proxy computer checks to see if it has the requested html document in its cache storage.
5. If the document is cached, the proxy server provides the html document back to the browser. If the page is not in the proxy's cache, it requests the page from the page-hosting web file server.
6. The web file server receives the request for the html document from the proxy server.
7. The web file server provides the html document.
8. The web file server computer records information about the transaction in a log file.
9. The proxy server caches the html document.
It should be noted that Step 5 may result in a proxy server or web cache serving the html document directly to the web browser. In this case the webserver never receives a request and thus never records the request in its log file. Some proxies may notify the web server of these requests, depending on proxy and web-server configurations.
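The undercounting effect described above can be illustrated with a small simulation. The classes OriginServer and CachingProxy are hypothetical stand-ins for a web file server and a proxy server; the sketch assumes a proxy that never notifies the web server of cache hits.

```python
class OriginServer:
    """Stand-in for the web file server; it logs every request it receives."""

    def __init__(self, documents):
        self.documents = documents
        self.log = []

    def get(self, path, requester):
        document = self.documents[path]
        self.log.append((requester, path))  # Table 2, step 8
        return document


class CachingProxy:
    """Stand-in for a proxy server that caches html documents."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, path, browser):
        if path in self.cache:
            # Step 5, cached case: the origin server is never contacted.
            return self.cache[path]
        document = self.origin.get(path, requester="proxy")  # steps 5-7
        self.cache[path] = document  # step 9
        return document


origin = OriginServer({"/index.html": "<html>Home</html>"})
proxy = CachingProxy(origin)

browsers = ("browser-a", "browser-b", "browser-c")
for browser in browsers:
    proxy.get("/index.html", browser)

browser_requests = len(browsers)
logged_requests = len(origin.log)
```

Three browsers request the page, but only the first request reaches the web file server, so its log file records a single transaction.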
The log-based tools that exist today examine logs to form a statistical analysis of a website. Log analysis tools are run locally on the computer or on a system to which the log files are transferred. The steps that are taken to analyze the logs are summarized below in Table 3.
Table 3
Log Analysis for Web Traffic.
1. Read the log file.
2. Analyze the log entries.
3. Form traffic analysis results.
4. Save the results to disk.
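The four steps of Table 3 might look as follows in Python. The log format used here (one line per request, holding a timestamp, a page path, and a referring link separated by whitespace) is an assumption made for illustration; actual file servers use their own log formats, and the function name analyze_log is hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def analyze_log(log_path, report_path):
    """Read a log file, tally requests, and save a summary report as JSON."""
    page_counts = Counter()
    referrer_counts = Counter()
    # Step 1: read the log file.
    with open(log_path, encoding="utf-8") as log_file:
        for line in log_file:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip lines that do not match the assumed format
            # Step 2: analyze the log entry.
            _timestamp, path, referrer = fields
            page_counts[path] += 1
            referrer_counts[referrer] += 1
    # Step 3: form the traffic analysis results.
    report = {
        "requests": sum(page_counts.values()),
        "pages": dict(page_counts),
        "referrers": dict(referrer_counts),
    }
    # Step 4: save the results to disk.
    with open(report_path, "w", encoding="utf-8") as report_file:
        json.dump(report, report_file, indent=2)
    return report


# Demonstration with a small, made-up log file in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "access.log")
    report_path = os.path.join(workdir, "report.json")
    with open(log_path, "w", encoding="utf-8") as f:
        f.write(
            "2000-01-01T09:00:00 /index.html http://example.com/\n"
            "2000-01-01T09:00:05 /products.html /index.html\n"
            "2000-01-01T09:01:00 /index.html -\n"
        )
    report = analyze_log(log_path, report_path)
    with open(report_path, encoding="utf-8") as f:
        saved = json.load(f)
```

Note that the analysis runs only when the function is called, which reflects why such tools do not provide a real-time view of the traffic.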
There are several drawbacks to using log analysis tools. For example, if the log analysis tools are installed locally on the website file server, then the website owner can only use tools that are available for the platform of the hosting file server computer. These tools typically do not provide a real-time view of the website traffic data; rather, they simply analyze the traffic log data files when the tools are executed. In addition, website owners who operate very high-traffic sites often disable logging, because of inadequate computer resources to operate both the file server function and the traffic logging function. This severely reduces the effectiveness of log-based analysis. As noted above, website owners who do not host their own sites may need to obtain permission to install traffic analysis software at their hosting provider. Finally, the system resources required to perform the log analysis function are such that log analysis can impede website performance, slowing down responses to visitors who want to view the website.
If the website owner chooses to transfer the log files to another system for processing, then there are also drawbacks to that technique. For example, transferring the traffic logs is a cumbersome process in that a single day's log file for a busy website often represents hundreds of megabytes of data. Moreover, a separate system is required for receiving the transferred log files.
Internet-Based Analysis
Rather than using web traffic logs, website owners may choose to use the services of an Internet-based traffic analysis service. Such services typically are notified of each website visit through Internet communications features of the http specification and prepare statistical reports for client website owners.
The Internet-based analysis typically relies on special html code inserted into the home page of a website. Web html documents usually contain a mixture of text and graphical elements that are only retrieved by opening subsequent connections and retrieving files, such as images or frames. These graphical elements are displayed in the user's browser by retrieving the source location attribute (also referred to as the source Uniform Resource Locator, or URL) of the graphical element, opening a connection to the source location, and retrieving the data bytes that comprise the image file.
Internet-based analysis tools can count requests for a source file regardless of the actual location of the source file. Thus, a graphical element can be retrieved from a server that is in a different location from the primary website file server. The html code embedded by the Internet-based analysis tools usually contains a graphical element, as described above, with a source location attribute directing a browser to the computers of the traffic analysis service. When the website home page is visited, all of the images on the home page will be requested by the browser, including the graphical element source. Headers that are automatically supplied with the browser request will reveal information about the website visitor.
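As a rough illustration of the embedded html code, the following Python sketch builds an image element whose source location points at a traffic analysis service. The host name stats.example.com, the file name hit.gif, the site query parameter, and the function name tracking_snippet are all hypothetical; actual services define their own URLs and parameters.

```python
from html import escape
from urllib.parse import urlencode


def tracking_snippet(service_url, site_id):
    """Return an html image element whose source is the analysis service."""
    # The site identifier tells the (hypothetical) service which client
    # website the request should be counted for.
    src = f"{service_url}?{urlencode({'site': site_id})}"
    return f'<img src="{escape(src, quote=True)}" width="1" height="1" alt="">'


snippet = tracking_snippet("http://stats.example.com/hit.gif", "my-site")
```

A website owner would paste the resulting element into the home page, so that every browser displaying the page also requests the image from the traffic analysis computers.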
The information about the website visitor is gathered and stored at the traffic analysis computers, and then the requested data, such as a graphical image, is returned to the website visitor's browser. Those skilled in the art will be familiar with so-called Internet “cookies”, which can be used to timestamp the website visitor. The timestamp value can uniquely identify a website visitor. The html code sometimes contains scripts that gather more information (such as monitor resolution) about the website visitor. This additional information is typically included with the request headers. In this way, Internet-based traffic analysis tools can provide traffic statistics in real-time by recording each request for a website page when the request occurs. That is, each website visit generates a “hit” statistic that can be accumulated from all the visitors to produce a count of requests for the website home page. These hits are interpreted as a count of the number of visits to the website.
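The handling of such a request at the traffic analysis computers can be sketched as follows. This is a simplified model rather than a working web service: request headers are passed in as a plain dictionary, the cookie name visitor and the function name handle_beacon are hypothetical, and PIXEL is a placeholder standing in for the bytes of a real image file.

```python
import time
from http.cookies import SimpleCookie

# Placeholder bytes standing in for a small image file (not a complete image).
PIXEL = b"GIF89a"


def handle_beacon(headers, hits, now=None):
    """Record one hit from the request headers and return the image data."""
    now = time.time() if now is None else now
    cookies = SimpleCookie(headers.get("Cookie", ""))
    set_cookie = None
    if "visitor" in cookies:
        # Returning visitor: reuse the timestamp stored in the cookie.
        visitor_id = cookies["visitor"].value
    else:
        # New visitor: use the current timestamp as the identifying value.
        visitor_id = str(now)
        set_cookie = f"visitor={visitor_id}"
    # The hit is recorded when the request occurs, giving a real-time count.
    hits.append({
        "visitor": visitor_id,
        "page": headers.get("Referer", "-"),
        "browser": headers.get("User-Agent", "-"),
        "time": now,
    })
    return PIXEL, set_cookie


hits = []
# First visit: no cookie is supplied, so one is issued.
_, issued = handle_beacon(
    {"Referer": "http://www.example.com/", "User-Agent": "ExampleBrowser/1.0"},
    hits,
    now=1000.0,
)
# Later visit: the browser sends the cookie back.
_, reissued = handle_beacon(
    {"Referer": "http://www.example.com/", "Cookie": issued},
    hits,
    now=2000.0,
)
hit_count = len(hits)
```

Each call adds one hit, and the accumulated list corresponds to the count of requests for the page on which the html code was placed.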
Unfortunately, Internet-based tools only collect statistics when a page that contains the html code is viewed, so that visits to all other pages are ignored. To alleviate this problem, some traffic analysis services allow the html code to be placed on multiple pages, thereby counting a website visitor whenever there is a request for any page on which the website owner has placed the html graphical element code. Thus, the Internet-based analysis tools can be an improvement over log-based tools, because the Internet tools do not require the transfer of large log files for processing. Unfortunately, typical Internet-based tools do not account for, and do not collect data on, multiple visits to a website by the same visitor. In addition, conventional Internet-based traffic analysis tools do not indicate the actual sequence, or path, followed by a website visitor from page to page of a website. The website path taken by visitors can be very important in identifying the most viewed or least popular pages of a website.
From the discussion above, it should be apparent that there is a need for efficient analysis of website traffic patterns with sufficient detail so that visits to each page of a website can be accounted for and a visitor's path through the pages of a website can be tracked, all without unduly taxing the resources of the website and traffic analysis computers. The present invention fulfills this need.