The Internet, fueled by the phenomenal popularity of the World Wide Web, has exhibited exponential growth over the past few years. It has gone from being a communication route primarily for scientists, researchers, and engineers to an essential information exchange vehicle for broad segments of the populace, including consumers, marketers, educators, children, and entertainers. Over one billion Web pages currently exist on the Internet, and over 40 million users read and interact with them. As the Internet's commercial value is recognized, numerous companies and organizations are experimenting with electronic commerce (also referred to as e-commerce), the buying and selling of goods, information, and services over the Internet (see, for example, IBM's Web site for Macy's at http://www.macys.com). And as more and more of these companies demonstrate the financial viability of electronic commerce, there has been increasing momentum to develop sites that transact business over the Web.
Any Web site owner needs to know whether the Web site effectively serves its intended purpose; that is, how many people visit the Web site, who these people are, what they want, and what they do at the site while they are there. This is particularly true of the domain of electronic commerce. The ability to analyze and understand traffic flow, the way customers navigate from page to page in a site, is critical for successful product marketing and sales.
The major source of user activity data available today is the Web server log. A Web server is the computer that sends World Wide Web documents to browsers upon request. The Web server log is a low-level, technical account of Web server activities and is generated by all commonly used Web servers. The Web server log consists of a file containing an entry for each Web page served, showing the IP (Internet Protocol) address of the client (the machine of the user who is visiting the Web site using an application); a timestamp indicating the exact date and time at which the visit occurred; the URL (Uniform Resource Locator) of the requested page; the referrer URL (the URL of the page that the user clicked on to get to the current page); the browser type; and the number of bytes transferred.
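As an illustration of the log entry fields listed above, the following minimal Python sketch parses a single entry. It assumes the entry follows the NCSA Combined Log Format; the text names the fields but not a specific format. The function name parse_log_entry and the sample line are hypothetical.

```python
import re
from datetime import datetime

# Assumption: entries follow the NCSA Combined Log Format. The source text
# lists the fields but does not name a concrete format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)


def parse_log_entry(line):
    """Return a dict of the fields in one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp text (e.g. 10/Oct/1999:13:55:36 -0700) to a datetime.
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    # A "-" in the size field means no bytes were transferred.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Hypothetical sample entry, for illustration only.
sample = (
    '192.0.2.10 - - [10/Oct/1999:13:55:36 -0700] '
    '"GET /catalog/shoes.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/index.html" "Mozilla/4.08 [en] (Win98; I)"'
)
entry = parse_log_entry(sample)
print(entry["ip"], entry["url"], entry["referrer"], entry["bytes"])
```

Each parsed entry exposes the client IP address, timestamp, requested URL, referrer URL, browser type, and byte count as named fields, which is the form that the analyses described in this section operate on.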
Various commercial products and freeware packages (for example, Accrue's Insight, Andromedia's ARIA, e.g. Software's WebTrends, and Aquas's Bazaar Analyzer) use Web server logs to analyze Web server and user activities and generate reports. Examples of the kind of information that is typically reported are the number of visitors at a Web site during a given time period, the most and least frequently visited pages, the most frequent entry pages (the first page a user visits during a session at a Web site), the most frequent exit pages (the last page a user visits during a session at a Web site), and the visitor demographic breakdown based on IP address and browser type. The URLs in the Web server log often contain special user identifiers obtained by using “cookies”. A cookie is a piece of information shared between a user's Web browser and a Web server, originating as a message sent by a Web server to the Web browser visiting the server's site, subsequently stored in a text file on the user's hard drive, and sent back to the server each time the browser requests a page from the server. From the sequence of URLs in the Web server log and the associated cookies, it is possible to reconstruct the URL paths that individual users traverse, and from this obtain the most frequently traversed paths through the Web site.
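The path reconstruction described above can be sketched as follows. This is a simplified Python illustration, not the method of any of the named products. It assumes that log entries have already been reduced to hypothetical (user_id, timestamp, url) tuples, where user_id is the cookie-derived identifier, and that all requests from one user belong to a single session.

```python
from collections import Counter, defaultdict


def reconstruct_paths(entries):
    """Group (user_id, timestamp, url) tuples by user and order each group by time."""
    by_user = defaultdict(list)
    for user_id, timestamp, url in entries:
        by_user[user_id].append((timestamp, url))
    # Sorting the (timestamp, url) pairs puts each user's requests in visit order.
    return {user: [url for _, url in sorted(hits)] for user, hits in by_user.items()}


def summarize(paths):
    """Count entry pages, exit pages, and complete paths across all users."""
    entry_pages = Counter(path[0] for path in paths.values())
    exit_pages = Counter(path[-1] for path in paths.values())
    path_counts = Counter(tuple(path) for path in paths.values())
    return entry_pages, exit_pages, path_counts


# Hypothetical log data; timestamps are simplified to integers.
entries = [
    ("u1", 2, "/shoes.html"), ("u1", 1, "/index.html"), ("u1", 3, "/checkout.html"),
    ("u2", 1, "/index.html"), ("u2", 5, "/shoes.html"), ("u2", 9, "/checkout.html"),
    ("u3", 2, "/sale.html"), ("u3", 4, "/shoes.html"),
]
paths = reconstruct_paths(entries)
entry_pages, exit_pages, path_counts = summarize(paths)
print(entry_pages.most_common(1))
print(exit_pages.most_common(1))
print(path_counts.most_common(1))
```

A practical analyzer would also split each user's requests into separate sessions, for example by an inactivity timeout; that step is omitted here for brevity.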
Some Web sites have collected more data about user behavior by using a form of URL rewriting. They append extra data about the user or requester to the URLs of the served Web pages, so that the data needed for their analysis appears in the server log. This method is usually used to add user-related data (e.g., user-id and session-id).
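One way such URL rewriting might look is sketched below in Python. The text does not specify how the extra data is encoded, so this sketch assumes the identifiers are carried as query-string parameters; the parameter names user-id and session-id follow the example above, while the functions tag_url and extract_tags are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_url(url, user_id, session_id):
    """Append user and session identifiers to a URL as query parameters."""
    parts = urlsplit(url)
    # Keep any parameters the URL already carries, then add the tags.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [("user-id", user_id), ("session-id", session_id)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def extract_tags(logged_url):
    """Recover the identifiers from a URL as it appears in the server log."""
    query = dict(parse_qsl(urlsplit(logged_url).query))
    return query.get("user-id"), query.get("session-id")


# Hypothetical link on a served page and the identifiers assigned to the visitor.
tagged = tag_url("/catalog/shoes.html?size=9", "u42", "s7")
print(tagged)
print(extract_tags(tagged))
```

Because the tagged URL is what the browser requests when the user follows the link, the identifiers are written to the server log along with the rest of the URL and can be recovered during log analysis.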
Advertising banner services have developed an interesting way to measure not only who clicked on their banners, but also who saw the banners. These figures are used to calculate not only the rate charged for the banners, but also the effectiveness of the banner, known as the conversion rate. The conversion rate is found by dividing click-throughs by impressions (the number of times that the banner was served and hopefully seen). Currently the prior art is able to determine conversion rates only for specific types of advertising banners. Some Web advertising services (e.g., Real Media's Open AdStream) record impressions and click-throughs by using script programs, that is, programs consisting mainly of strung-together commands, such as those one might issue at a command line. These services add a script program to the HTML image source tag, which points to the image displayed as the advertisement. (HTML, HyperText Markup Language, is the authoring language used to create documents on the World Wide Web. Tags are commands, generally specifying how a portion of a document should be formatted; tags can also refer to the links which allow users to move from one Web page to another.) In addition, these services add a script program to the anchor tag, the HTML tag which acts as a link to the advertised site. The first script is invoked when the advertised image is displayed and records its view; the second script is invoked when a visitor clicks on the image (to visit the advertised site) and records the click.
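The bookkeeping performed by the two scripts, and the resulting conversion rate, can be illustrated with the following Python sketch. It is not the implementation of any named advertising service. The class BannerStats and its methods are hypothetical stand-ins for the image-tag script (record_impression) and the anchor-tag script (record_click), and the counts are kept in memory rather than in a log or database.

```python
from collections import Counter


class BannerStats:
    """In-memory tally of impressions and click-throughs per banner."""

    def __init__(self):
        self.impressions = Counter()
        self.click_throughs = Counter()

    def record_impression(self, banner_id):
        # Stand-in for the script invoked when the banner image is served.
        self.impressions[banner_id] += 1

    def record_click(self, banner_id):
        # Stand-in for the script invoked when a visitor clicks the banner.
        self.click_throughs[banner_id] += 1

    def conversion_rate(self, banner_id):
        """Click-throughs divided by impressions, as defined in the text above."""
        served = self.impressions[banner_id]
        if served == 0:
            return 0.0  # Avoid dividing by zero for a banner never served.
        return self.click_throughs[banner_id] / served


# Hypothetical traffic: a banner served 200 times and clicked 5 times.
stats = BannerStats()
for _ in range(200):
    stats.record_impression("banner-a")
for _ in range(5):
    stats.record_click("banner-a")
rate = stats.conversion_rate("banner-a")
print(rate)
```

With 5 click-throughs against 200 impressions, the conversion rate as defined above is 5/200, or 2.5 percent.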